VNC connection to exec host in fully containerized cluster

Greetings!

I'm hoping for a little contextual insight into getting the OOD client to connect to a noVNC job. I'm still in the POC stages of development for our cluster. I have Job Composer and Jupyter Notebook running successfully, but noVNC-based images continue to be problematic. I'm trying to find that one "Rosetta Stone"-like Interactive App to make this all come together.

Important cluster-specific information:

Our cluster runs LSF 10, where every job must run inside a docker container, so any OOD requirements must exist within the image that the submitted job runs in. I'm currently attempting to set up Relion 3.1, as this is a currently supported application on our production cluster. If there's a better application to start with, I'm all ears.

On our cluster, job submission from an SSH terminal on a client host looks like this (grossly simplified):
LSF_DOCKER_PORTS='8080:8080' LSF_DOCKER_NETWORK=host LSF_DOCKER_IPC=host bsub -Is -R 'select[port8080=1] rusage[mem=4GB]' -q general-interactive -G compute-ris -a 'docker(us.gcr.io/ris-appeng-shared-dev/relion31-ood)' supervisord -c /app/supervisord.conf

"relion31-ood" is an image I've created with the websockify and TurboVNC OOD requirements built in. On our cluster, once the job is up, the browser connection is:

https://<host_fqdn>:8080/vnc.html

I have this configured in /var/www/ood/apps/sys/relion3 such that the job submits via batch_connect: vnc and runs, but the port remains unavailable (according to the logs). I also notice that ood_core/batch_connect/templates/vnc.rb specifically starts vncserver on a different port, so maybe my use case is out of sync with how batch_connect expects the server to connect?
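For reference, the batch_connect portion of my submit.yml.erb at this point is essentially just the stock VNC template selection (simplified sketch; the LSF/docker submission options that mirror the bsub command above are omitted):

batch_connect:
  template: "vnc"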

I’m probably just missing something obvious here in how OOD vnc expects to be able to connect to the host. Any help would be greatly appreciated.

Thanks!

Hey @shawn.m.leonard, can you provide more logs about the port being unavailable? Are you generating the noVNC URL yourself? Double check that you’re passing all of the required parameters that noVNC needs: https://sourcegraph.com/github.com/OSC/ondemand/-/blob/apps/dashboard/app/helpers/batch_connect/sessions_helper.rb#L184

noVNC is client-side only; the browser itself needs to connect to websockify directly.
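To illustrate with placeholders (not your actual values): the browser loads vnc.html from the OnDemand host, and the path parameter routes the WebSocket connection through OnDemand's rnode reverse proxy to websockify on the exec host, roughly like:

https://<ondemand_host>/pun/sys/dashboard/noVNC-1.1.0/vnc.html?autoconnect=true&password=<vnc_password>&resize=remote&path=rnode%2F<exec_host>%2F<websockify_port>%2Fwebsockify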

If you want a deeper look into how noVNC works in OnDemand, here’s a thread from the other day: Connecting to static VNC server

That deeper look is VERY helpful, thank you. It's pretty clear the main problem here is that I'm not fully grasping the whole system while trying to tailor it to our requirements.

Here's the output log, though now I'm thinking I misinterpreted the log's port information. I still have an "after" script configured with a timeout just to keep jobs from sitting idle:

[97c7657a-46f2-4e9f-9e88-f185de86f8d5]$ cat output.log
latest: Pulling from ris-appeng-shared-dev/relion31-ood
Digest: sha256:ab4d7ca9475c5c0b2abcda41ff49bd6346db58294972c1f7f40ccf1f682600cf
Status: Image is up to date for us.gcr.io/ris-appeng-shared-dev/relion31-ood:latest
us.gcr.io/ris-appeng-shared-dev/relion31-ood:latest
WARNING: Published ports are discarded when using host network mode
Setting VNC password…
Starting VNC server…

Desktop 'TurboVNC: compute-sleonard-exec-4.c.ris-sleonard.internal:1 (shawn.m.leonard)' started on display compute-sleonard-exec-4.c.ris-sleonard.internal:1

Log file is vnc.log
Successfully started VNC server on compute-sleonard-exec-4.c.ris-sleonard.internal:5901…
Script starting…
Waiting for server to open port 8700…
TIMING - Starting wait at: Tue Dec 8 17:39:02 UTC 2020
TIMING - Starting main script at: Tue Dec 8 17:39:02 UTC 2020
TIMING - Starting at: Tue Dec 8 17:39:02 UTC 2020

  • supervisord -c /app/supervisord.conf
    2020-12-08 17:39:03,405 INFO Included extra file "/app/conf.d/fluxbox.conf" during parsing
    2020-12-08 17:39:03,405 INFO Included extra file "/app/conf.d/websockify.conf" during parsing
    2020-12-08 17:39:03,405 INFO Included extra file "/app/conf.d/x11vnc.conf" during parsing
    2020-12-08 17:39:03,405 INFO Included extra file "/app/conf.d/xterm.conf" during parsing
    2020-12-08 17:39:03,405 INFO Included extra file "/app/conf.d/xvfb.conf" during parsing
    2020-12-08 17:39:03,409 INFO supervisord started with pid 72
    2020-12-08 17:39:04,412 INFO spawned: 'xvfb' with pid 110
    2020-12-08 17:39:04,417 INFO spawned: 'x11vnc' with pid 111
    2020-12-08 17:39:04,419 INFO spawned: 'fluxbox' with pid 112
    2020-12-08 17:39:04,423 INFO spawned: 'websockify' with pid 113
    2020-12-08 17:39:04,431 INFO spawned: 'xterm' with pid 114
    2020-12-08 17:39:05,663 INFO success: xvfb entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
    2020-12-08 17:39:05,663 INFO success: x11vnc entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
    2020-12-08 17:39:05,663 INFO success: fluxbox entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
    2020-12-08 17:39:05,663 INFO success: websockify entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
    2020-12-08 17:39:05,663 INFO success: xterm entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
    Timed out waiting for server to open port 8700!
    TIMING - Wait ended at: Tue Dec 8 17:40:06 UTC 2020

So, for this example:

#{websockify_cmd} -D ${websocket} localhost:${port}

websocket should be 8700 and port should be 5901? My submit.yml.erb selects a random port in the 8000s and sets it in the bsub so LSF can use it with docker run. Is there an env var I can set so that the Ruby code knows to use it?

echo "Starting websocket server..."
websocket=$(find_port)
#{websockify_cmd} -D ${websocket} localhost:${port} ``

I think it's find_port that I need to circumvent. I have an env var with the port already available inside the container; I just need a way for that Ruby code to use it instead of the result of find_port.

Hmm, although connection.yml looks good:
[b19fb7b5-7b5a-4f0e-b65b-9520c57115e1]$ cat connection.yml
my_new_port: 8719
host: compute-sleonard-exec-4.c.ris-sleonard.internal
port: 8719
password: 4M35bZjh
display: 1
websocket: 25923
spassword: r4u0H3yK

Well, headway I guess. I can get the job running on the host with a connect button created, which launches another browser tab that does show a noVNC window, but it fails with "Failed to connect to server". Here's the URL the button points to:

https://<client_host>/pun/sys/dashboard/noVNC-1.1.0/vnc.html?utf8=%E2%9C%93&autoconnect=true&path=rnode%2F<exec_host>%2F22793%2Fwebsockify&resize=remote&password=RKov4wP9&compressionsetting=6&qualitysetting=2&commit=Launch+Relion3

It looks OK to my untrained eyes. I had to cheese the process a little and run the websockify_cmd in my template/script.sh.erb to get around the find_port issue.

On further review, I'm returning to a port mismatch. If websockify is the path from client -> vncserver, it can't get there, because the server started on port 5901 but the docker container's exposed port is 8719. websockify is mapped correctly into the container, but the server inside is not running on that port. I don't see a way to tell vncserver which port to use when it starts.

I worked with @jeff.ohrstrom offline and found the solution. I'm sharing it here in case it helps anyone else in the future.
The key problem was following the port mappings from client -> websockify -> vncserver. As stated, our cluster runs every submitted job in a docker container, so the application config must generate a port before the job runs so that docker can open that port to the container running on the exec node. The key was forcing the find_port function in session.rb to use this port (let's call it "my_new_port"). To do so, the batch_connect config needs this:

batch_connect:
  template: "vnc"
  min_port: <%= my_new_port %>
  max_port: <%= my_new_port %>

With this set, and "my_new_port" exported in before.sh.erb, websockify starts with the correct port mapping through docker.
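To make the flow concrete, here's a rough sketch of how the port can travel through the files. This is only a sketch: the port-picking logic is a placeholder, the LSF/docker submission options are elided, and it assumes that custom keys under batch_connect are exposed to the template ERB files (which is how my_new_port ends up readable in before.sh.erb).

submit.yml.erb (sketch):

# placeholder port selection; replace with however your site picks/reserves a port
<%- my_new_port = rand(8000..8999) -%>
---
batch_connect:
  template: "vnc"
  min_port: <%= my_new_port %>
  max_port: <%= my_new_port %>
  my_new_port: <%= my_new_port %>
# plus whatever site-specific submission options pass this same port to
# LSF_DOCKER_PORTS so docker exposes it on the exec host

before.sh.erb (sketch):

# make the docker-exposed port visible to the rest of the batch_connect scripts
export my_new_port="<%= my_new_port %>"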