First of all, the OOD platform is really impressive - nice work OSC!
I’m sure it’s something simple, but I run into problems when starting a VNC session with LSF as the resource manager.
When starting it from the OOD interface, all the processes related to VNC get started.
However, in the frontend it keeps saying “Your session is currently starting… Please be patient as this process can take a few minutes.”, and the “Launch Desktop” button will never appear.
I tried it with SLURM and it worked; the issue only appears on LSF.
What am I missing?
Hi @rotaugenlaubfrosch, what does your $HOME/ondemand/data/sys/dashboard/batch_connect/sys/$VNC_APP_SUBDIR/output/$UUID/output.log say? Or the
This sounds similar to an issue I had when setting up OOD with LSF. In my case it was a race condition: if LSF didn’t show a submitted job via ‘bjobs’ immediately after submission, the startup procedure would never continue.
My solution was to write a wrapper script for the bjobs command that sleeps for a few seconds before running the actual ‘bjobs’ command, and then point to it in cluster.yml:

    job:
      bin_overrides:
        bjobs: "/path/to/bjobs/wrapper"
The delay was enough for the launched job to be found, and the startup worked afterwards.
Thank you very much for your answer - this looks promising.
Do you have an example of what the wrapper script looks like?
The wrapper I used is pretty straightforward:
    #!/bin/bash
    # Wrapper to sleep bjobs before running
    SLEEP=5
    OPERATION=/lsf/9.1/linux2.6-glibc2.3-x86_64/bin/bjobs

    # Run
    sleep "$SLEEP"
    exec "$OPERATION" "$@"
I created the wrapper script as you posted it, but it doesn’t seem to help.
In the wrapper script, bjobs gets executed correctly and the running job is visible.
However, OOD says that the job is starting, although it is already running on the node.
I noticed that there is an AJAX request every ~10 seconds, probably to check the status of the submitted job. The wrapper script also gets executed every 10 seconds.
Do you know how OOD checks if the job turned from pending to running?
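Since the wrapper already runs on every poll, one way to see what OOD actually receives is to have the wrapper log each call. The following is only a rough sketch: the function name `log_and_run`, the `BJOBS_WRAPPER_LOG` variable, and the log location are made up for illustration, and the `bjobs` path is the one from the wrapper posted above, which will differ per site.

```shell
#!/bin/bash
# Hypothetical debugging variant of the bjobs wrapper: records the arguments,
# stdout, and exit status of every call so the polling can be inspected later.

# Assumption: path to the real bjobs binary (taken from the wrapper above).
REAL_BJOBS="${REAL_BJOBS:-/lsf/9.1/linux2.6-glibc2.3-x86_64/bin/bjobs}"
# Assumption: any location the OOD user can write to.
LOG_FILE="${BJOBS_WRAPPER_LOG:-/tmp/bjobs_wrapper.log}"

log_and_run() {
  local out status
  # Capture stdout only; stderr passes through to the caller unchanged.
  out="$("$REAL_BJOBS" "$@")"
  status=$?
  {
    printf '%s args: %s\n' "$(date '+%F %T')" "$*"
    printf '%s\n' "$out"
    printf 'exit status: %s\n' "$status"
  } >> "$LOG_FILE"
  # Hand the unmodified output and exit status back to the caller.
  printf '%s\n' "$out"
  return "$status"
}

# When installed as the wrapper, finish the script with:
#   log_and_run "$@"
```

Comparing the logged arguments and output with what `bjobs` prints in an interactive shell may show whether the two differ.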
Please look in your log directory for an indication, and post/share any relevant info from that log.
It would be somewhere like this.
$VNC_APP_SUBDIR is maybe lsf_poc_desktop and $UUID is the ce25e64b... that you’ve just shared.
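If it is unclear which session directory is the right one, a small helper can pick the most recently modified output.log. This is a sketch under assumptions: the function name `latest_output_log` is made up, and the base directory is the one from the path quoted earlier in this thread.

```shell
#!/bin/bash
# Sketch: print the path of the newest output.log under the batch_connect
# data directory. Assumes paths without newlines, since it relies on `ls -t`.

latest_output_log() {
  # Assumption: default base directory as quoted earlier in the thread.
  local base="${1:-$HOME/ondemand/data/sys/dashboard/batch_connect/sys}"
  # Layout: <base>/<app subdir>/output/<session uuid>/output.log
  ls -t "$base"/*/output/*/output.log 2>/dev/null | head -n 1
}
```

For example, `tail -n 50 "$(latest_output_log)"` would show the end of the latest session log, if one exists.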
Also, just to clarify the situation: when you start this job, does it sit in the starting state forever? Or does it sit in that state and then eventually delete itself?
Also during this state, what does LSF itself say about the job? I mean from LSF’s perspective, what state is the job in, like
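To answer the LSF-side question quickly, the job state can be read from the STAT column (the third column) of the default `bjobs` output, where values such as PEND and RUN appear. A minimal sketch, assuming the default output format; the function name `job_state` and the `BJOBS_CMD` override are made up for illustration:

```shell
#!/bin/bash
# Sketch: print the STAT column for a single job from default `bjobs` output.

# Hypothetical override so the real binary can be used instead of a wrapper.
BJOBS_CMD="${BJOBS_CMD:-bjobs}"

job_state() {
  local jobid="$1"
  # Default output is a header line followed by rows of
  # JOBID USER STAT QUEUE ...; matching on the job ID skips the header.
  "$BJOBS_CMD" "$jobid" 2>/dev/null |
    awk -v id="$jobid" '$1 == id { print $3; exit }'
}
```

Running `job_state <jobid>` while the OOD card still says "starting" shows whether LSF already reports the job as running.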