Jupyter interactive app question - launching, but not providing link to

Hello,

I installed Open OnDemand 1.5 on a CentOS 7 system last week, and have been pretty successful getting it to work with our cluster and LSF scheduler so far.

I’ve gotten stuck, however, in trying to get the example Jupyter app working correctly. I’ve followed the instructions for installing it in my test account’s developer space. I can launch the Jupyter app, and as best I can tell everything works, except that the web interface never provides a link or any other information that would let the user connect to the launched application.

After selecting ‘Launch’ from the ‘/pun/sys/dashboard/batch_connect/dev/jupyter/session_contexts/new’ URL, the new page loads, and reads, “Session was successfully created.” followed by “Home / My Interactive Sessions”, the ‘Interactive Apps [Sandbox]’ subwindow with ‘Jupyter Notebook’ in it, and “You have no active sessions.”

The job itself is accepted by the scheduler, dispatches to a node, and starts jupyter up seemingly fine:

[output.log]
 Script starting...
 Waiting for Jupyter Notebook server to open port 53550...
 TIMING - Starting wait at: Fri Jul 12 16:43:13 EDT 2019
 TIMING - Starting main script at: Fri Jul 12 16:43:13 EDT 2019
<snip>
 Discovered Jupyter Notebook server listening on port 53550!
 TIMING - Wait ended at: Fri Jul 12 16:43:17 EDT 2019
 Generating connection YAML file...

It does create a connection.yml file:

[connection.yml]
host: c05b06
port: 53550
password: *************

Using the fields from the connection.yml file to manually go to the (in this example) ‘/node/c05b06/53550/’ url, I’m prompted for the jupyter password, and once entered the jupyter notebook comes up with no issue, being proxied as I imagine is expected.

My ood_portal.yml file contains

---
ssl:
  - 'SSLCertificateFile "/etc/pki/tls/certs/***.crt"'
  - 'SSLCertificateKeyFile "/etc/pki/tls/private/***.key"'
auth:
  - 'AuthType Basic'
  - 'AuthName "private"'
  - 'AuthUserFile "/opt/rh/httpd24/root/etc/httpd/.htpasswd"'
  - 'RequestHeader unset Authorization'
  - 'Require valid-user'
host_regex: '[^/]+'
node_uri: '/node'
rnode_uri: '/rnode'

I’m not entirely sure what is supposed to happen after the jupyter job is launched, but I expect something would come up to tell the user that their interactive app has been submitted and that once it’s running, the interface will provide a link or similar for them to click to connect to it via the proxy.

I’ve tried wiping out my install, and starting from scratch again, but haven’t had any luck in figuring out where I’ve gone wrong. Any suggestions for things to check would be appreciated. Thank you very much.

You will want to see a session card for the job appear under the success message.

What version of LSF are you using?

Those session cards are rendered from files, serialized as JSON, that are stored in $HOME/ondemand/data/sys/dashboard/batch_connect/db/

In those files there will be two values:

"cluster_id":"owens",
"job_id":"7273828.owens-batch.ten.osc.edu"

The job_id is the value returned from parsing the output of the bsub command, and it needs to be the same id string that is passed to bjobs to get the status of the job. If this is not working, and the dashboard app tries to check the status of the job but the job appears not to exist (and is thus treated as “completed”), the card will disappear. BTW, the fact that the card doesn’t stick around is a bad design that has yet to be fixed.
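
If it helps, here is a rough sketch for pulling those two values out of the session files so the job_id can be compared with what bjobs expects. The function name show_session_ids is made up for illustration, and the pattern assumes the values are stored as plain "key":"value" pairs like the ones shown above.

```shell
#!/bin/sh
# Hypothetical helper: print the cluster_id and job_id found in each
# session file. Assumes the "key":"value" layout shown above.
show_session_ids() {
  db_dir="${1:-$HOME/ondemand/data/sys/dashboard/batch_connect/db}"
  for f in "$db_dir"/*; do
    # Skip anything that is not a regular file (including an unmatched glob).
    [ -f "$f" ] || continue
    printf '%s\n' "$f"
    grep -Eo '"(cluster_id|job_id)" *: *"[^"]*"' "$f"
  done
}

# Example: show_session_ids
# Then pass the printed job_id to bjobs by hand and see whether it is found.
```

If the id printed here is not accepted by bjobs on the command line, the dashboard will not be able to look the job up either.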

You may be facing a similar problem another LSF site experienced: https://github.com/OSC/ood_core/issues/81#issuecomment-380241311

The issue turned out that we were calling the job status on a host group and not a cluster. So the delay was because the job was Pending and had yet to be dispatched to a host in the requested host group. Once the job was dispatched to a valid host under the requested host group and entered the Running state it would appear in the job status request.

Maybe this is the problem you are facing? OnDemand wants to submit a job to a host using bsub, get the id, use that id to immediately check the status of it using bjobs, and continue to check the status of it using bjobs till the job completes.
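
One way to check for this by hand is to mimic that sequence: submit a job, take the id from the bsub output, and poll bjobs until it stops reporting the job as missing. The sketch below is only an approximation of what OnDemand does. The function names and the BJOBS and POLL_INTERVAL variables are made up for illustration, and it assumes bsub prints "Job <id> is submitted ..." and bjobs prints "Job <id> is not found" for an unknown id.

```shell
#!/bin/sh
# Extract the numeric id from a bsub line such as
# "Job <1234> is submitted to queue <normal>."
parse_bsub_job_id() {
  sed -n 's/^Job <\([0-9][0-9]*\)> is submitted.*/\1/p'
}

# Poll until bjobs recognises the job and report how many attempts it took.
# BJOBS and POLL_INTERVAL are illustrative overrides, not LSF settings.
wait_for_job() {
  job_id="$1"
  max_tries="${2:-20}"
  tries=0
  while [ "$tries" -lt "$max_tries" ]; do
    tries=$((tries + 1))
    if ! "${BJOBS:-bjobs}" "$job_id" 2>&1 | grep -q 'is not found'; then
      echo "bjobs saw job $job_id on attempt $tries"
      return 0
    fi
    sleep "${POLL_INTERVAL:-1}"
  done
  echo "bjobs never reported job $job_id after $max_tries attempts" >&2
  return 1
}

# Example:
#   job_id=$(bsub sleep 60 | parse_bsub_job_id)
#   wait_for_job "$job_id"
```

If the job is only seen on the second attempt or later, the first status check OnDemand makes would likely have come back empty as well.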

Thank you for the response. We’re running LSF 9.1.3 currently (and working on migrating to 10.x).

It looks like the problem is a delay between when bsub returns the job id for a new job and when the bjobs command starts returning that job’s status. I’d been aware of this gap. It’s usually less than a second, but it can run longer, and evidently it’s long enough here that OnDemand considers the job to have failed, since bjobs doesn’t return anything for that id immediately.

I made a wrapper script for the bjobs command to have it sleep for a few seconds before running the actual ‘bjobs’. Once that was in place in the cluster.yml file, my next attempt brought up the Notebook window as per your example. Thanks very much!
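
For anyone who hits the same thing, the wrapper was along these lines. This is a sketch rather than the exact script: the names delayed_bjobs, REAL_BJOBS and BJOBS_DELAY are made up, and the path to the real bjobs is a placeholder that has to be replaced with the site’s actual LSF bin directory.

```shell
#!/bin/sh
# Sketch of a delaying bjobs wrapper. REAL_BJOBS and BJOBS_DELAY are
# illustrative names; /path/to/lsf/bin/bjobs is a placeholder.
delayed_bjobs() {
  # Give LSF a moment to register a freshly submitted job.
  sleep "${BJOBS_DELAY:-3}"
  # Call the real command by absolute path so the wrapper never calls itself.
  "${REAL_BJOBS:-/path/to/lsf/bin/bjobs}" "$@"
}

# In the installed wrapper script, finish with:
#   delayed_bjobs "$@"
```

Note that the delay applies to every bjobs call the dashboard makes, not only the first one after submission, so this is a stopgap and not a long-term fix.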

I’ll have to see how best to address this long-term, but at least in the short term I’m able to move forward. Thank you again!