Interactive sessions stuck at starting stage

Hi there,

It appears our OnDemand interactive apps are stuck at the starting phase.

The jobs run, and I can verify that the apps actually work outside of the dashboard. But the dashboard never gives a link to the app. It just stalls at this point.

Was wondering if I can find any pointers on how to fix this issue?

Thank you for such a cool tool!

RC

Can you share the output.log generated from the session? You can get to output.log by clicking on the Session ID in your Interactive Sessions list.

output.log will be in a directory like this:

Hey Mario,

I don’t see output.log in the file explorer when I click the Session ID.

To clarify a bit more, our login node can read/write to the $HOME directories. The compute nodes can only read from the $HOME directories, but their writes happen through an overlayfs mount.

I had to log in to the compute node to get the output.log:

cat output.log 
/var/spool/slurm/d/job1572404/slurm_script: line 3: module: command not found
Script starting...
Waiting for Jupyter server to open port 7449...
TIMING - Starting main script at: Mon May 18 09:48:34 PDT 2020
TIMING - Starting jupyter at: Mon May 18 09:48:34 PDT 2020
+ jupyter-lab --config=/home/rcwhite/ondemand/data/sys/dashboard/batch_connect/dev/jupyter_test/output/694741ad-a664-4eee-924a-c0db3dd9961b/config.py
[W 09:48:35.689 LabApp] WARNING: The notebook server is listening on all IP addresses and not using encryption. This is not recommended.
[W 09:48:35.700 LabApp] JupyterLab server extension not enabled, manually loading...
[I 09:48:35.707 LabApp] JupyterLab extension loaded from /usr/local/lib/python3.6/site-packages/jupyterlab
[I 09:48:35.707 LabApp] JupyterLab application directory is /usr/local/share/jupyter/lab
[I 09:48:35.711 LabApp] Serving notebooks from local directory: /home/rcwhite
[I 09:48:35.711 LabApp] The Jupyter Notebook is running at:
[I 09:48:35.711 LabApp] http://(maz044 or 127.0.0.1):7449/node/maz044/7449/
[I 09:48:35.711 LabApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
Discovered Jupyter server listening on port 7449!
Generating connection YAML file...

Would the type of configuration we have cause the apps to stall? Is the home directory used as a sort of IPC?

We have BeeGFS shares that are writable by all systems; only our home directory has this caveat.

Make sure that compute nodes are syncing their writes properly. connection.yml is generated on the compute node directly. If writes from compute nodes aren’t synced to your file system quickly enough, that would explain the stall: the login node is waiting for connection.yml to exist.

The home directory is similar to IPC. For example, the login node is looking for /jupyter_test/output/694741ad-a664-4eee-924a-c0db3dd9961b/connection.yml
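
To check whether a file written on a compute node ever becomes visible on the login node, something like the following could be run on the login node. This is only a rough sketch of the "wait for the file to exist" idea described above, not OnDemand's actual implementation; the helper name wait_for_file and its timeout and interval defaults are made up for illustration.

```python
import os
import time


def wait_for_file(path, timeout=30.0, interval=0.5):
    """Poll until `path` exists or `timeout` seconds pass.

    Hypothetical helper for diagnosis only. Returns True if the file
    showed up in time, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while True:
        if os.path.exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Calling wait_for_file() on the login node with the full path to the session's connection.yml, while the job is running, should return True if compute-node writes are reaching the shared file system. With an overlayfs-backed home directory like the one described earlier in this thread, it would be expected to return False, which matches the stall.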

Hi Mario,

Thank you for the clarification. I had figured it was something along those lines. The group I work with will meet next Friday and discuss setting up read/write on the home directory across the cluster as a whole.