Flaky Jupyter Notebook sandbox app behavior

Following the docs, I installed a Jupyter app, but it starts unreliably. I can log in and create a new python3 notebook, but usually I get the brown ‘connecting to kernel’ button in the upper left and the kernel fails to connect. In the web browser console I see many messages such as:

kernel.js:465 WebSocket connection to 'wss://carina-ondemand.usc.edu/node/carina-10-123-0-3/43741/api/kernels/78146f6b-6494-42c4-ade1-376d6e2617b1/channels?session_id=b1faef4b44834b2f81843e141e6faf96' failed: Error during WebSocket handshake: Unexpected response code: 503
Kernel.start_channels @ kernel.js:465
Kernel.reconnect @ kernel.js:359
i @ jquery.min.js:2
kernel.js:106 Kernel: kernel_disconnected (78146f6b-6494-42c4-ade1-376d6e2617b1)
kernel.js:546 WebSocket connection failed:  wss://carina-ondemand.usc.edu/node/carina-10-123-0-3/43741/api/kernels/78146f6b-6494-42c4-ade1-376d6e2617b1 true
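
When reading errors like these, it can help to split the failing URL into the proxy host, the compute node, and the port that the path names, so they can be compared with what the notebook server reports at startup. The following is a minimal sketch, assuming the `/node/<host>/<port>/` layout seen in the URLs above; the function name `parse_node_url` is hypothetical.

```python
from urllib.parse import urlparse


def parse_node_url(url):
    """Split a /node/<host>/<port>/... proxy URL into its routing parts."""
    parsed = urlparse(url)
    parts = parsed.path.strip("/").split("/")
    if len(parts) < 3 or parts[0] != "node":
        raise ValueError(f"not a /node/<host>/<port>/ URL: {url!r}")
    return {
        "proxy": parsed.hostname,       # host the browser talks to
        "backend_host": parts[1],       # compute node named in the path
        "backend_port": int(parts[2]),  # port named in the path
    }


# URL copied from the failing WebSocket message above.
failing = ("wss://carina-ondemand.usc.edu/node/carina-10-123-0-3/43741"
           "/api/kernels/78146f6b-6494-42c4-ade1-376d6e2617b1")
print(parse_node_url(failing))
```

The backend host and port it prints are the ones the proxy is expected to forward the WebSocket to.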

Often, I can log out and then log back in and I can connect to a saved notebook fine. Other times, the app will connect to the kernel but when I log out and log back in it won’t connect.

Starting up I get messages similar to

Script starting...
Waiting for Jupyter Notebook server to open port 43741...
TIMING - Starting main script at: Sat Mar  6 19:30:40 UTC 2021
TIMING - Starting wait at: Sat Mar  6 19:30:40 UTC 2021
ModuleCmd_Load.c(213):ERROR:105: Unable to locate a modulefile for 'python'
No Modulefiles Currently Loaded.
TIMING - Starting jupyter at: Sat Mar  6 19:30:40 UTC 2021
+ jupyter notebook --config=/home/christay/ondemand/data/sys/dashboard/batch_connect/dev/jupyter/output/197fc270-5132-43ec-ab73-9491e1dddd26/config.py
[W 19:30:41.859 NotebookApp] WARNING: The notebook server is listening on all IP addresses and not using encryption. This is not recommended.
[I 19:30:41.865 NotebookApp] Serving notebooks from local directory: /home/christay
[I 19:30:41.865 NotebookApp] Jupyter Notebook 6.2.0 is running at:
[I 19:30:41.865 NotebookApp] http://carina-10-123-0-3:43741/node/carina-10-123-0-3/43741/
[I 19:30:41.865 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
Discovered Jupyter Notebook server listening on port 43741!
TIMING - Wait ended at: Sat Mar  6 19:30:42 UTC 2021
Generating connection YAML file...
[I 19:30:57.708 NotebookApp] 302 POST /node/carina-10-123-0-3/43741/login (10.123.0.7) 2.690000ms
[I 19:30:57.963 NotebookApp] 302 GET /node/carina-10-123-0-3/43741/ (10.123.0.7) 0.930000ms
[I 19:31:56.058 NotebookApp] Creating new notebook in
[I 19:32:00.513 NotebookApp] Kernel started: f6da1793-920a-4201-ab33-53c1bbc2a75f, name: python3
[I 19:32:55.788 NotebookApp] Saving file at /jupytertest1.ipynb
[I 19:33:02.499 NotebookApp] Starting buffering for f6da1793-920a-4201-ab33-53c1bbc2a75f:39a5363159f2436d8d993d99e7f720a4
[W 19:33:08.145 NotebookApp] Forbidden
[W 19:33:08.146 NotebookApp] 403 GET /node/carina-10-123-0-3/43741/api/sessions?_=1615059062031 (10.123.0.7) 2.410000ms referer=https://carina-ondemand.usc.edu/node/carina-10-123-0-3/43741/tree?
[W 19:33:08.146 NotebookApp] Forbidden

I think name resolution is working fine: my OnDemand server can resolve the (unqualified) hostname of the compute node running the notebook server. Even if I start a desktop and web browser on the actual node running the notebook server process and log in through OnDemand, I get errors connecting to the kernel.
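
One way to double-check both name resolution and reachability from the OnDemand host is to resolve the node name and then attempt a plain TCP connection to the notebook port. This is a minimal sketch using only the standard library; the function name `check_backend` is hypothetical, and the host and port in the example comment are taken from the log above.

```python
import socket


def check_backend(host, port, timeout=3.0):
    """Return (resolved_ip, reachable) for host:port.

    resolved_ip is None if the name does not resolve; reachable is True
    only if a TCP connection to the port succeeds within the timeout.
    """
    try:
        ip = socket.gethostbyname(host)
    except socket.gaierror:
        return None, False
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return ip, True
    except OSError:
        return ip, False


# Example, run on the OnDemand host while a session is active:
#   check_backend("carina-10-123-0-3", 43741)
```

A successful TCP connection only shows that something is listening on that port; it does not prove the WebSocket upgrade itself will succeed.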

I don’t have any host firewalls running, and SELinux is not enabled.

I’m utterly at a loss as to where to start troubleshooting. I can’t find any pattern in when it works and when it doesn’t. Can someone please suggest how to troubleshoot? Thanks.

Hi everyone, can anyone suggest what other information I could provide to help track down this intermittent problem? Thanks.

I think I would start with the "Unable to locate a modulefile for 'python'" error in your log. Is the user environment complete when that is called?

We are seeing some intermittent issues like this where the module system is not available. We think our problem is some sort of race condition caused by having a remote file system for home (about 3 km away) on some of our clusters.

That’s interesting, thanks. I do have my home directory NFS mounted. But this is a small test cluster with minimal latency between nodes, and I’m not using a module system; python3 is just in my user’s path. Is it a requirement to load python as a module?
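
Since the environment inside a batch job can differ from a login shell, one way to answer the "is the environment complete" question is to print what the job itself finds on `PATH`. This is a minimal sketch, assuming it is run with the same interpreter and environment as the job script; the function name `locate_tools` and the choice of commands to look up are illustrative.

```python
import os
import shutil


def locate_tools(names=("python3", "jupyter")):
    """Map each command name to the full path found on PATH, or None."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    # Print PATH as the job sees it, then where each command resolves.
    print("PATH =", os.environ.get("PATH", ""))
    for name, path in locate_tools().items():
        print(f"{name}: {path or 'NOT FOUND'}")
```

Comparing this output between a working and a failing session would show whether the environment actually differs between runs.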

Here’s a strange thing: I get the "Unable to locate a modulefile" error even on a run where the interactive Notebook session starts and works perfectly. Here is one such run:

$ cat output.log
Script starting...
Waiting for Jupyter Notebook server to open port 8071...
TIMING - Starting main script at: Sat Mar 20 06:05:33 UTC 2021
TIMING - Starting wait at: Sat Mar 20 06:05:33 UTC 2021
ModuleCmd_Load.c(213):ERROR:105: Unable to locate a modulefile for 'python'
No Modulefiles Currently Loaded.
TIMING - Starting jupyter at: Sat Mar 20 06:05:33 UTC 2021
+ jupyter notebook --config=/home/christay/ondemand/data/sys/dashboard/batch_connect/dev/jupyter/output/e0b930ac-2ac6-41c9-9354-dbdb1c3afa72/config.py
[W 06:05:35.567 NotebookApp] WARNING: The notebook server is listening on all IP addresses and not using encryption. This is not recommended.
[I 06:05:35.574 NotebookApp] Serving notebooks from local directory: /home/christay
[I 06:05:35.574 NotebookApp] Jupyter Notebook 6.2.0 is running at:
[I 06:05:35.574 NotebookApp] http://carina-10-123-0-3:8071/node/carina-10-123-0-3/8071/
[I 06:05:35.574 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
Discovered Jupyter Notebook server listening on port 8071!
TIMING - Wait ended at: Sat Mar 20 06:05:35 UTC 2021
Generating connection YAML file...
[I 06:05:52.073 NotebookApp] 302 POST /node/carina-10-123-0-3/8071/login (10.123.0.7) 3.000000ms
[I 06:05:52.311 NotebookApp] 302 GET /node/carina-10-123-0-3/8071/ (10.123.0.7) 0.960000ms
[I 06:06:02.239 NotebookApp] Creating new notebook in
[I 06:06:07.794 NotebookApp] Kernel started: d7250af7-0ada-4e6d-9735-c5b4fff36328, name: python3
[I 06:06:24.505 NotebookApp] Starting buffering for d7250af7-0ada-4e6d-9735-c5b4fff36328:195b31fb0b7a41558708f352a6775885
[I 06:06:24.814 NotebookApp] Saving file at /Untitled6.ipynb
[W 06:06:28.215 NotebookApp] Forbidden
[W 06:06:28.216 NotebookApp] 403 GET /node/carina-10-123-0-3/8071/api/sessions?_=1616220355328 (10.123.0.7) 3.390000ms referer=https://carina-ondemand.usc.edu/node/carina-10-123-0-3/8071/tree?
[W 06:06:28.217 NotebookApp] Forbidden
[W 06:06:28.218 NotebookApp] 403 GET /node/carina-10-123-0-3/8071/api/terminals?_=1616220355329 (10.123.0.7) 3.780000ms referer=https://carina-ondemand.usc.edu/node/carina-10-123-0-3/8071/tree?
slurmstepd: error: *** JOB 133 ON carina-10-123-0-3 CANCELLED AT 2021-03-20T06:06:37 ***
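
To compare a working run with a failing one, it may help to pull only the failed requests out of `output.log`. This is a minimal sketch, assuming the request-line format shown in the logs above (`[LEVEL time NotebookApp] status METHOD path ...`); the function name `failed_requests` is hypothetical, and the sample lines are copied from the log.

```python
import re

# Matches request lines such as:
#   [W 06:06:28.216 NotebookApp] 403 GET /node/.../api/sessions?_=... (ip) ...
REQUEST_RE = re.compile(
    r"^\[(?P<level>[A-Z]) (?P<time>[\d:.]+) NotebookApp\] "
    r"(?P<status>\d{3}) (?P<method>[A-Z]+) (?P<path>\S+)"
)


def failed_requests(lines):
    """Return (time, status, method, path-without-query) for status >= 400."""
    found = []
    for line in lines:
        m = REQUEST_RE.match(line)
        if m and int(m["status"]) >= 400:
            path = m["path"].split("?", 1)[0]
            found.append((m["time"], int(m["status"]), m["method"], path))
    return found


sample = [
    "[I 06:05:52.073 NotebookApp] 302 POST /node/carina-10-123-0-3/8071/login (10.123.0.7) 3.000000ms",
    "[W 06:06:28.215 NotebookApp] Forbidden",
    "[W 06:06:28.216 NotebookApp] 403 GET /node/carina-10-123-0-3/8071/api/sessions?_=1616220355328 (10.123.0.7) 3.390000ms referer=https://carina-ondemand.usc.edu/node/carina-10-123-0-3/8071/tree?",
]
for entry in failed_requests(sample):
    print(entry)
```

With a real log, something like `failed_requests(open("output.log"))` would list every 4xx/5xx request along with its timestamp.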

What do your configs look like? Specifically, allow_origin and disable_check_xsrf. I’d home in on the 403 Forbidden. Maybe there’s some setting that’s off, or something in the middle stripping the authorization header?

Thanks for your suggestions. Here is what is in the ondemand/dev/jupyter/template/before.sh.erb:

(
umask 077
cat > "${CONFIG_FILE}" << EOL
c.NotebookApp.ip = '*'
c.NotebookApp.port = ${port}
c.NotebookApp.port_retries = 0
c.NotebookApp.password = u'sha1:${SALT}:${PASSWORD_SHA1}'
c.NotebookApp.base_url = '/node/${host}/${port}/'
c.NotebookApp.open_browser = False
c.NotebookApp.allow_origin = '*'
c.NotebookApp.notebook_dir = '${HOME}'
c.NotebookApp.disable_check_xsrf = True
EOL
)

On my OnDemand host, in the _error_ssl.log I keep seeing:

[Sat Mar 27 20:02:25.411031 2021] [proxy:error] [pid 2093] (111)Connection refused: AH00957: WS: attempt to connect to 10.123.0.3:80 (*) failed
[Sat Mar 27 20:02:25.411108 2021] [proxy_wstunnel:error] [pid 2093] [client 10.21.74.14:4078] AH02452: failed to make connection to backend: carina-10-123-0-3

10.123.0.3 is the compute node where the Python notebook is running. I verified that port 80 is open: my OnDemand host can do a GET with curl or something similar, so nothing is blocking port 80.
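
The Apache lines above show the WebSocket proxy attempting to reach 10.123.0.3 on port 80, whereas the Jupyter logs earlier in this thread show the notebook listening on ports such as 43741 and 8071. Whether that difference explains the failures is not established here, but it seems worth checking. This is a minimal sketch that extracts the target of each AH00957 line so it can be compared with the notebook's port; the function name `proxy_targets` is hypothetical, and the sample lines are copied from the log.

```python
import re

# Matches e.g. "AH00957: WS: attempt to connect to 10.123.0.3:80 (*) failed"
AH00957_RE = re.compile(
    r"AH00957: (?P<scheme>\w+): attempt to connect to "
    r"(?P<ip>\d{1,3}(?:\.\d{1,3}){3}):(?P<port>\d+)"
)


def proxy_targets(lines):
    """Return the set of (scheme, ip, port) targets named in AH00957 lines."""
    targets = set()
    for line in lines:
        m = AH00957_RE.search(line)
        if m:
            targets.add((m["scheme"], m["ip"], int(m["port"])))
    return targets


sample = [
    "[Sat Mar 27 20:02:25.411031 2021] [proxy:error] [pid 2093] (111)Connection refused: AH00957: WS: attempt to connect to 10.123.0.3:80 (*) failed",
    "[Sat Mar 27 20:02:25.411108 2021] [proxy_wstunnel:error] [pid 2093] [client 10.21.74.14:4078] AH02452: failed to make connection to backend: carina-10-123-0-3",
]
print(proxy_targets(sample))
```

If the ports reported here never match the port in the session's `/node/<host>/<port>/` URL, the proxy configuration on the OnDemand host would be a reasonable next place to look.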