Jupyter failing to start

We have an instance of Open OnDemand 1.80.20. I have been working to get Jupyter running but have encountered errors that I cannot resolve. Exceptions are thrown in the jupyter-notebook script, yet the job keeps trying to run. It then fails with a PermissionError on /run/user/user_id, followed by a timeout waiting for the port to open (see the config and log below). I haven’t seen anything in the discussions that covers this series of errors.

form.yml

---

# **MUST** set cluster id here that matches cluster configuration file located
# under /etc/ood/config/clusters.d/*.yml

cluster: "summit"

# Define attribute values that aren't meant to be modified by the user within
# the Dashboard form
attributes:
  # Set the corresponding modules that need to be loaded for Jupyter to run
  #
  # @note It is called within the batch job as `module load <modules>` if
  #   defined
  # @example Do not load any modules
  #     modules: ""
  # @example Using default python module
  #     modules: "python"
  # @example Using specific python module
  #     modules: "python/3.5"
  # @example Using combination of modules
  #     modules: "python/3.5 cuda/8.0.44"
  modules: "python/3.5.1"

  # Any extra command line arguments to feed to the `jupyter notebook ...`
  # command that launches the Jupyter notebook within the batch job
  extra_jupyter_args: ""

# All of the attributes that make up the Dashboard form (in respective order),
# and made available to the submit configuration file and the template ERB
# files
#
# @note You typically do not need to modify this unless you want to add a new
#   configurable value
# @note If an attribute listed below is hard-coded above in the `attributes`
#   option, then it will not appear in the form page that the user sees in the
#   Dashboard

form:
  - modules
  - extra_jupyter_args
  - bc_account
  - bc_queue
  - bc_num_hours
  - bc_num_slots
  - bc_email_on_started

output.log

Script starting...
Waiting for Jupyter Notebook server to open port 15141...
TIMING - Starting wait at: Thu Jun 10 11:19:53 MDT 2021
TIMING - Starting main script at: Thu Jun 10 11:19:53 MDT 2021

Currently Loaded Modules:
  1) python/3.5.1

TIMING - Starting jupyter at: Thu Jun 10 11:19:54 MDT 2021
+ jupyter notebook --config=/home/jasw8470/ondemand/data/sys/dashboard/batch_connect/dev/my_jupyter_app/output/565fc17f-bc4f-4ddb-811a-9cd90d1262ea/config.py
[W 11:20:04.603 NotebookApp] Config option `disable_check_xsrf` not recognized by `NotebookApp`.
[W 11:20:04.608 NotebookApp] Config option `disable_check_xsrf` not recognized by `NotebookApp`.
Traceback (most recent call last):
  File "/curc/sw/python/3.5.1/lib/python3.5/site-packages/traitlets/traitlets.py", line 501, in get
    value = obj._trait_values[self.name]
KeyError: 'runtime_dir'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/curc/sw/python/3.5.1/bin/jupyter-notebook", line 11, in <module>
    sys.exit(main())
  File "/curc/sw/python/3.5.1/lib/python3.5/site-packages/jupyter_core/application.py", line 267, in launch_instance
    return super(JupyterApp, cls).launch_instance(argv=argv, **kwargs)
  File "/curc/sw/python/3.5.1/lib/python3.5/site-packages/traitlets/config/application.py", line 595, in launch_instance
    app.initialize(argv)
  File "<decorator-gen-7>", line 2, in initialize
  File "/curc/sw/python/3.5.1/lib/python3.5/site-packages/traitlets/config/application.py", line 74, in catch_config_error
    return method(app, *args, **kwargs)
  File "/curc/sw/python/3.5.1/lib/python3.5/site-packages/notebook/notebookapp.py", line 1069, in initialize
    self.init_configurables()
  File "/curc/sw/python/3.5.1/lib/python3.5/site-packages/notebook/notebookapp.py", line 842, in init_configurables
    connection_dir=self.runtime_dir,
  File "/curc/sw/python/3.5.1/lib/python3.5/site-packages/traitlets/traitlets.py", line 529, in __get__
    return self.get(obj, cls)
  File "/curc/sw/python/3.5.1/lib/python3.5/site-packages/traitlets/traitlets.py", line 508, in get
    value = self._validate(obj, dynamic_default())
  File "/curc/sw/python/3.5.1/lib/python3.5/site-packages/jupyter_core/application.py", line 99, in _runtime_dir_default
    ensure_dir_exists(rd, mode=0o700)
  File "/curc/sw/python/3.5.1/lib/python3.5/site-packages/ipython_genutils/path.py", line 167, in ensure_dir_exists
    os.makedirs(path, mode=mode)
  File "/curc/sw/python/3.5.1/lib/python3.5/os.py", line 231, in makedirs
    makedirs(head, mode, exist_ok)
  File "/curc/sw/python/3.5.1/lib/python3.5/os.py", line 241, in makedirs
    mkdir(name, mode)
PermissionError: [Errno 13] Permission denied: '/run/user/751315'
Timed out waiting for Jupyter Notebook server to open port 15141!
TIMING - Wait ended at: Thu Jun 10 11:21:45 MDT 2021
Cleaning up...

There was a thread on this

https://discourse.openondemand.org/t/ood-1-5-5-interactive-desktop-showing-black-screen-no-run-user-uid-directory/568/4

back in the OOD 1.5 days. I don’t remember whether an OS update or an OOD update fixed this, but I haven’t run afoul of it with our Jupyter notebook setup (CentOS 7.9, OOD 1.8, SLURM scheduler).

Cheers,

Ric

OOPS! In /etc/profile.d/zz_last.sh on each compute node are the following 4 lines:

# magic to deal with unsetting XDG_RUNTIME_DIR; It generally doesn’t exist on
# compute nodes and if XDG_RUNTIME_DIR doesn’t exist, mate desktop makes
# a mess of itself.
unset XDG_RUNTIME_DIR

So that was apparently our solution to the error you’re getting or some variant of it.
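
If a blanket unset feels too broad, a more conservative variant is sketched below. It only drops XDG_RUNTIME_DIR when the directory it names is missing or not writable. The function name sanitize_xdg_runtime_dir is made up for illustration and is not part of our zz_last.sh or of OOD.

```shell
# Hypothetical helper (not from zz_last.sh): drop XDG_RUNTIME_DIR only when
# it points at a directory that is missing or not writable by this user.
sanitize_xdg_runtime_dir() {
  if [ -n "${XDG_RUNTIME_DIR:-}" ]; then
    if [ ! -d "$XDG_RUNTIME_DIR" ] || [ ! -w "$XDG_RUNTIME_DIR" ]; then
      unset XDG_RUNTIME_DIR
    fi
  fi
}

# Call it once at the end of the profile script.
sanitize_xdg_runtime_dir
```

On nodes where /run/user/<uid> does exist and is usable, this leaves the variable alone.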

Ric

Sorry, I got sidetracked by a file-editor bug in 2.0. I’ve seen this before and quickly checked, but don’t recall the error. It’s something to do with a config defaulting to the XDG_RUNTIME_DIR, which is the /run/user directory.

Check this file at this line for _runtime_dir_default. A spot check of our app doesn’t look like we set the environment variable for it, but I’m not sure without looking into it further.

File "/curc/sw/python/3.5.1/lib/python3.5/site-packages/jupyter_core/application.py", line 99, in _runtime_dir_default
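
To confirm that this is what is happening on a compute node, a small check like the following could be dropped into the job script temporarily. It is only a sketch: check_runtime_dir is a hypothetical name, and it inspects just the immediate parent directory, whereas os.makedirs would try to create missing parents too.

```shell
# Hypothetical diagnostic: classify the directory named by the first argument
# (or XDG_RUNTIME_DIR if no argument is given) the way the job would see it.
check_runtime_dir() {
  _dir="${1:-${XDG_RUNTIME_DIR:-}}"
  if [ -z "$_dir" ]; then
    echo "unset"        # nothing to default to
  elif [ -d "$_dir" ] && [ -w "$_dir" ]; then
    echo "writable"     # directory exists and can be used as-is
  elif [ ! -e "$_dir" ] && [ -w "$(dirname "$_dir")" ]; then
    echo "creatable"    # missing, but the parent allows mkdir
  else
    echo "blocked"      # missing or unwritable, and the parent is no help
  fi
}

# Report the state for the current environment.
check_runtime_dir
```

In the log above, /run/user/751315 does not exist and the mkdir under /run/user is denied, so I would expect this to report "blocked" there.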

So I added:

export XDG_RUNTIME_DIR="/tmp/${UID}"

to template/script.sh.erb and it now works.

This is on RHEL 8.3, OOD 1.8, and Slurm.
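
For anyone who would rather not share a fixed /tmp/${UID} path between jobs, here is a sketch of a per-job variant for template/script.sh.erb. It is not what I actually deployed. It assumes mktemp is available on the compute nodes and that TMPDIR, if set, points at node-local scratch. As far as I know jupyter_core also honors JUPYTER_RUNTIME_DIR ahead of XDG_RUNTIME_DIR, so the sketch sets both, but please verify that against your installed version.

```shell
# Sketch only: give Jupyter a private, per-job runtime directory instead of
# the missing /run/user/<uid>. The "xdg-runtime-" prefix is an arbitrary choice.
runtime_base="${TMPDIR:-/tmp}"
XDG_RUNTIME_DIR="$(mktemp -d "${runtime_base}/xdg-runtime-$(id -u).XXXXXX")"
chmod 700 "$XDG_RUNTIME_DIR"   # mktemp -d already restricts access; this makes it explicit
export XDG_RUNTIME_DIR

# Assumption: jupyter_core reads JUPYTER_RUNTIME_DIR before XDG_RUNTIME_DIR.
JUPYTER_RUNTIME_DIR="${XDG_RUNTIME_DIR}/jupyter"
mkdir -p "$JUPYTER_RUNTIME_DIR"
chmod 700 "$JUPYTER_RUNTIME_DIR"
export JUPYTER_RUNTIME_DIR
```

The directory is not removed automatically, so the job's cleanup step would need an rm -rf "$XDG_RUNTIME_DIR" if leftover directories are a concern.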