Stability Issues using OnDemand

Hello, we have experienced some stability issues with Open OnDemand at varying numbers of simultaneous users (as few as 10, as many as 80). We use this cluster to run training sessions and bootcamps, and OOD is a key part of our workflow.

We deploy OOD using NVIDIA DeepOps, which uses the OSC OOD Ansible role under the hood. We don’t expose the cluster login node to the Internet directly; instead, users SSH tunnel in and connect through the tunnel to http://localhost:9090.
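
For reference, the tunnel each user sets up looks something like this (the hostname is just a placeholder):

ssh -L 9090:localhost:9090 user@cluster-login-node   # forward local port 9090 to OOD on the login node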

The two main issues we’re experiencing are:

  1. A subset of users who successfully logged in to the cluster were not able to open the OnDemand page (http://localhost:9090/) and launch the labs. (I tested this personally using their credentials.)
    The errors reported include:
    - Internal Server Error
    - 502 Bad Gateway error on the page
    - failed to map user
    - 403 Forbidden error on the page

  2. A subset of users who successfully logged in to the cluster and launched the labs received errors like the following:
    Service Unavailable
    The server is temporarily unable to service your request due to maintenance downtime or capacity problems. Please try again later.
    Apache/2.4.29 (Ubuntu) Server at localhost Port 9090

Our workaround for both of these is to run
sudo killall nginx
But doing this in the middle of a bootcamp kills all the per-user NGINX processes and affects all the users, so we are hoping to find the root of the problem so that we don’t have to resort to these measures.

We are requesting help with this issue. What next steps would you recommend for gathering information or addressing this issue?

The “failed to map user” / 403 Forbidden error could be an sssd caching issue. You say the users have to SSH into the box first, then connect to OOD?
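
If it is sssd, clearing the cache on that box is a quick thing to try (sss_cache ships with sssd-tools; -E invalidates all cached entries):

sudo sss_cache -E            # expire everything in the sssd cache
sudo systemctl restart sssd  # heavier hammer if that doesn’t help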

I’d check out /var/log/apache2 for the 502 Bad Gateway errors, and /var/log/messages or journalctl for some of the 503s.
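
Something along these lines should surface the relevant entries (paths assume the stock Ubuntu Apache layout):

tail -f /var/log/apache2/error.log          # proxy-side errors behind the 502s
journalctl -u apache2 --since "1 hour ago"  # service-level errors around the 503s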

I’d wonder if some of the 2nd error is load on Apache or ulimits. I know apache2 ships with some silly defaults (like MaxKeepAliveRequests 100); maybe tweak these to increase capacity? Also look into the Apache event MPM and increasing its settings (or the worker MPM if you want simpler config settings).
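
As a rough starting point for the event MPM, something like this (the numbers are illustrative guesses to tune against your peak user count, not tested values):

# /etc/apache2/mods-available/mpm_event.conf
<IfModule mpm_event_module>
    StartServers             4
    ServerLimit             16
    ThreadsPerChild         64
    MaxRequestWorkers     1024    # must be <= ServerLimit * ThreadsPerChild
    MinSpareThreads         64
    MaxSpareThreads        256
    MaxConnectionsPerChild   0
</IfModule>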

Also, you can have users bookmark http://localhost:9090/nginx/stop?redir=/pun/sys/dashboard/ so they can restart their own PUNs when required.
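
On the admin side, the nginx_stage utility that ships with OOD can manage PUNs one user at a time instead of killall (the path below is the default install location; check nginx_stage --help for the exact subcommands in your version):

sudo /opt/ood/nginx_stage/sbin/nginx_stage nginx_list   # list users with a running PUN
sudo /opt/ood/nginx_stage/sbin/nginx_stage nginx_clean  # clean up PUNs with no active sessions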

Thank you for the recommendations! 🙂

We have actually already updated MaxKeepAliveRequests to 1000 in the past, so I think we are all good there.

We followed your tips and dug into the various /var/log/ directories, where we found a few errors such as:

From /var/log/ondemand-nginx

[ W 2020-05-21 16:25:55.0089 48255/T3 age/Cor/App/Poo/AnalyticsCollection.cpp:61 ]: ERROR: Cannot fork() a new process: Resource temporarily unavailable (errno=11)

[ C 2020-05-20 04:56:51.5854 37347/T1 age/Cor/CoreMain.cpp:1342 ]: ERROR: boost::thread_resource_error: Resource temporarily unavailable

From /var/log/apache2

[ (111)Connection refused: AH02454: HTTP: attempt to connect to Unix domain socket /var/run/ondemand-nginx/u00u5sy0nohcJdb8W9357/passenger.sock (*) failed

We think these errors suggest that we might be hitting some ulimit resource limits. Any suggestions on how to increase these limits specifically for the OOD processes would be greatly appreciated!

I found that we set this file so root has unlimited processes, so maybe that’s why it works for us: we initialize things as root and then fork into the user.

[~()] 🐯  cat /etc/security/limits.d/20-nproc.conf 
# This file is being maintained by Puppet.
# DO NOT EDIT
*    soft nproc 4096
root soft nproc unlimited
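
If that 4096 soft cap on regular users is what you’re hitting, a later drop-in should raise it (filename and values here are just examples; pam_limits reads the limits.d files in filename order, so this one lands after 20-nproc.conf):

# /etc/security/limits.d/99-ondemand-nproc.conf
*    soft    nproc    16384
*    hard    nproc    16384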

We run RHEL 7.9 and here are my ulimits as a regular user. I’m having trouble sudoing into root to see its ulimits, but from searching /etc/security/, that nproc override is the only one I came up with.

[~()] 🐼  ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 256899
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 4096
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
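
Since sudo is being difficult, one other trick: you can read the effective limits straight off a running process via /proc, which also confirms what the OOD processes actually got (the ‘Passenger core’ process name is Passenger 5’s convention):

cat /proc/$(pgrep -f 'Passenger core' | head -n 1)/limits   # limits of a live Passenger core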