Hello, we have been experiencing stability issues with Open OnDemand (OOD) under varying numbers of simultaneous users (from as few as 10 to as many as 80). We use this cluster to run training sessions and bootcamps, and OOD is a key part of our workflow.
We deploy OOD using NVIDIA DeepOps, which uses the OSC OOD Ansible role under the hood. We don’t expose the cluster login node to the Internet directly; instead, our users set up an SSH tunnel and reach OOD through it at http://localhost:9090.
The two main issues we’re experiencing are:
A subset of users who successfully logged in to the cluster were not able to open the OnDemand page (http://localhost:9090/) and launch the labs. (I reproduced this personally using their credentials.)
The errors reported include:
Internal Server Error
502 bad gateway error on the page
failed to map user
403 forbidden error on the page
A subset of users who successfully logged in to the cluster and launched the labs received the following error:
The server is temporarily unable to service your request due to maintenance downtime or capacity problems. Please try again later.
Apache/2.4.29 (Ubuntu) Server at localhost Port 9090
Our workaround for both of these is to run:
sudo killall nginx
But doing this in the middle of a bootcamp kills all the processes and affects all the users, so we are hoping to find the root cause of the problem so that we don’t have to resort to this.
We are requesting help with this issue. What next steps would you recommend for gathering information or addressing this issue?
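In case it helps with gathering information, here is a minimal sketch of how the errors above (403, 502, 503, and so on) could be tallied per user from the Apache access log. It assumes the log uses the standard common/combined layout with the authenticated user in the third field; the actual LogFormat in an OOD deployment may differ, and the function name `tally_errors` is just a placeholder for illustration.

```python
import re
from collections import Counter

# Assumed layout: Apache common/combined log format, e.g.
#   host ident user [time] "request" status size ...
# The real LogFormat on an OOD host may differ; adjust the pattern if so.
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)'
)


def tally_errors(lines):
    """Count (user, status) pairs for responses with status >= 400."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that do not fit the assumed format
        status = int(match.group("status"))
        if status >= 400:
            counts[(match.group("user"), status)] += 1
    return counts
```

Passing an open access log file to `tally_errors` and printing the resulting counter would show whether the failures cluster on particular users or particular status codes, which might help narrow down whether the problem is in the mapping/auth step or in the per-user backend.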