We recently added some new nodes to our cluster and can’t figure out why noVNC won’t connect to them. SSH to the nodes works fine, and when a user launches an interactive session TurboVNC is running on the node and the connection is set up.
The user’s log files show exactly the same output on the working nodes as on the broken nodes, except for these two lines, which appear only on the broken nodes:
ERROR: NVIDIA driver is not loaded
ERROR: Unable to load info from any available system
We believe this is a red herring: it only indicates that NVIDIA drivers are installed but not loaded, which should not matter because we aren’t running OpenGL on these nodes.
We changed the hostnames on these new nodes, but the regex changes we made to the ood_portal.yml file seem to work: the FQDN is set up correctly in the connection URL, and the SSH connection works when an interactive job is launched.
There are no errors in the logs on the OOD server itself, and nothing in the logs on the nodes. The vnc.log shows exactly the same output on the working nodes and the broken nodes, right up to the connection part. On the broken nodes it simply stops at this line:
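For reference, this is roughly how the regex can be sanity-checked outside of OOD. It is only a sketch: the pattern and hostnames below are hypothetical placeholders, not our real values, and it assumes the portal effectively requires the whole hostname in the URL to match the pattern.

```python
import re

# Hypothetical stand-in for the host regex configured in ood_portal.yml.
HOST_REGEX = r"[\w-]+\.cluster\.example\.edu"


def host_allowed(host: str, pattern: str = HOST_REGEX) -> bool:
    """Return True only if the entire hostname matches the pattern."""
    return re.fullmatch(pattern, host) is not None


if __name__ == "__main__":
    # Hypothetical old-style and new-style node names.
    for name in ("node01.cluster.example.edu", "gpu-node01.cluster.example.edu"):
        print(name, host_allowed(name))
```

If a new hostname fails a check like this, the proxy would be expected to reject the request before it ever reaches the node.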
29/01/2020 09:17:00 VNC extension running!
I can’t say the new nodes are EXACTLY the same as the old nodes, as we started from a new installation and configuration setup. However, we’ve reviewed everything we have in place for OOD on the old nodes and can’t find anything different on the new nodes: websockify and TurboVNC are set up exactly the same. So we’re trying to figure out what the VNC connection is doing at the point where it fails. Are there any temp files it might be creating? Maybe we missed some restrictions on writing files on these new installations. Are there other places where OOD logs that we can check? We’ve looked at /var/log/* on the OOD server, in the user’s OnDemand directory, and in the nodes’ log files.
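One more check that could be run from the OOD server is whether the websocket port on the node accepts TCP connections at all, since a working SSH connection does not prove that other ports are reachable. A minimal sketch, where the hostname and port are hypothetical and would be replaced with the values from the session’s connection URL:

```python
import socket


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    # Hypothetical node name and websockify port; take the real ones from the session.
    print(port_open("gpu-node01.cluster.example.edu", 5901))
```

A False result here would point at a firewall or network difference on the new nodes rather than at TurboVNC or websockify themselves.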
Thanks for any pointers you can provide!