WebSocket issue with interactive desktop

Hello,

I am seeing a strange error with the VNC interactive desktop application. The same job, submitted to two different nodes, both running the same image, succeeds on one and fails on the other. For the successful one, the VNC desktop tab opens properly and nothing is printed to the dev tools console in Chrome. For the failed one, I get a “Failed to connect to server” error and the following printed to the dev tools console:

websock.js:185 WebSocket connection to 'wss://ondemand.ccast.ndsu.edu/rnode/node0015.thunder.ccast/18314/websockify' failed: Error during WebSocket handshake: Unexpected response code: 503
open @ websock.js:185
rfb.js:255 WebSocket on-error event
(anonymous) @ rfb.js:255
rfb.js:718 Failed when connecting: Connection closed (code: 1006)
_fail @ rfb.js:718

As I mentioned, both of these nodes are running the same image, and are even the same hardware build. I did not find any hints in the job root directories as to what the problem might be. When we first started working with VNC through OOD, we had an issue related to CAS, but that has been resolved.

Any suggestions on how to troubleshoot this?

You’re looking at client-side errors. I’d look at the backend instead, in the logs of the job that failed. If I had to guess, I’d say the two nodes aren’t actually running the same image, or the images are slightly off.

What I mean is, you’re looking at the effect on the client side. The cause is that something on the backend didn’t boot up properly (websockify or the vncserver booted and then died, or never booted at all).
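A quick way to tell "never booted or already died" apart from other problems is to check whether anything is still listening on the websockify port. This is only a minimal sketch: `port_is_open` is a helper name made up for illustration, and it assumes you run it from a machine that can reach the compute node directly (for example the OnDemand web host). A 503 from the proxy is at least consistent with the port being closed, though it does not prove it.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection resolves the host name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name-resolution failures.
        return False
```

With the host and port from the failing URL, that would be `port_is_open("node0015.thunder.ccast", 18314)`. If it returns False while the job is still running, websockify is the first thing to look at in the job's output.log.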

Yes, you’re right. I should have been looking at the logs in the job directory. The issue seems to be resolved now, but here are the contents of the output.log file in the job directory (some paths and hostnames redacted):

Setting VNC password...
Starting VNC server...

Desktop 'TurboVNC: node0015:1 (user)' started on display node0015:1

Log file is vnc.log
Successfully started VNC server on node0015:5901...
Script starting...
Starting websocket server...
The system default contains no modules
  (env var: LMOD_SYSTEM_DEFAULT_MODULES is empty)
  No changes in loaded modules

Launching desktop 'xfce'...
WebSocket server settings:
  - Listen on :18314
  - Flash security policy server
  - No SSL/TLS support (no cert file)
  - Backgrounding (daemon)
Scanning VNC log file for user authentications...
Generating connection YAML file...
generating cookie with syscall
generating cookie with syscall
generating cookie with syscall
generating cookie with syscall

(xfce4-session:17235): xfce4-session-WARNING **: 14:50:01.476: xfsm_manager_load_session: Something wrong with /home/user/.cache/sessions/xfce4-session-node0015:1, Does it exist? Permissions issue?

(xfwm4:17252): xfwm4-WARNING **: 14:50:01.549: Error opening /dev/dri/card0: Permission denied
xfwm4: Fatal IO error 4 (Interrupted system call) on X server :1.0.

(xfsettingsd:17261): libxfce4ui-WARNING **: 14:58:35.680: ICE I/O Error

(xfsettingsd:17261): libxfce4ui-WARNING **: 14:58:35.680: Disconnected from session manager.
xfce4-panel: Fatal IO error 4 (Interrupted system call) on X server :1.0.
xfsettingsd: Fatal IO error 11 (Resource temporarily unavailable) on X server :1.0.
xfdesktop: Fatal IO error 11 (Resource temporarily unavailable) on X server :1.0.

I looked for the presence of the file mentioned – .cache/sessions/xfce4-session-node0015:1 – but did not find one.

Based on your experience, do you know why this sort of error occurs? Or have you even seen this one before? My guess would be something like a stale session not being correctly cleaned up, but I really have no idea.

Hopefully this information is a little more actionable.

Thanks,
Nick

Unfortunately, XFCE and X11 messages aren’t very actionable (not just the ones you provided, but in general). They mostly just tell you that something failed, and some of them, like those WARNING lines, can be disregarded.

What we’ve noticed is that a different dbus-launch binary in your $PATH can cause errors. Typically we see this come from a user’s Python conda environment, so that could be a place to start looking (or a system-wide conda environment, if you’re looking at different images).
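To check for a stray dbus-launch, one option is to list every copy found on $PATH in lookup order. The sketch below is an illustration only: `find_on_path` is a hypothetical helper, and it should be run in the same environment the job sees (same modules loaded, same conda activation), otherwise the answer may not match what the desktop session actually picks up.

```python
import os


def find_on_path(name, path=None):
    """Return every executable called `name` on PATH, in lookup order."""
    if path is None:
        path = os.environ.get("PATH", "")
    hits = []
    for directory in path.split(os.pathsep):
        if not directory:
            # Simplification: empty PATH entries (current directory) are skipped.
            continue
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            hits.append(candidate)
    return hits


if __name__ == "__main__":
    # The first line printed is the binary a bare `dbus-launch` would run.
    for hit in find_on_path("dbus-launch"):
        print(hit)
```

If the first hit lives under a conda or other user-installed prefix rather than a system directory, that is worth ruling out before digging further.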

Other things to try: run yum history on both nodes and see if the results differ. Compare the output of env on each. Compare the output of module list on each. I think you can see the pattern here: find out what’s different from one node to the other.
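For the env comparison, a small script can make the differences easier to spot than reading two dumps side by side. This is a rough sketch, assuming you have saved the output of `env` from a job on each node to a file; `parse_env` and `diff_env` are hypothetical helper names, and values that span several lines (such as exported shell functions) are not handled.

```python
def parse_env(text):
    """Parse the output of `env` (one NAME=value per line) into a dict."""
    env = {}
    for line in text.splitlines():
        # Split on the first "=" only, since values may contain "=" themselves.
        key, sep, value = line.partition("=")
        if sep:
            env[key] = value
    return env


def diff_env(good, bad):
    """Return {name: (good_value, bad_value)} for variables that differ.

    A value of None means the variable is not set on that side.
    """
    names = sorted(set(good) | set(bad))
    return {
        name: (good.get(name), bad.get(name))
        for name in names
        if good.get(name) != bad.get(name)
    }
```

Usage would look something like `diff_env(parse_env(open("env.good.txt").read()), parse_env(open("env.bad.txt").read()))`, where the two file names are placeholders. Expect some noise from variables that legitimately differ per job, such as host names and job IDs; PATH and library path differences are the ones worth a closer look.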