noVNC and shell session timeout after 1 minute

Hi,

I’m having an issue with sessions timing out sooner than expected. The websocket connection to the client appears to drop after about a minute of inactivity and then cannot be re-established. I’ve noticed that issue #142 mentions this.

Shell sessions close with a ‘connection terminated’ message, and trying to reconnect with noVNC returns an authentication error.

Any idea how I can fix this?

Thank you.

Using clusters->shell to access a login node, my sessions stick around idle until the 2-hour host-side TMOUT kicks them. Your problem has the feel of something on the node not liking your process and killing it. Is there anything in your batch controller logs that provides clues?

Also, are there any host or network firewalls that could be in play here?

Cheers,

Ric

I have iptables set to accept all traffic from the OnDemand server, firewalld is disabled. OnDemand is running inside a container so that could be causing some unknown issues. Our testing VM doesn’t have this issue when connecting to the same client.

App 7078 output: Listening on 3000
App 7078 output: Connection established
App 7078 output: Opened terminal: 7121
App 7078 output: Closed terminal: 7121
[ N 2021-08-10 15:07:52.7983 422/T4 age/Cor/CoreMain.cpp:1147 ]: Checking whether to disconnect long-running connections for process 6977, application /var/www/ood/apps/sys/dashboard (production)
[ N 2021-08-10 15:09:17.2787 422/T4 age/Cor/CoreMain.cpp:1147 ]: Checking whether to disconnect long-running connections for process 7078, application /var/www/ood/apps/sys/shell (production)
10/08/2021 09:21:39 Got connection from client 127.0.0.1
10/08/2021 09:21:39 Using protocol version 3.8
10/08/2021 09:21:39 Enabling TightVNC protocol extensions
10/08/2021 09:21:39 Advertising Tight auth cap 'VENCRYPT'
10/08/2021 09:21:39 Advertising Tight auth cap 'VNCAUTH_'
10/08/2021 09:21:39 Advertising Tight auth cap 'ULGNAUTH'
10/08/2021 09:21:39 rfbVncAuthProcessResponse: authentication failed from 127.0.0.1
10/08/2021 09:21:39 Client 127.0.0.1 gone
10/08/2021 09:21:39 Statistics:
10/08/2021 09:21:39   framebuffer updates 0, rectangles 0, bytes 0

VNC passwords are one-time use. Say you start the session and get password A: when you click through to noVNC, you’ll use password A. The next time you connect, you’ll need password B, which is generated and added to the ‘Connect to’ link. So refreshing the page won’t work; you need to click through the ‘Connect to’ button.

For shell, I can’t say for sure, so let’s focus on noVNC first. The log clearly shows auth errors. So maybe there’s a network storage sync problem with the container? For example, the overlay (the filesystem in the container) has written the new password B, but the underlay (the host file system, NFS or whatever storage you’re using) doesn’t see the new write?

I’m using autofs to mount user directories and store session data. It looks like a new password is generated in connection.yaml when I launch a new desktop window. But if I attempt to reconnect in noVNC, it gives an authentication error and no new password is generated.

What’s really strange, though, is that I can’t keep a session alive when I’m away from the keyboard for more than about 60 seconds. I think some sort of timeout is closing the websocket connection.

Okay thanks for the help! It took some digging but I’ve found a solution. If you add a heartbeat (ping/pong) function to the websocket server then the connection stays open.

For the shell application, I added the following lines (marked with +) to the index file at /var/www/ood/apps/sys/shell/app.js. They are taken from the official JavaScript ws package.

+function noop() {}

+function heartbeat() {
+  this.isAlive = true;
+}

wss.on('connection', function connection (ws, req) {
  var dir,
      term,
      args,
      host,
      cmd = process.env.OOD_SSH_WRAPPER || 'ssh';

+  ws.isAlive = true;
+  ws.on('pong', heartbeat);
   ...
});

+const interval = setInterval(function ping() {
+  wss.clients.forEach(function each(ws) {
+    if (ws.isAlive === false) return ws.terminate();
+
+    ws.isAlive = false;
+    ws.ping(noop);
+  });
+}, 30000);
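
To make the logic of the patch above easier to check without a live server, here is a minimal sketch of a single heartbeat pass run against stand-in clients. The names heartbeatSweep and makeFakeClient are hypothetical and are not part of app.js or the ws package; the fake client only imitates the isAlive, ping() and terminate() members that the patch relies on.

```javascript
// Hypothetical helper: one heartbeat pass over a set of WebSocket-like clients.
// Each client is assumed to expose isAlive, ping() and terminate(), as the
// sockets in the patch above do.
function heartbeatSweep(clients) {
  let terminated = 0;
  clients.forEach(function each(ws) {
    if (ws.isAlive === false) {
      // No pong arrived since the previous pass: treat the peer as dead.
      ws.terminate();
      terminated += 1;
      return;
    }
    // Mark as unverified; the 'pong' handler is expected to set it back to true.
    ws.isAlive = false;
    ws.ping();
  });
  return terminated;
}

// Stand-in client for a dry run without a network (not a real ws socket).
function makeFakeClient(respondsToPing) {
  return {
    isAlive: true,
    pings: 0,
    terminated: false,
    ping() {
      this.pings += 1;
      if (respondsToPing) this.isAlive = true; // simulates the 'pong' handler
    },
    terminate() {
      this.terminated = true;
    },
  };
}

const healthy = makeFakeClient(true);
const stale = makeFakeClient(false);
const clients = new Set([healthy, stale]);

heartbeatSweep(clients); // first pass: both are pinged, only `healthy` answers
heartbeatSweep(clients); // second pass: `stale` is terminated

console.log(healthy.terminated, stale.terminated);
```

With the 30000 ms interval used in the patch, each open connection carries a ping roughly every 30 seconds, which is presumably what keeps it under the idle cutoff of about a minute that I was hitting. A client that stops answering is dropped on the pass after its missed pong.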

Then for remote desktop, we added a --heartbeat=30 option to the websockify_cmd snippet in the vnc.rb file, line 135.

/opt/ood/ondemand/root/usr/share/gems/2.7/ondemand/2.0.13/gems/ood_core-0.17.2/lib/ood_core/batch_connect/templates/vnc.rb
# Launch websockify websocket server
echo "Starting websocket server..."
websocket=$(find_port)
#{websockify_cmd} -D ${websocket} --heartbeat=30 localhost:${port}

I didn’t find a fix for the missing password on reconnect like you mentioned, but noVNC should remain open until site_timeout, which is a major improvement.