LSF job termination on VNC logout

Hi All,
Why is the corresponding LSF job not terminated when I log out of an interactive desktop session? Logging out stops X and the VNC server on the worker node, but does not stop the corresponding LSF job. The job still shows up in OOD, but any attempt to reconnect to that same session fails.
(Screenshots omitted.)

It works as expected with Slurm, but not with LSF.
Does anyone have an idea?
Thanks and best regards

It’s unclear to me why that would be the case in LSF. Can you tell me which processes (if any) are still running on the compute node after you log out? I mean processes owned by you, from that job. A pstree from the top process (the one with the lowest PID) would help too.

It seems to me LSF thinks that something is still running. Why it thinks that is what we’d have to determine. I’m not familiar with how LSF works, for example whether there’s a daemon on the compute node. Maybe lsof 2>/dev/null | grep $USER | grep deleted (a list of all deleted files that still have an inode) will show something? That’s kind of guessing in the dark though; I’d hope the process list would show us more.
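A minimal sketch of how the requested process list could be gathered on the compute node. The function name list_user_procs is made up for illustration, and it assumes a procps-style ps:

```shell
#!/usr/bin/env bash
# List every process owned by a user as "PPID PID COMMAND",
# sorted by PID so the oldest (lowest-PID) process comes first.
list_user_procs() {
  local user="${1:-$(id -un)}"
  ps -u "$user" -o ppid=,pid=,args= | sort -k2,2n
}

# Usage on the compute node after logging out of the desktop:
#   list_user_procs "$USER"
#   pstree -ap <lowest-pid>    # pstree is part of the psmisc package
```

Anything that still shows parent PID 1 after logout is a candidate for what the scheduler may still be waiting on.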

This could be due to LSF tracking a VNC process that has not ended. You can issue the following command:

bjobs -l jobid

You should find a list of PIDs and PPIDs. On the host in question, look for each of them with the ‘ps -ef | grep pid’ command, substituting the actual PID.
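As a rough sketch, the check could be scripted like this once the PIDs have been copied by hand from the bjobs -l output. The helper name check_pids is hypothetical, and it uses ps -p rather than grep so that a PID does not match unrelated lines:

```shell
#!/usr/bin/env bash
# For each PID given, report whether a process with that PID still exists.
check_pids() {
  local pid info
  for pid in "$@"; do
    # ps -p prints nothing and exits non-zero when the PID does not exist.
    if info=$(ps -p "$pid" -o ppid=,pid=,args=) && [ -n "$info" ]; then
      echo "alive: $info"
    else
      echo "gone:  $pid"
    fi
  done
}

# Usage, with PIDs taken from `bjobs -l <jobid>`:
#   check_pids 115372 115430 115450
```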

With VNC, you have to be careful because it likes to daemonize processes, which LSF will continue to track. You can avoid this by preventing VNC from daemonizing the commands, which is done by enabling:

LSB_RESOURCE_ENFORCE="cpu gpu memory"
LSF_PROCESS_TRACKING=Y
LSF_LINUX_CGROUP_ACCT=Y

Add these to your lsf.conf, then restart everything. After that, however, VNC may not be happy.

If, on the other hand, ‘bjobs -l’ shows no PIDs or PPIDs, or those PIDs and PPIDs don’t exist, you may have stumbled on an LSF bug. In that case, restart sbatchd on the host, then open a ticket with support or apply the latest LSF service pack.

Larry

Hi,

What happens is that the Desktop app starts 2 websockify processes

> sashka   235378      1  0 16:05 ?        00:00:00 /usr/bin/python /bin/websockify -D 10592 localhost:5903
> sashka   235776 235378  0 16:05 ?        00:00:00 /usr/bin/python /bin/websockify -D 10592 localhost:5903

one of which remains running after logout and prevents the parent LSF job from exiting:

> sashka   235378      1  0 16:05 ?        00:00:00 /usr/bin/python /bin/websockify -D 10592 localhost:5903

Any ideas as to why that happens?

Thanks,
–Alex

I was just wondering if there’s any update on this.
We moved to OOD 1.8 but we still have this issue with LSF.
VNC is properly killed but websockify keeps running and the job doesn’t get killed.
After starting the bc_desktop app (columns are PPID, PID, command):

367818 115279 /lsf/10.1/linux3.10-glibc2.17-x86_64/etc/res -d /lsf/conf -m rkanc004is02 /home/maffiaa/.lsbatch/1607523224.412572
115279 115368 /bin/sh /home/maffiaa/.lsbatch/1607523224.412572
115368 115372 /bin/bash /home/maffiaa/.lsbatch/1607523224.412572.shell
     1 115404 /opt/TurboVNC/bin/Xvnc :1 -desktop TurboVNC: rkanc003is01:1 (maffiaa) -auth /home/maffiaa/.Xauthority -geometry 800x600 -depth 24 -rfbwait 120000 -rfbauth vnc.passwd -x509cert /home/maffiaa/.vnc/x509_cert.pem -x509key /home/maffiaa/.vnc/x509_private.pem -rfbport 5901 -fp catalogue:/etc/X11/fontpath.d -deferupdate 1 -dridir /usr/lib64/dri -registrydir /usr/lib64/xorg -idletimeout 0
115372 115430 bash /home/maffiaa/ec-hub/data/sys/dashboard/batch_connect/dev/bc_desktop/output/03b5de19-a899-4540-98cb-3fa582bf9328/script.sh
115372 115446 /bin/bash /home/maffiaa/.lsbatch/1607523224.412572.shell
115446 115448 /bin/bash /home/maffiaa/.lsbatch/1607523224.412572.shell
115448 115449 tail -f --pid=115430 vnc.log
     1 115450 /usr/bin/python /bin/websockify -D 23292 localhost:5901
     1 115458 dbus-launch --autolaunch cf648096f92e4bd689dca505bfde2ea6 --binary-syntax --close-stderr
     1 115459 /usr/bin/dbus-daemon --fork --print-pid 6 --print-address 8 --session
     1 115461 /usr/lib64/xfce4/xfconf/xfconfd
115430 115467 xfce4-session
     1 115470 /bin/dbus-launch --sh-syntax --exit-with-session xfce4-session
     1 115471 /usr/bin/dbus-daemon --fork --print-pid 6 --print-address 8 --session
     1 115476 /usr/lib64/xfce4/xfconf/xfconfd
115467 115478 xfwm4 --display :1.0 --sm-client-id 211e4a471-7c5f-48d7-87d4-6d25a172c7eb
115467 115480 xfce4-panel --display :1.0 --sm-client-id 24ed787fb-e753-4e0f-93d7-d0b725e01cca
     1 115481 xfsettingsd --display :1.0 --sm-client-id 25e79dc77-b78a-480c-a879-8d02501b59dc
115467 115485 xfdesktop --display :1.0 --sm-client-id 2a23afc15-7a4f-4cbb-ba9b-bb3160c5c004
     1 115492 /usr/libexec/gvfsd
115467 115497 abrt-applet
     1 115500 /usr/libexec/gvfsd-fuse /home/maffiaa/.gvfs -f -o big_writes
115467 115502 nm-applet
     1 115512 /usr/libexec/imsettings-daemon
115467 115520 /usr/bin/python /usr/share/system-config-printer/applet.py
115480 115530 /usr/lib64/xfce4/panel/wrapper-1.0 /usr/lib64/xfce4/panel/plugins/libsystray.so 6 14680094 systray Notification Area Area where notification icons appear
115480 115533 /usr/lib64/xfce4/panel/wrapper-1.0 /usr/lib64/xfce4/panel/plugins/libactions.so 2 14680095 actions Action Buttons Log out, lock or other system actions
     1 115536 /usr/libexec/gvfs-udisks2-volume-monitor
     1 115553 /usr/libexec/gvfs-mtp-volume-monitor
     1 115585 /usr/libexec/gvfs-gphoto2-volume-monitor
     1 115605 /usr/libexec/gvfs-afc-volume-monitor
     1 115658 /usr/libexec/at-spi-bus-launcher
115658 115675 /usr/bin/dbus-daemon --config-file=/usr/share/defaults/at-spi2/accessibility.conf --nofork --print-address 3
     1 115679 /usr/libexec/at-spi2-registryd --use-gnome-session
115492 115682 /usr/libexec/gvfsd-trash --spawner :1.11 /org/gtk/gvfs/exec_spaw/0
     1 115711 /usr/libexec/gvfsd-metadata
115512 115800 /usr/bin/ibus-daemon -r --xim
     1 115802 /usr/libexec/dconf-service
115800 115808 /usr/libexec/ibus-dconf
115800 115809 /usr/libexec/ibus-ui-gtk3
     1 115811 /usr/libexec/ibus-x11 --kill-daemon
     1 115815 /usr/libexec/ibus-portal
115800 115827 /usr/libexec/ibus-engine-simple

After pressing on “launch desktop” we have the exact same processes plus:
115450 119722 /usr/bin/python /bin/websockify -D 23292 localhost:5901
So that’s the second websockify process being started.

After logout:
1 115450 /usr/bin/python /bin/websockify -D 23292 localhost:5901

So the first websockify process is still alive after the logout.
After killing the job the process gets killed as well.
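To spot such leftovers quickly, a listing in the same “PPID PID COMMAND” layout can be filtered for websockify processes whose parent is PID 1. This is only a sketch, and the function name orphaned_websockify is made up for illustration:

```shell
#!/usr/bin/env bash
# Read "PPID PID COMMAND..." lines on stdin and print the PID of every
# websockify process whose parent is PID 1 (a daemonized or orphaned one).
orphaned_websockify() {
  awk '$1 == 1 && /websockify/ { print $2 }'
}

# Live usage on a compute node (assumes a procps-style ps):
#   ps -u "$USER" -o ppid=,pid=,args= | orphaned_websockify
```

Fed the listings above, it would report only 115450, since 119722 has 115450 as its parent rather than PID 1.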

I have also added “set -x” to the script to see the commands that get executed. The “clean_up” function from the vnc template:

clean_up () {
  echo "Cleaning up..."
  [[ -e "/home/maffiaa/ec-hub/data/sys/dashboard/batch_connect/dev/bc_desktop/output/6948fa9c-ee05-47c5-b2ed-bb05aebbb2f8/clean.sh" ]] && source "/home/maffiaa/ec-hub/data/sys/dashboard/batch_connect/dev/bc_desktop/output/6948fa9c-ee05-47c5-b2ed-bb05aebbb2f8/clean.sh"

  vncserver -list | awk '/^:/{system("kill -0 "$2" 2>/dev/null || vncserver -kill "$1)}'
  [[ -n ${display} ]] && vncserver -kill :${display}

  [[ ${SCRIPT_PID} ]] && pkill -P ${SCRIPT_PID} || :
  pkill -P $$
  exit ${1:-0}
}

generates this output:

+ clean_up
+ echo 'Cleaning up...'
Cleaning up...
+ [[ -e /home/maffiaa/ec-hub/data/sys/dashboard/batch_connect/dev/bc_desktop/output/03b5de19-a899-4540-98cb-3fa582bf9328/clean.sh ]]
+ source /home/maffiaa/ec-hub/data/sys/dashboard/batch_connect/dev/bc_desktop/output/03b5de19-a899-4540-98cb-3fa582bf9328/clean.sh
+ vncserver -list
+ awk '/^:/{system("kill -0 "$2" 2>/dev/null || vncserver -kill "$1)}'
+ [[ -n 1 ]]
+ vncserver -kill :1
Killing Xvnc process ID 115404
Gdk-Message: 14:56:51.046: nm-applet: Fatal IO error 11 (Resource temporarily unavailable) on X server :1.0.

+ [[ -n 115430 ]]
+ pkill -P 115430
+ :
+ pkill -P 115372
+ exit 0

So here you can also see the PIDs of the processes that get killed in the cleanup phase:

     1 115404 /opt/TurboVNC/bin/Xvnc :1 -desktop TurboVNC: rkanc003is01:1 (maffiaa) -auth /home/maffiaa/.Xauthority -geometry 800x600 -depth 24 -rfbwait 120000 -rfbauth vnc.passwd -x509cert /home/maffiaa/.vnc/x509_cert.pem -x509key /home/maffiaa/.vnc/x509_private.pem -rfbport 5901 -fp catalogue:/etc/X11/fontpath.d -deferupdate 1 -dridir /usr/lib64/dri -registrydir /usr/lib64/xorg -idletimeout 0
115372 115430 bash /home/maffiaa/ec-hub/data/sys/dashboard/batch_connect/dev/bc_desktop/output/03b5de19-a899-4540-98cb-3fa582bf9328/script.sh
115368 115372 /bin/bash /home/maffiaa/.lsbatch/1607523224.412572.shell
  1. TurboVNC has “1” as parent but gets killed directly by the cleanup (pid 115404)
  2. script.sh gets killed (pid 115430)
  3. script.sh’s parent, that is the LSF shell, gets killed (pid 115372)

The second websockify (pid 119722) gets cleaned up, but I have no idea when.
The first websockify process:
1 115450 /usr/bin/python /bin/websockify -D 23292 localhost:5901
with “1” as parent stays until the job is killed.
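One likely reason, consistent with the trace above: pkill -P signals only direct children of the given PID, and a process that has been reparented is no longer a child of the script or the LSF shell. The following sketch reproduces the effect with a plain sleep instead of websockify (the variable names are illustrative), using pgrep -P, which applies the same parent filter as pkill -P:

```shell
#!/usr/bin/env bash
# Start a background process from a subshell that exits immediately, so the
# process loses its original parent (similar in effect to daemonizing).
orphan_pid=$( (sleep 30 >/dev/null 2>&1 & echo "$!") )

# The parent is no longer this shell ($$); on most systems it is PID 1
# or a "subreaper" such as a systemd user instance.
orphan_ppid=$(ps -o ppid= -p "$orphan_pid" | tr -d ' ')
echo "this shell: $$, orphan's parent: $orphan_ppid"

# pgrep -P lists direct children only, so the orphan is not among them,
# and pkill -P $$ would not signal it either.
if pgrep -P "$$" | grep -qx "$orphan_pid"; then
  matched=yes
else
  matched=no
fi
echo "matched by -P $$: $matched"

kill "$orphan_pid"   # clean up the demo process
```

This would explain why both pkill -P calls in clean_up return without touching websockify -D.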

I hope that’s enough info to tell whether the issue can be solved just by adding this LSF config:

LSB_RESOURCE_ENFORCE="cpu gpu memory"
LSF_PROCESS_TRACKING=Y
LSF_LINUX_CGROUP_ACCT=Y

or whether we need to do anything else.

Hi. Sorry for the delay. I was actually able to replicate this on our Torque/Moab cluster, though the process with parent PID 1 (115450 in your example above) was reaped before I could message my admin to look into it. I’d say it lived for maybe 2 minutes after the job completed. Are you saying that the process continues to live forever?

Yes. The full list of my processes after I log out from the desktop is:

root       6912  89217 sshd: maffiaa [priv]
maffiaa   89217  89232 sshd: maffiaa@pts/0
maffiaa   89232  89233 -bash
maffiaa       1 115450 /usr/bin/python /bin/websockify -D 23292 localhost:5901
maffiaa   89233 120431 ps -ax -o user,ppid,pid,command
maffiaa   89233 120432 grep --color=auto maffiaa

So, to recap, after logging out from the desktop app:

  1. The only process still alive is the websockify with parent PID 1
  2. The process stays there until the desktop timeout (the one defined when starting the app) is reached, and it gets cleaned up when the job is killed
  3. The bjobs command shows the job as still running

This seems related to the answer from adamsla: it looks like LSF doesn’t kill the websockify process when it kills the shell process that started it.
The same problem may apply to the TurboVNC process as well, but that one is killed by the “vncserver -kill” command.
Actually, many of the other processes that are probably started by the “xfce4-session” command have parent PID 1 as well, but they get cleaned up. Maybe, since they are started within “script.sh”, they get cleaned up when you kill it. But that’s just a guess.
Do you think this is LSF-specific behaviour?

So I did something like this (it still needs cleaning up) in my clean.sh:

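#!/usr/bin/env bash
echo "Running cleaning..."
# Find the daemonized websockify for this session: same port arguments, parent PID 1.
WEBSK_PID=$(ps -u "$USER" -o ppid,pid,command | grep "D ${websocket} localhost:${port}" | awk '$1=="1" {print $2}')
# Only call kill when a PID was found; a bare "kill" would fail with a usage error.
[[ -n "${WEBSK_PID}" ]] && kill ${WEBSK_PID}
echo "Exit cleaning"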

So I get ${USER}’s processes and grep for the websockify command using the same environment variables that were used to start websockify (to be sure I don’t kill other processes running on the same host, even ones from the same user).
I take the PID of the process with “1” as its parent PID and run “kill” against it.
This seems to work. Do you think that’s “good enough”? Is there a better solution?
I tried putting the same code in “submit.yml” at the end of the “script_wrapper”, but the “clean_up” function runs “exit” at the end, so nothing after that function gets executed (that is, nothing after ‘%s’ in the script_wrapper). I also looked into overriding “vnc_clean” or even “clean_script”, but I’m not sure how to do it or whether that’s preferred.
Any idea?
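One possible alternative, offered only as a sketch: pgrep can filter by owner, parent PID, and full command line in a single call, which avoids the ps | grep | awk pipeline. The function names find_orphans and kill_orphans are made up for illustration, and the approach assumes the orphan really is reparented to PID 1, as in the listings in this thread:

```shell
#!/usr/bin/env bash
# Print PIDs of the current user's processes that have parent PID 1 and whose
# full command line matches the given extended-regex pattern.
find_orphans() {
  pgrep -u "$(id -u)" -P 1 -f "$1" || true   # pgrep exits 1 when nothing matches
}

# Terminate whatever find_orphans reports; do nothing when the list is empty.
kill_orphans() {
  local pids
  pids=$(find_orphans "$1")
  if [ -n "$pids" ]; then
    # Word splitting is intended here: pgrep prints one PID per line.
    kill $pids
  else
    echo "No matching orphaned process"
  fi
}

# Intended use inside clean.sh, where ${websocket} and ${port} are already set:
#   kill_orphans "websockify -D ${websocket} localhost:${port}\$"
```

The pattern is a regular expression, so the trailing `$` in the usage line keeps port 5901 from also matching 59010.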

Just one more question about the websockify commands.
It seems the “Desktop app” starts the “main” websockify process, and every time you press “launch Desktop” a new websockify process is started with the main websockify as its parent.
All those browser sessions are in sync (anything I do in one tab shows up in the others). That’s probably the intended behaviour.
The question is about “exit”: from the desktop it is possible to log out, restart, or shut down, but all of those actions seem to call “clean_up”. Is that expected, or is there a way to log out and log back in later?
It seems that just closing the browser tab kills the “second” websockify process, and it can be launched again. If that’s how users are meant to interact with the Desktop app, is this documented anywhere (I may have missed it)?
Thanks again for your help

Good afternoon. I noticed a bunch of processes on our nodes that fit a similar issue.
No jobs are running, but there are a bunch of “python -m websockify -D 44864 localhost:5901” processes still running.
Was a fix ever implemented to clean this up?

Thanks,

@charles8ronson
In our case our submit.yml looks like:

batch_connect:
  template: vnc
...
  vnc_clean: |
    WEBSK_PID=`ps -u $USER -o ppid,pid,command | grep "D ${websocket} localhost:${port}" | awk '$1=="1" {print $2}'`
    echo "Clean websockify processes"
    [[ ${WEBSK_PID} ]] && kill ${WEBSK_PID} || echo "No Websockify to Kill"
    vncserver -list | awk '/^:/{system("kill -0 "$2" 2>/dev/null || vncserver -kill "$1)}'
    echo "Clean geoclue"
    GEOCLUE_PID=`ps -u $USER -o ppid,pid,command | grep "/usr/libexec/geoclue-2.0/demos/agent" | awk '$1=="1" {print $2}'`
    [[ ${GEOCLUE_PID} ]] && kill ${GEOCLUE_PID} || echo "No Geoclue to Kill"

The geoclue process cleanup may not be needed in your case.