Interactive-desktop immediately completed

Hello.
I’m trying to run an xfce interactive desktop on the same node hosting the OOD portal.
The job is correctly submitted, the VNC server is started, I can manually connect to it using the password of the connection.yml file, but as soon as I submit the job I receive the message successful submitted and state Completed. Is a certificate strictly required?
Following an excerpt of the output.log.
Any suggestion?

Thanks,
Luigi


Log file is vnc.log
Successfully started VNC server on crescof-nvi1.frascati.enea.it:5907…
Script starting…
Starting websocket server…

Launching desktop ‘xfce’…
generating cookie with syscall
generating cookie with syscall
generating cookie with syscall
generating cookie with syscall

(xfce4-session:16868): xfce4-session-WARNING **: 09:35:39.876: xfsm_manager_load_session: Something wrong with /afs/enea.it/fra/user/lbuccif/.cache/sessions/xfce4-session-crescof-nvi1.frascati.enea.it:7, Does it exist? Permissions issue?
env: xfwm4: No such file or directory
env: Thunar: No such file or directory
env: xfdesktop: No such file or directory
SELinux Troubleshooter: Applet requires SELinux be enabled to run.
vmware-user: could not open /proc/fs/vmblock/dev
WebSocket server settings:

  • Listen on :56154
  • No SSL/TLS support (no cert file)
  • Backgrounding (daemon)
    Scanning VNC log file for user authentications…
    Generating connection YAML file…
    /usr/share/system-config-printer/applet.py:44: PyGIWarning: Notify was imported without specifying a version first. Use gi.require_version(‘Notify’, ‘0.7’) before import to ensure that the right version gets loaded.
    from gi.repository import Notify
    system-config-printer-applet: failed to start NewPrinterNotification service
    system-config-printer-applet: failed to start PrinterDriversInstaller service: org.freedesktop.DBus.Error.AccessDenied: Connection “:1.1660” is not allowed to own the service “com.redhat.PrinterDriversInstaller” due to security policies in the configuration file
    xfce4-panel: No window manager registered on screen 0. To start the panel without this check, run with --disable-wm-check.
    xfsettingsd: No window manager registered on screen 0.

(xfsettingsd:16911): xfsettingsd-WARNING **: 09:35:45.046: Failed to get the _NET_NUMBER_OF_DESKTOPS property.
User defined signal 2
User defined signal 2

It would seem that you’re missing a lot of XFCE4 binaries.

Hi Jeff,
I installed the missing binaries and the above errors are resolved. However, I don’t understand why when I start a job immediately appears the status Completed and I don’t see the start button, the compression sliders, the image quality, the labels number of nodes, cores… only in one case did it happen but NoVNC gave me a connection error

Thanks a lot,
Luigi

OK, what are the error logs now?

It’s because the job ends before you can connect to it. These are shell scripts after all. At some point, it’ll block because you booted the desktop. But if the desktop binary immediately fails (i.e., it doesn’t block), the shell script is done executing and the job is complete.

The situation is as follows: the job is sent and the desktop remains active for the entire duration (1 hour, after this time the output log is produced). On the OnDemand portal it immediately assumes the Completed status, but is running on the compute node (if I connect with vncviewer and the parameters of the Connection.yml file it is active and responds). Similar behavior for gnome and kde, with a different output.
Note: the OOD portal host is the same of the compute node, I don’t Know if this could be a problem.

The last lines of the output.log for xfce:

WebSocket server settings:

  • Listen on :36285
  • No SSL/TLS support (no cert file)
  • Backgrounding (daemon)
    Scanning VNC log file for user authentications…
    Generating connection YAML file…
    generating cookie with syscall
    generating cookie with syscall
    generating cookie with syscall
    generating cookie with syscall

(xfce4-session:12513): xfce4-session-WARNING **: 17:44:19.598: xfsm_manager_load_session: Something wrong with /afs/enea.it/por/user/lbuccip/.cache/sessions/xfce4-session-crescof-nvi1.frascati.enea.it:6, Does it exist? Permissions issue?
SELinux Troubleshooter: Applet requires SELinux be enabled to run.
vmware-user: could not open /proc/fs/vmblock/dev
/usr/share/system-config-printer/applet.py:44: PyGIWarning: Notify was imported without specifying a version first. Use gi.require_version(‘Notify’, ‘0.7’) before import to ensure that the right version gets loaded.
from gi.repository import Notify
system-config-printer-applet: failed to start NewPrinterNotification service
system-config-printer-applet: failed to start PrinterDriversInstaller service: org.freedesktop.DBus.Error.AccessDenied: Connection “:1.10860” is not allowed to own the service “com.redhat.PrinterDriversInstaller” due to security policies in the configuration file
User defined signal 2

OK then I would check your /var/log/ondemand-nginx/$USER/error.log to see if there’s any information here.

What’s happening on the OOD side is you POST the request to submit the job. We do that. Then you hit the index page of all the interactive sessions and we query for that job id (squeue for Slurm, not sure what your scheduler is).

At this point if squeue or similar fails OOD will say the job is completed because squeue or similar won’t return anything for completed jobs, so we assume the job is complete if you can’t actually query for the job id.

I’d wonder what squeue <jobid> returns on that machine for the job when it’s in fact running.

Hi Jeff,
I submit on LSF, AFS file system with Kerberos authentication, following some status information:

/var/log/ondemand-nginx/lbuccip/error.log

App 24493 output: [2024-03-06 10:17:52 +0100 ] INFO “execve = ["git", "describe", "–always", "–tags"]”
App 24493 output: [2024-03-06 10:17:52 +0100 ] INFO “method=GET path=/pun/sys/dashboard/batch_connect/sys/bc_desktop/form/session_contexts/new format=html controller=BatchConnect::SessionContextsController action=new status=200 duration=93.03 view=67.89”
App 24493 output: [2024-03-06 10:18:02 +0100 ] INFO “execve = [{"LSF_BINDIR"=>"/afs/enea.it/software/lsf/10.1/frascati/10.1/linux3.10-glibc2.17-x86_64/bin", "LSF_LIBDIR"=>"/afs/enea.it/software/lsf/10.1/frascati/10.1/linux3.10-glibc2.17-x86_64/lib", "LSF_ENVDIR"=>"/afs/enea.it/software/lsf/10.1/frascati/conf", "LSF_SERVERDIR"=>"/afs/enea.it/software/lsf/10.1/frascati/10.1/linux3.10-glibc2.17-x86_64/etc"}, "ssh", "-p", "22", "-o", "BatchMode=yes", "-o", "UserKnownHostsFile=/dev/null", "-o", "StrictHostKeyChecking=no", "crescof-nvi1.frascati.enea.it", "export LSF_BINDIR=/afs/enea.it/software/lsf/10.1/frascati/10.1/linux3.10-glibc2.17-x86_64/bin;", "export LSF_LIBDIR=/afs/enea.it/software/lsf/10.1/frascati/10.1/linux3.10-glibc2.17-x86_64/lib;", "export LSF_ENVDIR=/afs/enea.it/software/lsf/10.1/frascati/conf;", "export LSF_SERVERDIR=/afs/enea.it/software/lsf/10.1/frascati/10.1/linux3.10-glibc2.17-x86_64/etc;", "/afs/enea.it/software/lsf/10.1/frascati/10.1/linux3.10-glibc2.17-x86_64/bin/bsub", "-P", "lbuccip", "-cwd", "/afs/enea.it/por/user/lbuccip/ondemand/data/sys/dashboard/batch_connect/sys/bc_desktop/form/output/49f5fca0-b9c7-457f-9b69-0716f046597b", "-J", "sys/dashboard/sys/bc_desktop/form", "-q", "interactive", "-W", "60", "-L", "/bin/bash", "-o", "/afs/enea.it/por/user/lbuccip/ondemand/data/sys/dashboard/batch_connect/sys/bc_desktop/form/output/49f5fca0-b9c7-457f-9b69-0716f046597b/output.log", "-n", "4"]”
App 24493 output: [2024-03-06 10:18:02 +0100 ] INFO “method=POST path=/pun/sys/dashboard/batch_connect/sys/bc_desktop/form/session_contexts format=html controller=BatchConnect::SessionContextsController action=create status=302 duration=955.54 view=0.00 location=https://crescof-in-gui.frascati.enea.it:453/pun/sys/dashboard/batch_connect/sessions
App 24493 output: [2024-03-06 10:18:02 +0100 ] INFO “execve = [{"LSF_BINDIR"=>"/afs/enea.it/software/lsf/10.1/frascati/10.1/linux3.10-glibc2.17-x86_64/bin", "LSF_LIBDIR"=>"/afs/enea.it/software/lsf/10.1/frascati/10.1/linux3.10-glibc2.17-x86_64/lib", "LSF_ENVDIR"=>"/afs/enea.it/software/lsf/10.1/frascati/conf", "LSF_SERVERDIR"=>"/afs/enea.it/software/lsf/10.1/frascati/10.1/linux3.10-glibc2.17-x86_64/etc"}, "ssh", "-p", "22", "-o", "BatchMode=yes", "-o", "UserKnownHostsFile=/dev/null", "-o", "StrictHostKeyChecking=no", "crescof-nvi1.frascati.enea.it", "export LSF_BINDIR=/afs/enea.it/software/lsf/10.1/frascati/10.1/linux3.10-glibc2.17-x86_64/bin;", "export LSF_LIBDIR=/afs/enea.it/software/lsf/10.1/frascati/10.1/linux3.10-glibc2.17-x86_64/lib;", "export LSF_ENVDIR=/afs/enea.it/software/lsf/10.1/frascati/conf;", "export LSF_SERVERDIR=/afs/enea.it/software/lsf/10.1/frascati/10.1/linux3.10-glibc2.17-x86_64/etc;", "/afs/enea.it/software/lsf/10.1/frascati/10.1/linux3.10-glibc2.17-x86_64/bin/bjobs", "-a", "-w", "-W", "71021"]”
App 24493 output: [2024-03-06 10:18:03 +0100 ] INFO “method=GET path=/pun/sys/dashboard/batch_connect/sessions format=html controller=BatchConnect::SessionsController action=index status=200 duration=477.52 view=41.30”
App 24493 output: [2024-03-06 10:18:13 +0100 ] INFO “method=GET path=/pun/sys/dashboard/batch_connect/sessions.js format=js controller=BatchConnect::SessionsController action=index status=200 duration=23.77 view=14.14”
App 24493 output: [2024-03-06 10:18:23 +0100 ] INFO “method=GET path=/pun/sys/dashboard/batch_connect/sessions.js format=js controller=BatchConnect::SessionsController action=index status=200 duration=19.75 view=10.52”
App 24493 output: [2024-03-06 10:18:34 +0100 ] INFO “method=GET path=/pun/sys/dashboard/batch_connect/sessions.js format=js controller=BatchConnect::SessionsController action=index status=200 duration=20.05 view=10.70”
App 24493 output: [2024-03-06 10:18:34 +0100 ] INFO “method=GET path=/pun/sys/dashboard/apps/show/activejobs format=html controller=AppsController action=show status=302 duration=9.32 view=0.00 location=https://crescof-in-gui.frascati.enea.it:453/pun/sys/dashboard/activejobs
App 24493 output: [2024-03-06 10:18:34 +0100 ] INFO “method=GET path=/pun/sys/dashboard/activejobs format=html controller=ActiveJobsController action=index status=200 duration=22.47 view=15.07”
App 24493 output: [2024-03-06 10:18:34 +0100 ] INFO “execve = [{"LSF_BINDIR"=>"/afs/enea.it/software/lsf/10.1/frascati/10.1/linux3.10-glibc2.17-x86_64/bin", "LSF_LIBDIR"=>"/afs/enea.it/software/lsf/10.1/frascati/10.1/linux3.10-glibc2.17-x86_64/lib", "LSF_ENVDIR"=>"/afs/enea.it/software/lsf/10.1/frascati/conf", "LSF_SERVERDIR"=>"/afs/enea.it/software/lsf/10.1/frascati/10.1/linux3.10-glibc2.17-x86_64/etc"}, "ssh", "-p", "22", "-o", "BatchMode=yes", "-o", "UserKnownHostsFile=/dev/null", "-o", "StrictHostKeyChecking=no", "crescof-nvi1.frascati.enea.it", "export LSF_BINDIR=/afs/enea.it/software/lsf/10.1/frascati/10.1/linux3.10-glibc2.17-x86_64/bin;", "export LSF_LIBDIR=/afs/enea.it/software/lsf/10.1/frascati/10.1/linux3.10-glibc2.17-x86_64/lib;", "export LSF_ENVDIR=/afs/enea.it/software/lsf/10.1/frascati/conf;", "export LSF_SERVERDIR=/afs/enea.it/software/lsf/10.1/frascati/10.1/linux3.10-glibc2.17-x86_64/etc;", "/afs/enea.it/software/lsf/10.1/frascati/10.1/linux3.10-glibc2.17-x86_64/bin/bjobs", "-u", "lbuccip", "-a", "-w", "-W"]”
App 24493 output: [2024-03-06 10:18:35 +0100 ] DEBUG “wrote jobs to stream: 3”
App 24493 output: [2024-03-06 10:18:35 +0100 ] INFO “method=GET path=/pun/sys/dashboard/activejobs.json format=json controller=ActiveJobsController action=index status=200 duration=502.87 view=0.00”
App 24493 output: [2024-03-06 10:18:38 +0100 ] INFO “execve = [{"LSF_BINDIR"=>"/afs/enea.it/software/lsf/10.1/frascati/10.1/linux3.10-glibc2.17-x86_64/bin", "LSF_LIBDIR"=>"/afs/enea.it/software/lsf/10.1/frascati/10.1/linux3.10-glibc2.17-x86_64/lib", "LSF_ENVDIR"=>"/afs/enea.it/software/lsf/10.1/frascati/conf", "LSF_SERVERDIR"=>"/afs/enea.it/software/lsf/10.1/frascati/10.1/linux3.10-glibc2.17-x86_64/etc"}, "ssh", "-p", "22", "-o", "BatchMode=yes", "-o", "UserKnownHostsFile=/dev/null", "-o", "StrictHostKeyChecking=no", "crescof-nvi1.frascati.enea.it", "export LSF_BINDIR=/afs/enea.it/software/lsf/10.1/frascati/10.1/linux3.10-glibc2.17-x86_64/bin;", "export LSF_LIBDIR=/afs/enea.it/software/lsf/10.1/frascati/10.1/linux3.10-glibc2.17-x86_64/lib;", "export LSF_ENVDIR=/afs/enea.it/software/lsf/10.1/frascati/conf;", "export LSF_SERVERDIR=/afs/enea.it/software/lsf/10.1/frascati/10.1/linux3.10-glibc2.17-x86_64/etc;", "/afs/enea.it/software/lsf/10.1/frascati/10.1/linux3.10-glibc2.17-x86_64/bin/bjobs", "-a", "-w", "-W", "71021"]”
App 24493 output: [2024-03-06 10:18:38 +0100 ] INFO “method=GET path=/pun/sys/dashboard/activejobs/json format=json controller=ActiveJobsController action=json status=200 duration=427.63 view=3.09”
App 24493 output: [2024-03-06 10:23:07 +0100 ] INFO “method=GET path=/pun/sys/dashboard/logout format=html controller=DashboardController action=logout status=200 duration=23.11 view=15.13”
[ N 2024-03-06 10:28:07.6877 24412/T3 age/Cor/CoreMain.cpp:1147 ]: Checking whether to disconnect long-running connections for process 24493, application /var/www/ood/apps/sys/dashboard (production)
App 8139 output: [2024-03-06 11:23:34 +0100 ] WARN “Error opening MOTD at \nException: bad URI(is not URI?): nil”
App 8139 output: [2024-03-06 11:23:34 +0100 ] INFO “method=GET path=/pun/sys/dashboard/ format=html controller=DashboardController action=index status=200 duration=200.38 view=147.87”
App 8139 output: [2024-03-06 11:23:37 +0100 ] INFO “method=GET path=/pun/sys/dashboard/batch_connect/sessions format=html controller=BatchConnect::SessionsController action=index status=200 duration=44.78 view=34.92”
App 8139 output: [2024-03-06 11:23:40 +0100 ] INFO “method=GET path=/pun/sys/dashboard/files/fs/afs/enea.it/por/user/lbuccip/ondemand/data/sys/dashboard/batch_connect/sys/bc_desktop/form/output/49f5fca0-b9c7-457f-9b69-0716f046597b format=html controller=FilesController action=fs status=200 duration=47.34 view=34.46”
App 8139 output: [2024-03-06 11:23:40 +0100 ] INFO “method=GET path=/pun/sys/dashboard/files/fs/afs/enea.it/por/user/lbuccip/ondemand/data/sys/dashboard/batch_connect/sys/bc_desktop/form/output/49f5fca0-b9c7-457f-9b69-0716f046597b format=html controller=FilesController action=fs status=200 duration=49.09 view=23.87”
App 8139 output: [2024-03-06 11:23:47 +0100 ] INFO “method=GET path=/pun/sys/dashboard/batch_connect/sessions.js format=js controller=BatchConnect::SessionsController action=index status=200 duration=22.30 view=13.38”
App 8139 output: [2024-03-06 11:23:52 +0100 ] INFO “method=GET path=/pun/sys/dashboard/files/fs/afs/enea.it/por/user/lbuccip/ondemand/data/sys/dashboard/batch_connect/sys/bc_desktop/form/output/49f5fca0-b9c7-457f-9b69-0716f046597b/output.log format=html controller=FilesController action=fs status=200 duration=22.65 view=0.00”
App 8139 output: [2024-03-06 11:23:58 +0100 ] INFO “method=GET path=/pun/sys/dashboard/batch_connect/sessions.js format=js controller=BatchConnect::SessionsController action=index status=200 duration=61.77 view=38.85”
App 8139 output: [2024-03-06 11:24:09 +0100 ] INFO “method=GET path=/pun/sys/dashboard/batch_connect/sessions.js format=js controller=BatchConnect::SessionsController action=index status=200 duration=19.72 view=10.64”
App 8139 output: [2024-03-06 11:24:19 +0100 ] INFO “method=GET path=/pun/sys/dashboard/batch_connect/sessions.js format=js controller=BatchConnect::SessionsController action=index status=200 duration=19.76 view=10.65”

command: bjobs 71021

<lbuccip@crescof-nvi1 ~> bjobs 71021
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
71021 lbuccip RUN interactiv crescof-in- 4*crescof-i *ktop/form Mar 6 10:18

myjobs panel:

Here’s the logline where it’s querying for that job. I wonder what the next line wrote jobs to stream: 3 means? STDOUT and STDERR are streams 1 and 2 respectively, so I’d ask why it’s writing to some other stream?

I don’t know why in this run the stream is set to 3 … generally It’s set to 1, but I don’t think that’s the reason.
A new run gives this output:


App 32232 output: [2024-03-06 15:59:25 +0100 ] INFO “execve = [{"LSF_BINDIR"=>"/afs/enea.it/software/lsf/10.1/frascati/10.1/linux3.10-glibc2.17-x86_64/bin", "LSF_LIBDIR"=>"/afs/enea.it/software/lsf/10.1/frascati/10.1/linux3.10-glibc2.17-x86_64/lib", "LSF_ENVDIR"=>"/afs/enea.it/software/lsf/10.1/frascati/conf", "LSF_SERVERDIR"=>"/afs/enea.it/software/lsf/10.1/frascati/10.1/linux3.10-glibc2.17-x86_64/etc"}, "ssh", "-p", "22", "-o", "BatchMode=yes", "-o", "UserKnownHostsFile=/dev/null", "-o", "StrictHostKeyChecking=no", "crescof-nvi1.frascati.enea.it", "export LSF_BINDIR=/afs/enea.it/software/lsf/10.1/frascati/10.1/linux3.10-glibc2.17-x86_64/bin;", "export LSF_LIBDIR=/afs/enea.it/software/lsf/10.1/frascati/10.1/linux3.10-glibc2.17-x86_64/lib;", "export LSF_ENVDIR=/afs/enea.it/software/lsf/10.1/frascati/conf;", "export LSF_SERVERDIR=/afs/enea.it/software/lsf/10.1/frascati/10.1/linux3.10-glibc2.17-x86_64/etc;", "/afs/enea.it/software/lsf/10.1/frascati/10.1/linux3.10-glibc2.17-x86_64/bin/bjobs", "-a", "-w", "-W", "71099"]”
App 32232 output: [2024-03-06 15:59:25 +0100 ] INFO “method=GET path=/pun/sys/dashboard/batch_connect/sessions format=html controller=BatchConnect::SessionsController action=index status=200 duration=457.21 view=25.24”
App 32232 output: [2024-03-06 15:59:35 +0100 ] INFO “method=GET path=/pun/sys/dashboard/batch_connect/sessions.js format=js controller=BatchConnect::SessionsController action=index status=200 duration=18.75 view=10.13”

The job 71099 is still running but in My Interactive Session window It results to be Completed:

If you try to run that command directly I wonder what the output is. Seems we’re not parsing the output correctly.

I would try to replicate the same exact ssh command by hand (and issue it on the same machine) and see what it’s returning.

Maybe we are close to the cause of the problem.

The problem:

<lbucci@crescof-nvi1 ~> ssh -p 22 -o BatchMode=yes -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no crescof-nvi1.frascati.enea.it /afs/enea.it/software/lsf/10.1/frascati/10.1/linux3.10-glibc2.17-x86_64/bin/bjobs -a -w -W 71115
LSF is down. Please wait …
LSF is down. Please wait …
LSF is down. Please wait …

but if we insert a sleep command to give LSF time to wake up we will have:

a possible solution

<lbucci@crescof-nvi1 ~> ssh -p 22 -o BatchMode=yes -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no crescof-nvi1.frascati.enea.it sleep 60 && /afs/enea.it/software/lsf/10.1/frascati/10.1/linux3.10-glibc2.17-x86_64/bin/bjobs -a -w -W 71115
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME PROJ_NAME CPU_USED MEM SWAP PIDS START_TIME FINISH_TIME SLOTS
71115 lbucci RUN interactive crescof-in-gui crescof-in-gui sys/dashboard/sys/bc_desktop/form 03/06-17:29:13 lbucci 000:00:15.00 134 0 30279,30526,30531,30800,30812,30862,30901,30902,30905,30911,30914,30915,30919,31195,31197,31205,31210,31211,31217,31228,31255,31277,31279,31443,31447,31448,31450,31452,31465,31469,31484,31485 03/06-17:29:15 - 1

And now: which is the best solution to correct this behavior?

You can use bin_overrides to supply a new command for bjobs (you may already be using it). You can use this to provide some sort of wrapper shell script to either sleep or do anything else you need to give LSF the time it needs.

Hi Jeff,
following your advice I can start xfce, gnome and kde on the same node.
Great wok!

Thank you very much and see you next time.
Luigi

1 Like