Problems with mate desktop launch on compute nodes

The OOD server proxies to the compute node, using the hostname the scheduler gives in the job information. You can tell where it’s trying to proxy to from the URL: the host and port are in the path query parameter when you try to connect.

so the flow is:
client --> OOD --> computenode:<websockify port>
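To see which host and port OOD will try to reach, you can pull them out of the connect URL. Here is a minimal sketch, assuming the URL has the same shape as the noVNC links in this thread (a `path` parameter of the form `rnode/<host>/<port>/websockify`). The function name and the example.com hostnames are made up for illustration.

```python
from urllib.parse import urlsplit, parse_qs


def proxy_target(connect_url):
    """Return (host, port) read from the `path` query parameter of a connect URL."""
    query = parse_qs(urlsplit(connect_url).query)
    # parse_qs percent-decodes the value, so "rnode%2Fnode01%2F50325%2Fwebsockify"
    # becomes "rnode/node01/50325/websockify".
    parts = query["path"][0].strip("/").split("/")
    if len(parts) < 3 or parts[0] != "rnode":
        raise ValueError("unexpected path parameter: %r" % query["path"][0])
    return parts[1], int(parts[2])


# Hypothetical URL, shaped like the ones OOD generates for noVNC sessions.
url = (
    "https://ondemand.example.com/pun/sys/dashboard/noVNC-1.1.0/vnc.html"
    "?autoconnect=true&path=rnode%2Fnode01%2F50325%2Fwebsockify&resize=remote"
)
print(proxy_target(url))
```

The host that comes back is the name the OOD web server has to resolve and reach on that port.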

But the issue seems to be that OOD is trying to connect to the websockify server on the compute node that never started.

Interesting. I see the following when I launch an interactive job, which tells me it is running websockify:
host: pplhpc1gn002.cm.cluster
port: 5901
password: Ox1Vhtsw
display: 1
websocket: 50325
spassword: NiCT7R66

Looks like it is listening: netstat -tupln |grep LISTEN
tcp 0 0 0.0.0.0:5901 0.0.0.0:* LISTEN 60978/Xvnc
tcp 0 0 0.0.0.0:111 0.0.0.0:* LISTEN 1/systemd
tcp 0 0 127.0.0.1:47569 0.0.0.0:* LISTEN 446833/mpirun
tcp 0 0 0.0.0.0:50325 0.0.0.0:* LISTEN 61024/python
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN 8637/sshd
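If you want to check this from a script rather than by eye, something like the following could work. It is only a sketch that assumes the whitespace-separated column layout shown in the `netstat -tupln` output above; the function name is made up.

```python
def listeners(netstat_output):
    """Map each listening TCP port to its 'pid/program' string."""
    found = {}
    for line in netstat_output.splitlines():
        fields = line.split()
        # Expected columns: proto recv-q send-q local foreign state pid/program
        if len(fields) >= 7 and fields[0].startswith("tcp") and fields[5] == "LISTEN":
            port = int(fields[3].rsplit(":", 1)[1])
            found[port] = fields[6]
    return found


# Two lines taken from the output above.
sample = """\
tcp 0 0 0.0.0.0:5901 0.0.0.0:* LISTEN 60978/Xvnc
tcp 0 0 0.0.0.0:50325 0.0.0.0:* LISTEN 61024/python
"""
ports = listeners(sample)
print(ports.get(50325))
```

Looking up the `websocket` port from the job information (50325 here) shows whether websockify, which runs as a python process, is actually bound on the node.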

Just to test something, what is the default port range for websockify? 50000 - ?

You can configure min and max, but it seems to default to anything greater than 2000 and < 65k.

OK so, the websocket is booting now? OK cool. So maybe it is a networking thing? In the example you’ve given, the OOD webserver needs to have connectivity to the given compute node (DNS-resolvable hostname pplhpc1gn002.cm.cluster) on port 50325.

Hmm okay, so I think my problem is the interface the OOD webserver is running on.
Currently I have external and internal interfaces which cannot communicate with each other, as seen below. I am going to try changing the web interface of OOD so that it is proxied from Apache, as shown. Can you confirm the following?
[Current Setup]
OOD Web [ 172.31.192.X ]
OOD internal [ 172.16.3.X]
Clients -> OOD web --X--> internal compute

[Proposed Setup]
OOD Web [ 172.16.3.X ] (Proxied by Apache so clients can reach from 172.31.192.X)
Clients -> OOD web (via 172.31.192.X) -> proxied to 172.16.3.X -> internal compute: websockify port

Thoughts?

Maybe? Our apache config binds to all interfaces (*). The question is how to force the outbound socket to open from the internal interface (which is what your proposed solution is)?

Maybe an iptables rule (or rules) would work for you? For example, routing external ip:443 to internal ip:443 may force outbound connections to use the internal interface. Or route the other way: force all outbound TCP connections to use the internal interface.

@tdockendorf do you have any suggestions? We need to proxy to an internal network from an external interface.

You could use static routes. If your internal network is 172.16.3.0/24 with a gateway of 172.16.3.1, internal=eth1 and external=eth0, then put something like this in /etc/sysconfig/network-scripts/route-eth1: 172.16.3.0/24 via 172.16.3.1. If your internal interface is on the same subnet as the hosts you’re trying to access then it should route just fine without a static route, and you can see the current routing by running route -n. I think RHEL systems will set up a default gateway route if you specify a GATEWAY in the ifcfg file in /etc/sysconfig/network-scripts.

If that doesn’t work then you may have to set up complicated iptables rules to route outbound 443 to the internal interface. I think that would require a PREROUTING rule, but I haven’t done those in years as I’ve moved away from multi-homed systems in favor of switch routing.
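On the point about the internal interface sharing a subnet with the compute hosts: a quick way to check whether a target address falls inside an interface's subnet, and so should be reachable without a static route, is the standard `ipaddress` module. This is just a sketch; the specific host addresses below are made-up examples in the 172.16.3.0/24 and 172.31.192.x ranges from this thread, and the /24 on the external range is an assumption.

```python
import ipaddress


def on_link(interface_cidr, target_ip):
    """True if target_ip is inside the subnet of the given interface address."""
    network = ipaddress.ip_interface(interface_cidr).network
    return ipaddress.ip_address(target_ip) in network


# Hypothetical addresses: internal interface, external interface, compute node.
print(on_link("172.16.3.10/24", "172.16.3.25"))
print(on_link("172.31.192.10/24", "172.16.3.25"))
```

If the compute node is only on-link for the internal interface, traffic to it should leave via that interface; anything else follows the default route.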

Circling back to this: we updated our IPs, so now we are running on the external network. However, I still can’t launch mate interactive sessions. Everything looks like it is working now and I can connect over xrdp, but it looks like there might be an issue with mate. Ref:
Connect url:
https://pplhpc1ln1.childrens.sea.kids/pun/sys/dashboard/noVNC-1.1.0/vnc.html?utf8=✓&autoconnect=true&path=rnode%2Fpplhpc1gn002%2F54773%2Fwebsockify&resize=remote&password=b3uKKt8H&compressionsetting=6&qualitysetting=2&commit=Launch+mate+desktop

Error message in /var/log/messages on that node:
Jun 12 10:36:26 pplhpc1gn002 org.gtk.vfs.Daemon: A connection to the bus can’t be made

Jun 12 10:36:26 pplhpc1gn002 org.gtk.vfs.Daemon: A connection to the bus can’t be made

Jun 12 10:36:38 pplhpc1gn002 org.a11y.Bus: Activating service name=‘org.a11y.atspi.Registry’

Jun 12 10:36:38 pplhpc1gn002 org.a11y.Bus: Successfully activated service ‘org.a11y.atspi.Registry’

Jun 12 10:36:38 pplhpc1gn002 org.a11y.atspi.Registry: SpiRegistry daemon is running with well-known name - org.a11y.atspi.Registry

Jun 12 10:36:38 pplhpc1gn002 mate-session[91682]: WARNING: Could not parse desktop file /home/bpette/.config/autostart/spice-vdagent.desktop: Key file does not start with a group

Jun 12 10:36:38 pplhpc1gn002 mate-session[91682]: GLib-GObject-CRITICAL: object GsmAutostartApp 0x72f320 finalized while still in-construction

Jun 12 10:36:38 pplhpc1gn002 mate-session[91682]: GLib-GObject-CRITICAL: Custom constructor for class GsmAutostartApp returned NULL (which is invalid). Please use GInitable instead.

Jun 12 10:36:38 pplhpc1gn002 mate-session[91682]: WARNING: could not read /home/bpette/.config/autostart/spice-vdagent.desktop

Jun 12 10:36:38 pplhpc1gn002 mate-session[91682]: WARNING: Could not parse desktop file /home/bpette/.config/autostart/pulseaudio.desktop: Key file does not start with a group

Jun 12 10:36:38 pplhpc1gn002 mate-session[91682]: GLib-GObject-CRITICAL: object GsmAutostartApp 0x72f250 finalized while still in-construction

Jun 12 10:36:38 pplhpc1gn002 mate-session[91682]: GLib-GObject-CRITICAL: Custom constructor for class GsmAutostartApp returned NULL (which is invalid). Please use GInitable instead.

Jun 12 10:36:38 pplhpc1gn002 mate-session[91682]: WARNING: could not read /home/bpette/.config/autostart/pulseaudio.desktop

Jun 12 10:36:38 pplhpc1gn002 mate-session[91682]: WARNING: Could not parse desktop file /home/bpette/.config/autostart/gnome-keyring-gpg.desktop: Key file does not start with a group

Jun 12 10:36:38 pplhpc1gn002 mate-session[91682]: GLib-GObject-CRITICAL: object GsmAutostartApp 0x72f4c0 finalized while still in-construction

Jun 12 10:36:38 pplhpc1gn002 mate-session[91682]: GLib-GObject-CRITICAL: Custom constructor for class GsmAutostartApp returned NULL (which is invalid). Please use GInitable instead.

Jun 12 10:36:38 pplhpc1gn002 mate-session[91682]: WARNING: could not read /home/bpette/.config/autostart/gnome-keyring-gpg.desktop

Jun 12 10:36:38 pplhpc1gn002 mate-session[91682]: WARNING: Could not parse desktop file /home/bpette/.config/autostart/xfce4-power-manager.desktop: Key file does not start with a group

Jun 12 10:36:38 pplhpc1gn002 mate-session[91682]: GLib-GObject-CRITICAL: object GsmAutostartApp 0x72f0b0 finalized while still in-construction

Jun 12 10:36:38 pplhpc1gn002 mate-session[91682]: GLib-GObject-CRITICAL: Custom constructor for class GsmAutostartApp returned NULL (which is invalid). Please use GInitable instead.

Jun 12 10:36:38 pplhpc1gn002 mate-session[91682]: WARNING: could not read /home/bpette/.config/autostart/xfce4-power-manager.desktop

Jun 12 10:36:38 pplhpc1gn002 mate-session[91682]: WARNING: Could not parse desktop file /home/bpette/.config/autostart/rhsm-icon.desktop: Key file does not start with a group

Jun 12 10:36:38 pplhpc1gn002 mate-session[91682]: GLib-GObject-CRITICAL: object GsmAutostartApp 0x72f0b0 finalized while still in-construction

Jun 12 10:36:38 pplhpc1gn002 mate-session[91682]: GLib-GObject-CRITICAL: Custom constructor for class GsmAutostartApp returned NULL (which is invalid). Please use GInitable instead.

Jun 12 10:36:38 pplhpc1gn002 mate-session[91682]: WARNING: could not read /home/bpette/.config/autostart/rhsm-icon.desktop

Jun 12 10:36:38 pplhpc1gn002 org.gtk.vfs.AfcVolumeMonitor: Volume monitor alive

Jun 12 10:36:39 pplhpc1gn002 dbus[17991]: [system] Activating service name=‘org.mate.SettingsDaemon.DateTimeMechanism’ (using servicehelper)

Jun 12 10:36:39 pplhpc1gn002 dbus[17991]: [system] Successfully activated service ‘org.mate.SettingsDaemon.DateTimeMechanism’

Those are from /var/log/messages; what’s the output of the job? (It’s in a deep path like this: ~/ondemand/data/sys/dashboard/batch_connect/sys/bc_desktop/vdi-owens/output/729943bf-98b9-459c-8910-fbcb8f58179d/output.log)
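Since that path is deep and has one directory per session, a small helper can show which sessions actually have an output.log and a connection.yml. A minimal sketch, assuming the layout in the example path above (one subdirectory per session under the output directory); the function name is made up.

```python
from pathlib import Path


def session_files(output_root):
    """For each session directory under output_root, report which debug files exist."""
    report = {}
    for session in sorted(Path(output_root).iterdir()):
        if session.is_dir():
            report[session.name] = {
                name: (session / name).is_file()
                for name in ("output.log", "connection.yml")
            }
    return report
```

You would call it with the `output` directory from the path above, for example `session_files(Path.home() / "ondemand/data/sys/dashboard/batch_connect/sys/bc_desktop/vdi-owens/output")`. The cluster part of that path (`vdi-owens` in the example) will differ per site.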

If you’re worried about connectivity between the two I’d test that. You can try a telnet from the OOD webserver host to the compute host. You can find the destination host:port combinations in connection.yml in the same directory the output log is in above.
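If telnet isn't installed on the web host, the same test can be done with a plain TCP connect. This is a rough equivalent, not anything OOD itself provides; the function name is made up.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS resolution failures.
        return False
```

Run from the OOD web server, `can_connect("pplhpc1gn002.cm.cluster", 50325)` would test the host and websocket port from the job information earlier in the thread. A False result could mean a routing, firewall or name resolution problem, so it narrows things down rather than pinpointing the cause.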

Hmm, I am not seeing an output.log in the directory.
I can connect just fine from the web host on those host:port combos.

Thanks,

Is the directory empty? That output.log file is what we’ve told the scheduler to use as the output file. What scheduler do you use? In, say, SLURM we’re setting the -o option to point to it. Or is that why they’re in /var/log/messages? Either way, you shouldn’t have to look at system logfiles to debug; the scheduler should be writing the files to that directory.

No, the directory has stuff in it, just no output.log. When it was failing because it couldn’t find the TurboVNC binary it would give an output, but when it works I am not seeing the output.log.

Is there any reason why, when I put the FQDN in ood_portal.yml, it wouldn’t show that in the URL?
Worth noting:
Prod=1.7
Dev=1.6.22
Ref:
Prod=/rnode/hpc1gn002/38805/websockify
Dev=/rnode/hpc1cn001.interna.cluster/35810

After some digging I found the log file, not sure why it isn’t being written back but here is what it looks like:
Setting VNC password…

Starting VNC server…

Desktop ‘TurboVNC: pplhpc1gn002:1 (bpette)’ started on display pplhpc1gn002:1

Log file is vnc.log

Successfully started VNC server on pplhpc1gn002:5901…

Script starting…

Starting websocket server…

WebSocket server settings:

  • Listen on :38805

  • No SSL/TLS support (no cert file)

  • Backgrounding (daemon)

Scanning VNC log file for user authentications…

Generating connection YAML file…

cmdTrace.c(713):ERROR:104: ‘restore’ is an unrecognized subcommand

cmdModule.c(411):ERROR:104: ‘restore’ is an unrecognized subcommand

Launching desktop ‘mate’…

cat: /etc/xdg/autostart/gnome-keyring-gpg.desktop: No such file or directory

cat: /etc/xdg/autostart/pulseaudio.desktop: No such file or directory

cat: /etc/xdg/autostart/rhsm-icon.desktop: No such file or directory

cat: /etc/xdg/autostart/spice-vdagent.desktop: No such file or directory

cat: /etc/xdg/autostart/xfce4-power-manager.desktop: No such file or directory

generating cookie with syscall

generating cookie with syscall

generating cookie with syscall

generating cookie with syscall

mate-session[140587]: WARNING: Could not parse desktop file /home/bpette/.config/autostart/spice-vdagent.desktop: Key file does not start with a group

mate-session[140587]: GLib-GObject-CRITICAL: object GsmAutostartApp 0x72f320 finalized while still in-construction

mate-session[140587]: GLib-GObject-CRITICAL: Custom constructor for class GsmAutostartApp returned NULL (which is invalid). Please use GInitable instead.

mate-session[140587]: WARNING: could not read /home/bpette/.config/autostart/spice-vdagent.desktop

mate-session[140587]: WARNING: Could not parse desktop file /home/bpette/.config/autostart/pulseaudio.desktop: Key file does not start with a group

mate-session[140587]: GLib-GObject-CRITICAL: object GsmAutostartApp 0x72f250 finalized while still in-construction

mate-session[140587]: GLib-GObject-CRITICAL: Custom constructor for class GsmAutostartApp returned NULL (which is invalid). Please use GInitable instead.

mate-session[140587]: WARNING: could not read /home/bpette/.config/autostart/pulseaudio.desktop

mate-session[140587]: WARNING: Could not parse desktop file /home/bpette/.config/autostart/gnome-keyring-gpg.desktop: Key file does not start with a group

mate-session[140587]: GLib-GObject-CRITICAL: object GsmAutostartApp 0x72f4c0 finalized while still in-construction

mate-session[140587]: GLib-GObject-CRITICAL: Custom constructor for class GsmAutostartApp returned NULL (which is invalid). Please use GInitable instead.

mate-session[140587]: WARNING: could not read /home/bpette/.config/autostart/gnome-keyring-gpg.desktop

mate-session[140587]: WARNING: Could not parse desktop file /home/bpette/.config/autostart/xfce4-power-manager.desktop: Key file does not start with a group

mate-session[140587]: GLib-GObject-CRITICAL: object GsmAutostartApp 0x72f0b0 finalized while still in-construction

mate-session[140587]: GLib-GObject-CRITICAL: Custom constructor for class GsmAutostartApp returned NULL (which is invalid). Please use GInitable instead.

mate-session[140587]: WARNING: could not read /home/bpette/.config/autostart/xfce4-power-manager.desktop

mate-session[140587]: WARNING: Could not parse desktop file /home/bpette/.config/autostart/rhsm-icon.desktop: Key file does not start with a group

mate-session[140587]: GLib-GObject-CRITICAL: object GsmAutostartApp 0x72f0b0 finalized while still in-construction

mate-session[140587]: GLib-GObject-CRITICAL: Custom constructor for class GsmAutostartApp returned NULL (which is invalid). Please use GInitable instead.

mate-session[140587]: WARNING: could not read /home/bpette/.config/autostart/rhsm-icon.desktop

SELinux Troubleshooter: Applet requires SELinux be enabled to run.

(nm-applet:140703): nm-applet-WARNING **: 16:36:09.055: NetworkManager is not running

/usr/share/system-config-printer/applet.py:44: PyGIWarning: Notify was imported without specifying a version first. Use gi.require_version(‘Notify’, ‘0.7’) before import to ensure that the right version gets loaded.

from gi.repository import Notify

system-config-printer-applet: failed to start NewPrinterNotification service

system-config-printer-applet: failed to start PrinterDriversInstaller service: org.freedesktop.DBus.Error.AccessDenied: Connection “:1.6082” is not allowed to own the service “com.redhat.PrinterDriversInstaller” due to security policies in the configuration file

Initializing caja-image-converter extension

Initializing caja-open-terminal extension

*** ERROR ***

TI:16:36:09 TH:0x6cda60 FI:gpm-manager.c FN:gpm_manager_systemd_inhibit,1784

OK! We can’t see the stack there, but I’m guessing it’s the same as this topic. The mate-power-manager package is bad news. It just doesn’t work for multi user systems. (It won’t let users see the power buttons on that machine). We don’t have it on our systems, and indeed when I build a MATE singularity image for desktops I don’t include it. So I would get rid of it. Again, users shouldn’t need to see that panel anyhow - it’s the button to stop/restart, so it’s useless to them anyhow.

The very last line of your output is exactly the same as the other topic. For whatever reason the stderr of your job isn’t combined with the stdout. What kind of scheduler do you have? I want to be sure there’s not something amiss on our side.

We use PBSPro 13.X atm. I will give removing mate-power-manager a try and see if that resolves the issue. Also, I found the “output.log” gets populated after the job finishes only when it fails…

After some fiddling with the system pathing I am now getting the correct output.log after each job. Please find below:

Setting VNC password…
Starting VNC server…

Desktop ‘TurboVNC: pplhpc1ood01:1 (bpette)’ started on display pplhpc1ood01:1

Log file is vnc.log
Successfully started VNC server on pplhpc1ood01:5901…
Script starting…
Starting websocket server…
WebSocket server settings:

  • Listen on :61830
  • No SSL/TLS support (no cert file)
  • Backgrounding (daemon)
Scanning VNC log file for user authentications…
Generating connection YAML file…
cmdTrace.c(713):ERROR:104: ‘restore’ is an unrecognized subcommand
cmdModule.c(411):ERROR:104: ‘restore’ is an unrecognized subcommand
Launching desktop ‘mate’…
cat: /etc/xdg/autostart/gnome-keyring-gpg.desktop: No such file or directory
cat: /etc/xdg/autostart/rhsm-icon.desktop: No such file or directory
generating cookie with syscall
generating cookie with syscall
generating cookie with syscall
generating cookie with syscall
mate-session[119775]: WARNING: Could not parse desktop file /home/bpette/.config/autostart/gnome-keyring-gpg.desktop: Key file does not start with a group
mate-session[119775]: GLib-GObject-CRITICAL: object GsmAutostartApp 0x767410 finalized while still in-construction
mate-session[119775]: GLib-GObject-CRITICAL: Custom constructor for class GsmAutostartApp returned NULL (which is invalid). Please use GInitable instead.
mate-session[119775]: WARNING: could not read /home/bpette/.config/autostart/gnome-keyring-gpg.desktop
mate-session[119775]: WARNING: Could not parse desktop file /home/bpette/.config/autostart/rhsm-icon.desktop: Key file does not start with a group
mate-session[119775]: GLib-GObject-CRITICAL: object GsmAutostartApp 0x767410 finalized while still in-construction
mate-session[119775]: GLib-GObject-CRITICAL: Custom constructor for class GsmAutostartApp returned NULL (which is invalid). Please use GInitable instead.
mate-session[119775]: WARNING: could not read /home/bpette/.config/autostart/rhsm-icon.desktop
vmware-user: could not open /proc/fs/vmblock/dev
/usr/bin/vmtoolsd: symbol lookup error: /lib64/libvmtools.so.0: undefined symbol: intf_close
SELinux Troubleshooter: Applet requires SELinux be enabled to run.
/usr/share/system-config-printer/applet.py:44: PyGIWarning: Notify was imported without specifying a version first. Use gi.require_version(‘Notify’, ‘0.7’) before import to ensure that the right version gets loaded.
from gi.repository import Notify
system-config-printer-applet: failed to start NewPrinterNotification service
system-config-printer-applet: failed to start PrinterDriversInstaller service: org.freedesktop.DBus.Error.AccessDenied: Connection “:1.2140” is not allowed to own the service “com.redhat.PrinterDriversInstaller” due to security policies in the configuration file

(nm-applet:119889): nm-applet-WARNING **: 13:59:27.630: NetworkManager is not running
Initializing caja-image-converter extension
Initializing caja-open-terminal extension

(caja:119852): GLib-CRITICAL **: 13:59:28.127: g_hash_table_foreach: assertion ‘version == hash_table->version’ failed
mate-session[119775]: WARNING: Detected that screensaver has left the bus
Window manager warning: Fatal IO error 11 (Resource temporarily unavailable) on display ‘:1’.
mate-settings-daemon: Fatal IO error 11 (Resource temporarily unavailable) on X server :1.
[1592254767,000,xklavier.c:xkl_engine_start_listen/] The backend does not require manual layout management - but it is provided by the application
Gdk-Message: 13:59:45.746: abrt: Fatal IO error 11 (Resource temporarily unavailable) on X server :1.

Gdk-Message: 13:59:45.748: mate-session: Fatal IO error 104 (Connection reset by peer) on X server :1.
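A side note on the repeated “Key file does not start with a group” warnings in the logs above: they are probably noise rather than the cause of the failure, but they do suggest the per-user autostart files don't begin with a `[Desktop Entry]` style header, possibly because the `cat` of the missing /etc/xdg/autostart source failed when they were written. Here is a rough sketch for listing such files. It only approximates the parser's rule (first non-blank, non-comment line must be a `[group]` header) and the function names are made up.

```python
from pathlib import Path


def starts_with_group(text):
    """True if the first non-blank, non-comment line looks like a [group] header."""
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        return stripped.startswith("[") and stripped.endswith("]")
    return False  # empty or comment-only file


def suspect_autostart_files(autostart_dir):
    """Names of .desktop files in autostart_dir that do not start with a group."""
    return sorted(
        p.name
        for p in Path(autostart_dir).glob("*.desktop")
        if not starts_with_group(p.read_text(errors="replace"))
    )
```

Pointing `suspect_autostart_files` at `~/.config/autostart` (expanded to a real path) for the affected user should list the same files that mate-session complains about.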

Okay, so I am running out of things to test here, but here goes:
I recently did a fresh install of a cluster node with mate desktop to try and determine where this was having an issue. No matter how I configured the node I am still not able to launch VNC sessions to mate (or any other desktop) for the interactive desktop. After some deep diving into the web page I am seeing:

WebSocket connection to ‘wss://pdlhpc1ln1.childrens.sea.kids/rnode/pdlhpc1cn001/13431/websockify’ failed: Error during WebSocket handshake: Unexpected response code: 404

Does this help point to a problem? Or am I still not able to pinpoint the reason why I can’t launch a remote desktop?

Alright, figured out the issue. The ‘rnode/$hostname’ is not properly resolving to ‘rnode/$hostname.example.com’; if I manually put the FQDN in there, it works. I have the “host_regex” set up to use the FQDN but it isn’t working.

Awesome! Sorry for the delay.

I see earlier we set the set_host parameter. It must be choosing the short name instead of the long name? I would fix the FQDN issue there because that’s how that URL is being populated to begin with.

The regex is there to limit what folks can proxy to. I don’t believe it can change the URL, only block it.
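To illustrate why a short hostname in the URL can end in a 404 when host_regex only allows FQDNs, here is a small sketch. It approximates the kind of `/rnode/<host>/<port>` match the proxy configuration performs using Python's `re` module; the exact Apache pattern may differ, and the regex and hostnames below are made-up examples.

```python
import re


def rnode_allowed(path, host_regex):
    """True if path looks like /rnode/<host>/<port>[/...] with host matching host_regex."""
    pattern = re.compile(rf"^/rnode/(?P<host>{host_regex})/(?P<port>\d+)(?P<rest>/.*)?$")
    return pattern.match(path) is not None


# Hypothetical host_regex that only allows fully qualified names.
host_regex = r"[\w.-]+\.example\.com"

print(rnode_allowed("/rnode/node01/13431/websockify", host_regex))
print(rnode_allowed("/rnode/node01.example.com/13431/websockify", host_regex))
```

The regex only filters: a URL built from the short name simply fails to match, which fits the handshake 404 seen in the browser.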

Though the nodes are set to use the FQDN, apparently it wants the base hostname to be that as well. I ran 'hostnamectl set-hostname $hostname.fqdn.com' and that fixed it… I spent WAAAAAAAYYYY too long troubleshooting something so simple.

Note: This has to be run on each compute node.
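To check which compute nodes still report a short hostname, a quick test like this could be run on each node. It is only a sketch: it treats any name with at least two dot-separated labels as fully qualified, which is a simplification, and the function name is made up.

```python
import socket


def is_fully_qualified(hostname):
    """True if hostname has at least two non-empty dot-separated labels."""
    labels = hostname.rstrip(".").split(".")
    return len(labels) >= 2 and all(labels)


# The name the node reports for itself, which is what ends up in the rnode URL here.
name = socket.gethostname()
print(name, "FQDN" if is_fully_qualified(name) else "short name")
```

Any node that prints "short name" would presumably still need the hostnamectl change above.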