Problems with mate desktop launch on compute nodes

Apologies if this was solved somewhere else, but I can’t seem to find anything related to this issue specifically. I recently updated TurboVNC to 2.2 to resolve the VNC issue, and that error is gone. Everything looks like it launches correctly, but when I try to connect to the node I get a noVNC error that says “Failed to connect to server.” I have confirmed that I can connect to the server and to the port, but on the node I see the following messages when the session launches. Are these unrelated, or do I have a config issue?

Apr 29 07:35:15 pplhpc1gn001 org.a11y.Bus: Activating service name=‘org.a11y.atspi.Registry’

Apr 29 07:35:15 pplhpc1gn001 org.a11y.Bus: Successfully activated service ‘org.a11y.atspi.Registry’

Apr 29 07:35:15 pplhpc1gn001 org.a11y.atspi.Registry: SpiRegistry daemon is running with well-known name - org.a11y.atspi.Registry

Apr 29 07:35:15 pplhpc1gn001 mate-session[32727]: WARNING: Could not parse desktop file /home/$user/.config/autostart/spice-vdagent.desktop: Key file does not start with a group

Apr 29 07:35:15 pplhpc1gn001 mate-session[32727]: GLib-GObject-CRITICAL: object GsmAutostartApp 0x72f320 finalized while still in-construction

Apr 29 07:35:15 pplhpc1gn001 mate-session[32727]: GLib-GObject-CRITICAL: Custom constructor for class GsmAutostartApp returned NULL (which is invalid). Please use GInitable instead.

Apr 29 07:35:15 pplhpc1gn001 mate-session[32727]: WARNING: could not read /home/$user/.config/autostart/spice-vdagent.desktop

Apr 29 07:35:15 pplhpc1gn001 mate-session[32727]: WARNING: Could not parse desktop file /home/bpette/.config/autostart/pulseaudio.desktop: Key file does not start with a group

Apr 29 07:35:15 pplhpc1gn001 mate-session[32727]: GLib-GObject-CRITICAL: object GsmAutostartApp 0x72f250 finalized while still in-construction

Apr 29 07:35:15 pplhpc1gn001 mate-session[32727]: GLib-GObject-CRITICAL: Custom constructor for class GsmAutostartApp returned NULL (which is invalid). Please use GInitable instead.

Apr 29 07:35:15 pplhpc1gn001 mate-session[32727]: WARNING: could not read /home/$user/.config/autostart/pulseaudio.desktop

Apr 29 07:35:15 pplhpc1gn001 mate-session[32727]: WARNING: Could not parse desktop file /home/$user/.config/autostart/gnome-keyring-gpg.desktop: Key file does not start with a group

Apr 29 07:35:15 pplhpc1gn001 mate-session[32727]: GLib-GObject-CRITICAL: object GsmAutostartApp 0x72f4c0 finalized while still in-construction

Apr 29 07:35:15 pplhpc1gn001 mate-session[32727]: GLib-GObject-CRITICAL: Custom constructor for class GsmAutostartApp returned NULL (which is invalid). Please use GInitable instead.

Apr 29 07:35:15 pplhpc1gn001 mate-session[32727]: WARNING: could not read /home/$user/.config/autostart/gnome-keyring-gpg.desktop

Apr 29 07:35:15 pplhpc1gn001 mate-session[32727]: WARNING: Could not parse desktop file /home/$user/.config/autostart/xfce4-power-manager.desktop: Key file does not start with a group

Apr 29 07:35:15 pplhpc1gn001 mate-session[32727]: GLib-GObject-CRITICAL: object GsmAutostartApp 0x72f0b0 finalized while still in-construction

Apr 29 07:35:15 pplhpc1gn001 mate-session[32727]: GLib-GObject-CRITICAL: Custom constructor for class GsmAutostartApp returned NULL (which is invalid). Please use GInitable instead.

Apr 29 07:35:15 pplhpc1gn001 mate-session[32727]: WARNING: could not read /home/$user/.config/autostart/xfce4-power-manager.desktop

Apr 29 07:35:15 pplhpc1gn001 mate-session[32727]: WARNING: Could not parse desktop file /home/$user/.config/autostart/rhsm-icon.desktop: Key file does not start with a group

Apr 29 07:35:15 pplhpc1gn001 mate-session[32727]: GLib-GObject-CRITICAL: object GsmAutostartApp 0x72f0b0 finalized while still in-construction

Apr 29 07:35:15 pplhpc1gn001 mate-session[32727]: GLib-GObject-CRITICAL: Custom constructor for class GsmAutostartApp returned NULL (which is invalid). Please use GInitable instead.

Apr 29 07:35:15 pplhpc1gn001 mate-session[32727]: WARNING: could not read /home/$user/.config/autostart/rhsm-icon.desktop

Apr 29 07:35:15 pplhpc1gn001 org.gtk.vfs.AfcVolumeMonitor: Volume monitor alive

Apr 29 07:35:16 pplhpc1gn001 dbus[16696]: [system] Activating service name=‘org.mate.SettingsDaemon.DateTimeMechanism’ (using servicehelper)

Apr 29 07:35:16 pplhpc1gn001 dbus[16696]: [system] Successfully activated service ‘org.mate.SettingsDaemon.DateTimeMechanism’

Doesn’t look like anything to me.

Though, I’ve run into one issue that’s easy to miss: websockify doesn’t exist or can’t be found, and there’s only one line about it in the output. So I’d verify that, because I don’t think what you’ve listed here are errors (and besides, you’ve said you can connect to the VNC server, so all seems well on that front).

Okay, let me take a look. I have websockify installed there, but I could see it not being found.

When I do the following it returns:
which websockify
/usr/bin/websockify

rpm check:
rpm -qa |grep python-websockify
python-websockify-0.8.0-1.el7.noarch

My cluster config looks like:
batch_connect:
basic:
script_wrapper: |
module purge
%s
set_host: "host=$(hostname -A | awk '{print $1}')"
vnc:
script_wrapper: |
module purge
export PATH="/opt/TurboVNC/bin:$PATH" export WEBSOCKIFY_CMD="/usr/bin/websockify" %s set_host: "host=$(hostname -A | awk '{print $1}')"

So not sure what else to check here.

Thanks,

Yep! It defaults to /opt/websockify/run. You can change it in your cluster config below.

      batch_connect:
        vnc:
          websockify_cmd: '/usr/bin/websockify'

Apologies, is that not what I have there? Or does it need to be lowercase?

Thanks,

My apologies! I didn’t notice that. Maybe you need newlines?

script_wrapper: |
   module purge
   export PATH="/opt/TurboVNC/bin:$PATH"
   export WEBSOCKIFY_CMD="/usr/bin/websockify"
   %s
set_host: "host=$(hostname -A | awk '{print $1}')"

That is how it looks in my script. I think it just lost its formatting when pasted. I will try "dos2unix"ing it and see if that helps.

Does the error message say /opt/websockify/run: No such file or directory, or does it name /usr/bin/websockify? That’ll tell us whether the environment variable is working (which it should be if you say that’s the formatting).

This is the file:command it’s trying to invoke, so WEBSOCKIFY_CMD is the right environment variable.

job_script_content.sh:${WEBSOCKIFY_CMD:-/opt/websockify/run} -D ${websocket} localhost:${port}

So either the variable isn’t working like we think or it really can’t find it on the compute node.
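One way to tell those two cases apart is to reproduce the same expansion on the compute node. The sketch below is only an illustration: `resolve_websockify` is a hypothetical helper name, not something Open OnDemand ships, and it assumes you run it in the same environment the job script sees.

```shell
# Mirrors the expansion quoted above from job_script_content.sh:
#   ${WEBSOCKIFY_CMD:-/opt/websockify/run}
# resolve_websockify is a hypothetical helper, not part of Open OnDemand.
resolve_websockify() {
  # Use WEBSOCKIFY_CMD if it is set and non-empty, otherwise the default path.
  printf '%s\n' "${WEBSOCKIFY_CMD:-/opt/websockify/run}"
}

cmd="$(resolve_websockify)"
if [ -x "$cmd" ]; then
  echo "websockify command resolves to: $cmd (executable)"
else
  echo "websockify command resolves to: $cmd (missing or not executable)"
fi
```

If this prints /opt/websockify/run, the export in script_wrapper is not reaching the job; if it prints /usr/bin/websockify but reports it missing, the binary is not installed on that node.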

Here is a thought. How does Open OnDemand handle connecting to the nodes via the scheduler? Is it the Open OnDemand web interface that is trying to connect to the port running on the client?

Basically, I have the web interface running on a 172.31.X network.
And the client running on the 172.16.3.X network.

The web interface has a 172.16.3.X address as well, but could it be that it is not using this interface appropriately?

Thanks,

The OOD server proxies to the compute node, using the hostname the scheduler has given in the job information. You can tell where it’s trying to proxy to from the URL: the host and port are given in the path query parameter when you try to connect.

so the flow is:
client --> OOD --> computenode:<websockify port>

But the issue seems to be that OOD is trying to connect to the websockify server on the compute node that never started.

Interesting. I see the following when I launch an interactive job, which tells me it is running websockify:
host: pplhpc1gn002.cm.cluster
port: 5901
password: Ox1Vhtsw
display: 1
websocket: 50325
spassword: NiCT7R66

Looks like it’s listening: netstat -tupln |grep LISTEN
tcp 0 0 0.0.0.0:5901 0.0.0.0:* LISTEN 60978/Xvnc
tcp 0 0 0.0.0.0:111 0.0.0.0:* LISTEN 1/systemd
tcp 0 0 127.0.0.1:47569 0.0.0.0:* LISTEN 446833/mpirun
tcp 0 0 0.0.0.0:50325 0.0.0.0:* LISTEN 61024/python
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN 8637/sshd

Just to test something, what is the port range that websockify uses by default? 50000 and up?

You can configure min and max, but it seems to default to anything greater than 2000 and < 65k.

OK, so websockify is booting now? Cool. So maybe it is a networking thing? In the example you’ve given, the OOD web server needs to have connectivity to the given compute node (DNS-resolvable hostname pplhpc1gn002.cm.cluster) on port 50325.
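A quick way to check that connectivity from the OOD web server is a plain TCP connect test. This is a minimal sketch, assuming bash and the coreutils `timeout` command are available on the OOD host; `check_port` is a hypothetical helper name.

```shell
# check_port is a hypothetical helper. It attempts a TCP connection using
# bash's /dev/tcp pseudo-device and gives up after 3 seconds.
check_port() {
  host="$1"
  port="$2"
  # Inside bash -c, $0 is the host and $1 is the port passed as arguments.
  if timeout 3 bash -c 'exec 3<>/dev/tcp/$0/$1' "$host" "$port" 2>/dev/null; then
    echo "reachable: $host:$port"
  else
    echo "NOT reachable: $host:$port"
    return 1
  fi
}

# Example, run on the OOD web server with the host and websocket port
# from the job information:
# check_port pplhpc1gn002.cm.cluster 50325
```

If the port is reachable from the compute node itself but not from the OOD web server, the problem is in name resolution or routing between the two, not in websockify.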

Hmm okay, so I think my problem is the interface the OOD web server is running on.
Currently I have external and internal interfaces which cannot communicate with each other, as shown below. I am going to try changing the OOD web interface so that it is proxied from Apache. Can you confirm the following?
[Current Setup]
OOD Web [ 172.31.192.X ]
OOD internal [ 172.16.3.X]
Clients -> OOD web --X--> internal compute

[Proposed Setup]
OOD Web [ 172.16.3.X ] (Proxied by Apache so clients can reach from 172.31.192.X)
Clients -> OOD web (via 172.31.192.X) -> proxied to 172.16.3.X -> internal compute: websockify port

Thoughts?

Maybe? Our apache config binds to all interfaces (*). The question is how to force the outbound socket to open from the internal interface (which is what your proposed solution is)?

Maybe an iptables rule (or rules) would work for you? For example, routing external ip:443 to internal ip:443 may force outbound connections to use the internal interface. Or route the other way: force all outbound TCP connections to use the internal interface.

@tdockendorf do you have any suggestions? We need to proxy to an internal network from an external interface.

You could use static routes. If your internal network is 172.16.3.0/24 with a gateway of 172.16.3.1, internal=eth1, and external=eth0, then put something like this in /etc/sysconfig/network-scripts/route-eth1: 172.16.3.0/24 via 172.16.3.1. If your internal interface is on the same subnet as the hosts you’re trying to access, then it should route just fine without a static route, which you can confirm by checking the current routing table with route -n. I think RHEL systems will set up a default gateway route if you specify a GATEWAY in the ifcfg file in /etc/sysconfig/network-scripts.
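To see which interface the kernel would actually pick for a given compute node, `ip route get` from iproute2 can be used alongside route -n. The sketch below only parses its output; `route_dev` is a hypothetical helper name, and 172.16.3.10 is a placeholder address for a compute node.

```shell
# route_dev is a hypothetical helper: it prints the interface name that
# follows the "dev" keyword in a line of `ip route get` output.
route_dev() {
  awk '{ for (i = 1; i < NF; i++) if ($i == "dev") { print $(i + 1); exit } }'
}

# Example, run on the OOD host (placeholder compute node address):
# ip route get 172.16.3.10 | route_dev
```

With the interface names assumed above, eth1 would mean traffic to the compute node leaves through the internal interface, while eth0 would mean it is going out the external side.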

If that doesn’t work, then you may have to set up complicated iptables rules to route outbound 443 to the internal interface. I think that would require a PREROUTING rule, but I haven’t done those in years, as I’ve moved away from multi-homed systems in favor of switch routing.