OOD 1.5.5: Cluster config: batch_connect: vnc: environment settings issue

Having successfully deployed an OOD 1.5.5 instance in a sandbox on a VM, I am now trying to do the same in our HPC environment. I'm having an issue with the Interactive Desktop, specifically: the environment variable settings I have defined in my cluster configuration in /etc/ood/config/clusters.d are somehow not making their way into the slurm_script that runs the desktop on the compute node. As a result, the desktop session attempts to find websockify at the default location (/opt/websockify/run), where it is not found. I can see all the Mate desktop processes running, but in the absence of a socket, I am unable to connect.

Here is the cluster configuration in my HPC environment:

v2:
  metadata:
    title: "HPC"
  login:
    host: "login.hpc.myschool.edu"
  job:
    adapter: "slurm"
    cluster: "slurm_cluster"
    bin: "/cm/shared/apps/slurm/current/bin"
    conf: "/cm/shared/apps/slurm/var/etc/slurm.conf"
  batch_connnect:
    basic:
      script_wrapper: |
        module purge
        source /etc/environment
        %s
    vnc:
      script_wrapper: |
        module purge
        source /etc/environment
        module load python
        export PATH="/opt/TurboVNC/bin:${PATH}"
        export WEBSOCKIFY_CMD="/usr/bin/websockify"
        %s

Besides the hostname, the only difference in our (working) sandbox is that python (v3) is installed directly from an RPM rather than being loaded as a module. When we start a desktop in the sandbox environment, the job_script_content.sh produced in the user’s output log starts with the lines from the cluster config:

module purge
source /etc/environment
export PATH="/opt/TurboVNC/bin:$PATH"
export WEBSOCKIFY_CMD="/usr/bin/websockify"

In the job_script_content.sh on our new HPC instance, we’re missing those lines altogether, and around 16 lines into the output.log, when it attempts to start websockify, it logs this:

Script starting...
Starting websocket server...
/cm/local/apps/slurm/var/spool/job7940461/slurm_script: line 143: /opt/websockify/run: No such file or directory

In both environments, websockify is located at /usr/bin/websockify, not /opt/websockify/run.

Can anyone tell me why this is happening and what I need to do to fix it?

Thank you,

Richard

Hi, thanks for all the details in your question! This works in your sandbox but not in your HPC (production?) environment, so that's a clue. As a quick spot check, be sure they're the same versions. 1.5.5 isn't that old, but it's not that new either. (1.6.20 is the latest, just FYI.)

Your YAML looks good. I copied it, loaded it, and it parsed correctly.

But my guess is that if the library had read the YAML correctly, it would have written the job script correctly (as it does in your lab sandbox). The library doesn't interpret the wrapper content in any way, so its actual content only matters insofar as it affects YAML parsing.

Can you do a quick diff on your cluster configs between the two environments? Sometimes it's something as simple as a YAML indentation issue. We use single lines like the example below; maybe you can try that as a test instead of the | block style?

Obviously there'll be some quoting issues, but my guess is it's something simple like a YAML parsing issue: the library can't read that portion and just discards it.

  batch_connect:
    basic:
      script_wrapper: "module restore\n%s"
    vnc:
      script_wrapper: "module restore\nmodule load ondemand-vnc\n%s"
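In case it helps to see how the wrapper is applied: as I understand it, OnDemand substitutes the generated job body at the %s placeholder using plain printf-style string formatting, so the wrapper lines run first. A minimal illustration with a stand-in job body (the echo line is just a placeholder, not the real generated script):

```python
# Sketch of how script_wrapper is applied: the generated job body is
# substituted at the %s placeholder via printf-style formatting.
script_wrapper = "module restore\nmodule load ondemand-vnc\n%s"
job_body = 'echo "starting desktop..."'  # stand-in for the real generated script

# The wrapper lines come first, then the job body.
print(script_wrapper % job_body)
```

This is also why a wrapper that fails to parse out of the YAML leaves you with just the bare job body and none of your exports.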

Also, it looks like you can set the websockify command directly with this parameter. Though this will only get you halfway there: you still won't get the other settings from the script_wrapper, which isn't being applied for you now.

  vnc:
    # ...
    websockify_cmd: "/usr/bin/websockify"

@rengland You have a misspelling in the YAML file. The pasted contents use the word batch_connnect, not batch_connect. We have an issue open to add validation of the cluster config YAML structure so we can provide helpful error messages. We should prioritize this work.
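To illustrate why this fails silently (quick snippet using plain Python with PyYAML, not OnDemand code): a misspelled key is still perfectly valid YAML, so the file parses without error and the key OnDemand looks for is simply absent, which is why everything fell back to defaults.

```python
# A misspelled key is still valid YAML: the file parses cleanly, but
# lookups for the correct key find nothing and fall back to defaults.
import yaml

doc = yaml.safe_load("""
v2:
  batch_connnect:  # typo: three n's
    vnc:
      script_wrapper: |
        export WEBSOCKIFY_CMD="/usr/bin/websockify"
        %s
""")

print("batch_connect" in doc["v2"])   # the key OnDemand looks for: absent
print("batch_connnect" in doc["v2"])  # the key that actually parsed: present
```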

Thank you. I did catch this error yesterday morning after a good night’s rest. Though the syntax fix alone wasn’t sufficient to get the configuration working, I have a feeling it wouldn’t be working without it.

Bringing the single lines into one with the \n separators, along with fixing the “batch_connnect” misspelling, seems to have worked – the environment settings are now being logged, and websockify is now running on the client.

Something strange happened, though, after I made the change: I started seeing Slurm connectivity errors, and my batch jobs weren't being submitted. I had seen such errors before, so I had a general idea where to look for the problem.

Our HPC (production) environment has but one large cluster, so we have never named it in our Slurm configuration; we just use the default "slurm_cluster" name. OnDemand doesn't seem to deal with this very well, at least not consistently. I first tried removing the "cluster: slurm_cluster" definitions from both configuration files (in clusters.d and in apps/bc_desktop). After that, I was able to submit Slurm jobs (and thus start the desktop processes on the compute node), but I received errors when I attempted to show job status within OOD. To fix that, I had to put "cluster: slurm_cluster" back into the configuration in apps/bc_desktop. Both features now appear to do what they are intended to do: showing job status, and starting the Mate desktop processes and websockify.

Unfortunately, I still can’t connect to my desktop. :slightly_frowning_face:

Here is the error I see in the output.log:

(mate-panel:13891): GLib-GObject-CRITICAL **: 08:23:22.632: g_object_unref: assertion 'G_IS_OBJECT (object)' failed
mate-session[13834]: CRITICAL: gsm_systemd_set_session_idle: assertion 'session_path != NULL' failed

Any ideas as to how to get past this?

Thanks so much for all your help. Among open source user communities I’ve encountered, this is one of the best I’ve seen so far with regard to responsiveness. Politeness, too. Your assistance is greatly appreciated.

Thanks! That does sound strange.

I imagine it was just the batch_connect misspelling that was the issue. If you want to move back to |, I think that'd be fine; indeed, it's a lot more readable that way. The single-line form was just a suggestion to try.

On the error described: do you use CAS for authentication, by any chance? A Google search for that error actually turned up this thread, which has that exact error in it (in the post above the one linked). It looks like they needed to add CASScope to their Apache configs.

No, not CAS - we’re using Shibboleth with DUO.

I just discovered that if I edit the URL in my browser to use the FQDN of my client instead of the short name, noVNC is able to connect. That's a little odd, too, since DNS is configured such that either name resolves to the same IP address, but I feel like I'm on the right track.

Yeah, I'm not sure why, but VNC needs the FQDN. There's also this topic, where the person had to change the hostname.

It's working now. As mentioned in the topic you linked, I just had to change the host_regex back to the default ‘[^/]+’. I'd had it set that way before, but in my effort to get things working, I'd experimented with a few different settings and apparently never changed that one back.

Again, thanks much to both of you for your help. You guys are the best!


Glad it is working! I recommend trying to determine a host regex that captures the hosts you allow without permitting every host. The idea of the regex is to limit which requests from authenticated users go through the "dumb proxy," which uses the host and port embedded in the URL from the user (i.e. /rnode/HOST/PORT) to determine which backend server to proxy to. If changing the host_regex back to the default fixed the problem, then of course the previous host_regex was too restrictive, but there may be a less restrictive one that is still preferable to allowing any host at all. See the tip, warning, and danger boxes in this section of the documentation: https://osc.github.io/ood-documentation/master/app-development/interactive/setup/enable-reverse-proxy.html#steps-to-enable-in-apache
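As a quick sketch of the difference (domain borrowed from the cluster config earlier in this thread; adjust the pattern for your own site), the default pattern matches essentially any host, while a domain-scoped pattern only admits hosts in your domain. Note that a scoped pattern must match the FQDN form, which is consistent with the short-name failure you saw:

```python
# Sketch: default host_regex vs. a hypothetical domain-scoped one.
# The domain is borrowed from the cluster config above; adjust for your site.
import re

default_re = re.compile(r'[^/]+')                        # matches any host
scoped_re = re.compile(r'[\w.-]+\.hpc\.myschool\.edu')   # this domain only

for host in ("node001.hpc.myschool.edu", "node001", "evil.example.com"):
    print(host, bool(default_re.fullmatch(host)), bool(scoped_re.fullmatch(host)))
```

Only the FQDN passes the scoped pattern; the short name and an arbitrary outside host do not, while the default admits all three.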