Module not found launching Desktop

The relevant logs should come before these messages, specifically the ones about vncserver starting (or, in your case, not starting). vncserver starts and sets up the environment, then we capture the $DISPLAY and run the subshell (mate.sh in this case) with DISPLAY set in the command. That’s one of the main things that needs to happen in job_script_content.sh.

Here’s the relevant snippet from my logfile. vncserver started with DISPLAY :14.

# this is the very first line in output.log

Restoring modules from user's default, for system: "owens"
Setting VNC password...
Starting VNC server...
Killing Xvnc process ID 91037
Xvnc process ID 91037 already killed
Xvnc did not appear to shut down cleanly. Removing /tmp/.X11-unix/X12
Xvnc did not appear to shut down cleanly. Removing /tmp/.X12-lock

Warning: o0808.ten.osc.edu:1 is taken because of /tmp/.X1-lock
Remove this file if there is no X server o0808.ten.osc.edu:1

# ... removed for brevity 

Desktop 'TurboVNC: o0808.ten.osc.edu:14 (johrstrom)' started on display o0808.ten.osc.edu:14

Log file is vnc.log
Successfully started VNC server on o0808.ten.osc.edu:5914...
Script starting...
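
As a rough illustration of how the display number in that log relates to the VNC port, here is a minimal sketch. The variable names are hypothetical and this is not necessarily how job_script_content.sh does it; it only assumes the "Desktop ... started on display host:N" line format shown above and the usual 5900 + N port convention, which matches the :14 and 5914 in the log.

```shell
# Sample line copied from the log above.
vnc_out="Desktop 'TurboVNC: o0808.ten.osc.edu:14 (johrstrom)' started on display o0808.ten.osc.edu:14"

# Take the text after the last colon on the "Desktop" line as the display number.
display="$(printf '%s\n' "${vnc_out}" | awk -F':' '/^Desktop/{print $NF}')"

# VNC servers conventionally listen on 5900 + display number.
port=$((5900 + display))

echo "DISPLAY=:${display} port=${port}"
```

If no "Desktop" line appears in the job's output.log at all, vncserver never started, which is the situation described in this thread.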

Our output.log file has the following.

Script starting…
Generating connection YAML file…
Launching desktop ‘mate’…
Desktop ‘mate’ ended…
Cleaning up…

I don’t see any reference to the DISPLAY variable or to vncserver in the script.sh file.

#!/usr/bin/env bash

cd "${HOME}"
module purge && module restore
export SHELL="$(getent passwd $USER | cut -d: -f7)"
echo "Launching desktop 'mate'..."
source "/home/thomasbr/ondemand/data/sys/dashboard/batch_connect/sys/bc_desktop/ivy/output/bc860540-4964-435f-afb0-603b29d8745b/desktops/mate.sh"
echo "Desktop 'mate' ended..."

Here is the form.yml file


attributes:
  desktop: "mate"
  bc_vnc_idle: 0
  bc_vnc_resolution:
    required: true
  node_type: null

form:
  - bc_vnc_idle
  - desktop
  - bc_account
  - bc_num_hours
  - bc_num_slots
  - node_type
  - bc_queue
  - bc_vnc_resolution
  - bc_email_on_started

Here is the submit.yml.erb file.


batch_connect:
  template: vnc

It looks like you’re not using a vnc template. You actually don’t need to specify the vnc template because we already do.

---
batch_connect:
  template: vnc

It looks like you have specified this, but maybe it’s getting tripped up on it being in that directory. We usually put it in a sub directory and specify it. This file specifically tells the scheduler what sort of resources to use, using script.native. Note that we don’t specify any batch_connect attributes.

Just FYI for debugging in the future: job_script_content.sh is what gets submitted. It’s the wrapper utility script that starts the actual script, script.sh. What’s in the job_script_content.sh file (which utilities are added) is determined by the batch_connect.template attribute (vnc or basic).

The form.yml and submit.yml.erb file were the ones installed with the OOD software. Sorry I confused the post by including them. When it stops it displays “This app is missing information required to establish a connection. Please contact support if you see this message.” Here is the cluster configuration file.

v2:
  metadata:
    title: "Ivy Cluster"
  login:
    host: "ivy.hpcc.ttu.edu"
  job:
    adapter: "sge"
    cluster: "ivy"
    bin: "/export/uge/bin/lx-amd64"
    conf: "/tmp/test_sge.conf"
    sge_root: "/export/uge"
    libdrmaa_path: "/export/uge/lib/lx-amd64/libdrmaa.so"
  batch_connect:
    basic:
      script_wrapper: |
        module purge
        %s
    vnc:
      script_wrapper: |
        module purge
        export PATH="/opt/TurboVNC/bin:$PATH"
        export WEBSOCKIFY_CMD="/opt/websockify/run"
        %s

Start of the /var/log/ondemand-nginx/thomasbr/error.log when I start the launcher. Seems like the qsub is correct.

App 9968 output: [2020-03-31 16:13:41 -0500 ] INFO “execve = [{}, “/export/uge/bin/lx-amd64/qsub”, “-wd”, “/home/thomasbr/ondemand/data/sys/dashboard/batch_connect/sys/bc_desktop/ivy/output/943b157e-ba75-4163-b530-9435d0ec9855”, “-N”, “desktop_interactive”, “-o”, “/home/thomasbr/ondemand/data/sys/dashboard/batch_connect/sys/bc_desktop/ivy/output/943b157e-ba75-4163-b530-9435d0ec9855/output.log”, “-q”, “ondemand”, “-l”, “h_rt=08:00:00”, “-P”, “communitycluster”, “-pe”, “sm”, “4”]”
App 9968 output: [2020-03-31 16:13:41 -0500 ] INFO “method=POST path=/pun/sys/dashboard/batch_connect/sys/bc_desktop/ivy/session_contexts format=html controller=BatchConnect::SessionContextsController action=create status=302 duration=331.08 view=0.00 location=https://ondemand.hpcc.ttu.edu/pun/sys/dashboard/batch_connect/sessions”

Later in the file we encounter an error.

App 9968 output: errno2str(18) = DRMAA_ERRNO_INVALID_JOB
App 9968 output:
App 9968 output: [2020-03-31 16:14:22 -0500 ] INFO “method=GET path=/pun/sys/dashboard/batch_connect/sessions.js format=js controller=BatchConnect::SessionsController action=index status=200 duration=240.00 view=25.84”
App 9968 output: [2020-03-31 16:14:32 -0500 ] INFO “execve = [{}, “/export/uge/bin/lx-amd64/qstat”, “-r”, “-xml”, “-j”, “745437”]”

It seems you’re using a basic template (batch_connect.template: basic), so the wrapper script isn’t starting VNC. That’s the “missing information” it’s complaining about: the VNC connection information. I think your cluster config seems OK.

Where is this form.yml you’re using to submit this desktop? What’s the directory location? What other files are in that directory?

Thanks for the other logs, but I think your job logs are the most relevant. They indicate a couple of things to me: (a) you’re able to submit to the scheduler fine, so there are no issues with that, and (b) there’s nothing about VNC starting up, so there are indeed issues there.

Did you check your job_script_content.sh? It should have VNC-related steps like this:

# Start up vnc server (if at first you don't succeed, try, try again)
echo "Starting VNC server..."
for i in $(seq 1 10); do
  # Clean up any old VNC sessions that weren't cleaned before
  vncserver -list | awk '/^:/{system("kill -0 "$2" 2>/dev/null || vncserver -kill "$1)}'

  # Attempt to start VNC server
  VNC_OUT=$(vncserver -log "vnc.log" -rfbauth "vnc.passwd" -nohttpd -noxstartup -geometry 1536x780 -idletimeout 0  2>&1)
  VNC_PID=$(pgrep -s 0 Xvnc) # the script above will daemonize the Xvnc process
  echo "${VNC_OUT}"

  # Sometimes Xvnc hangs if it fails to find a working display, we
  # should kill it and try again
  kill -0 ${VNC_PID} 2>/dev/null && [[ "${VNC_OUT}" =~ "Fatal server error" ]] && kill -TERM ${VNC_PID}

  # Check that Xvnc process is running, if not assume it died and
  # wait some random period of time before restarting
  kill -0 ${VNC_PID} 2>/dev/null || sleep 0.$(random_number 1 9)s

# ... and so on

Here is the /etc/ood/config/apps/bc_desktop/ivy.yml file


title: "Ivy Desktop"
cluster: "ivy"
submit: "submit/submit.yml.erb"

attributes:
  bc_account: "communitycluster"
  bc_queue: "ondemand"
  bc_num_slots:
    value: 1
  bc_num_hours:
    value: 8

The only thing in the directory is the submit directory and the ivy.yml file. In the /etc/ood/config/apps/bc_desktop/submit directory is the submit.yml.erb file. Do I need to change the “template: basic” to “template: vnc”?


batch_connect:
  template: "basic"

script:
  queue_name: "ondemand"
  accounting_id: "communitycluster"
  job_name: "desktop_interactive"
  native:
    - "-pe"
    - "sm"
    - "4"

Where should the job_script_content.sh be located? I can’t find it on the OOD node.

Yes, absolutely, change it to “template: vnc”. This is the cause of your issue.

Looks like it is finding the VNC server now but is still encountering an error. Here is the output.log file

Setting VNC password…
Starting VNC server…

Desktop ‘TurboVNC: compute-19-10:1 (thomasbr)’ started on display compute-19-10:1

Log file is vnc.log
Successfully started VNC server on compute-19-10:5901…
Script starting…
Starting websocket server…
Launching desktop ‘mate’…
Scanning VNC log file for user authentications…

The vnc.log file

TurboVNC Server (Xvnc) 64-bit v2.2.4 (build 20200128)
Copyright © 1999-2020 The VirtualGL Project and many others (see README.txt)
Visit http://www.TurboVNC.org for more information on TurboVNC

01/04/2020 08:38:59 Using security configuration file /etc/turbovncserver-security.conf
01/04/2020 08:38:59 Enabled security type ‘tlsvnc’
01/04/2020 08:38:59 Enabled security type ‘tlsotp’
01/04/2020 08:38:59 Enabled security type ‘tlsplain’
01/04/2020 08:38:59 Enabled security type ‘x509vnc’
01/04/2020 08:38:59 Enabled security type ‘x509otp’
01/04/2020 08:38:59 Enabled security type ‘x509plain’
01/04/2020 08:38:59 Enabled security type ‘vnc’
01/04/2020 08:38:59 Enabled security type ‘otp’
01/04/2020 08:38:59 Enabled security type ‘unixlogin’
01/04/2020 08:38:59 Enabled security type ‘plain’
01/04/2020 08:38:59 Desktop name ‘TurboVNC: compute-19-10:1 (thomasbr)’ (compute-19-10:1)
01/04/2020 08:38:59 Protocol versions supported: 3.3, 3.7, 3.8, 3.7t, 3.8t
01/04/2020 08:38:59 Listening for VNC connections on TCP port 5901
01/04/2020 08:38:59 Interface 0.0.0.0
01/04/2020 08:38:59 Framebuffer: BGRX 8/8/8/8
01/04/2020 08:38:59 New desktop size: 800 x 600
01/04/2020 08:38:59 New screen layout:
01/04/2020 08:38:59 0x00000040 (output 0x00000040): 800x600+0+0
01/04/2020 08:38:59 Maximum clipboard transfer size: 1048576 bytes
01/04/2020 08:38:59 VNC extension running!

And scheduler error file

/export/uge/default/spool/compute-19-10/job_scripts/745449: line 2: module: command not found
/home/thomasbr/ondemand/data/sys/dashboard/batch_connect/sys/bc_desktop/ivy/output/ede8e8e7-2d43-4486-b4ae-4f627b75917f/script.sh: line 7: module: command not found
websockify/websocket.py:30: UserWarning: no ‘numpy’ module, HyBi protocol will be slower
warnings.warn(“no ‘numpy’ module, HyBi protocol will be slower”)
WebSocket server settings:
  - Listen on :9576
  - No SSL/TLS support (no cert file)
  - Backgrounding (daemon)
/export/uge/default/spool/compute-19-10/job_scripts/745449: line 156: syntax error near unexpected token `<'
/export/uge/default/spool/compute-19-10/job_scripts/745449: line 156: `done < <(tail -f --pid=${SCRIPT_PID} "vnc.log") &'

OK we’re making progress!

Cool, so that is the line that keeps the job running, by having the outer/wrapper shell tail the vnc.log file until the script’s PID exits. It seems the bash-ism done < <(command) is what’s failing. Is your default shell something other than bash (sh, maybe)?

It is bash:

[root@ivy submit]# echo $SHELL
/bin/bash

That’s the submit host or web server. What about the compute node that job ran on? Is this something UGE sets or changes during a job run?

The compute node uses bash and the UGE scheduler uses sh (#!/bin/sh).

Can you add #!/bin/bash header in the script_wrapper element of your cluster file? This is my only guess, that it’s not recognizing bash as the interpreter and instead just using sh.

v2:
    batch_connect:
      basic:
        script_wrapper: |
          #!/bin/bash
          module purge
          %s
      vnc:
        script_wrapper: |
          #!/bin/bash
          module purge
          export PATH="/opt/TurboVNC/bin:$PATH"
          export WEBSOCKIFY_CMD="/opt/websockify/run"
          %s

If this doesn’t work, then according to the UGE documentation we may be able to try -S /bin/bash to force it.

---
script:
  queue_name: "ondemand"
  accounting_id: "communitycluster"
  job_name: "desktop_interactive"
  native:
    - "-pe"
    - "sm"
    - "4"
    - "-S"
    - "/bin/bash"

Made both changes separately (and at the same time) and still getting the same error. Noticed that it had to shut down a previous Xvnc process.

output.log file:

Setting VNC password…
Starting VNC server…
Xvnc did not appear to shut down cleanly. Removing /tmp/.X11-unix/X1
Xvnc did not appear to shut down cleanly. Removing /tmp/.X1-lock

Desktop ‘TurboVNC: compute-19-10:1 (thomasbr)’ started on display compute-19-10:1

Log file is vnc.log
Successfully started VNC server on compute-19-10:5901…
Script starting…
Starting websocket server…
Launching desktop ‘mate’…
Scanning VNC log file for user authentications…

And scheduler error file:

/export/uge/default/spool/compute-19-10/job_scripts/745456: line 2: module: command not found

Warning: compute-19-10:1 is taken because of /tmp/.X1-lock
Remove this file if there is no X server compute-19-10:1
Killing Xvnc process ID 88157
Xvnc process ID 88157 already killed
/home/thomasbr/ondemand/data/sys/dashboard/batch_connect/sys/bc_desktop/ivy/output/09d04aa7-7e48-4226-8687-0c90cbbf55f6/script.sh: line 7: module: command not found
websockify/websocket.py:30: UserWarning: no ‘numpy’ module, HyBi protocol will be slower
warnings.warn(“no ‘numpy’ module, HyBi protocol will be slower”)
WebSocket server settings:
  - Listen on :31013
  - No SSL/TLS support (no cert file)
  - Backgrounding (daemon)
/export/uge/default/spool/compute-19-10/job_scripts/745456: line 156: syntax error near unexpected token `<'
/export/uge/default/spool/compute-19-10/job_scripts/745456: line 156: `done < <(tail -f --pid=${SCRIPT_PID} "vnc.log") &'

I can tell you for sure that this is an interpreter issue.

Here I’ve got a simple test on a very simple file where if I invoke it with sh it fails and if I use bash it succeeds (syntactically anyhow).
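
A minimal sketch of that kind of test, assuming a throwaway file created with mktemp (the file and its contents are hypothetical, not the original test). Whether sh rejects the file depends on what /bin/sh is on the node and, when sh is really bash, on the bash version, so the sketch reports the result rather than assuming it.

```shell
# Hypothetical reproduction: a tiny script that uses process substitution,
# the same construct as the failing "done < <(tail -f ...)" line.
test_file="$(mktemp)"
cat > "${test_file}" <<'EOF'
while read -r line; do
  echo "got: ${line}"
done < <(printf 'one\ntwo\n')
EOF

# bash in its default mode understands process substitution.
bash_out="$(bash "${test_file}")"
echo "${bash_out}"

# sh may be dash, or bash running in POSIX mode; either can reject the syntax.
if sh "${test_file}" >/dev/null 2>&1; then
  echo "sh accepted process substitution on this system"
else
  echo "sh rejected process substitution on this system"
fi

rm -f "${test_file}"
```

If the sh branch reports a rejection on the compute node, that matches the "syntax error near unexpected token" message in the scheduler error file.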

Can you check your job_script_options.json to be sure that the -S /bin/bash is being set? Is your bash at /bin/bash or is it /usr/bin/bash?

I think that’s our best bet, to get UGE to enforce the interpreter.

Here is the job_script_options.json file for the Desktop App.

{
  "job_name": "desktop_interactive",
  "workdir": "/home/thomasbr/ondemand/data/sys/dashboard/batch_connect/sys/bc_desktop/ivy/output/04e27157-270e-495e-bb99-3164594704cf",
  "output_path": "/home/thomasbr/ondemand/data/sys/dashboard/batch_connect/sys/bc_desktop/ivy/output/04e27157-270e-495e-bb99-3164594704cf/output.log",
  "shell_path": "/bin/bash",
  "accounting_id": "communitycluster",
  "wall_time": 28800,
  "queue_name": "ondemand",
  "email_on_started": false,
  "native": [
    "-pe",
    "sm",
    "4",
    "-S",
    "/bin/bash"
  ]
}

We have bash at both /bin/bash and /usr/bin/bash.

Went ahead and created a scheduler script, test_submit.sh, where we set the shell to bash (-S /bin/bash). Submitted it via qsub test_submit.sh.

#!/bin/sh
#$ -V
#$ -cwd
#$ -S /bin/bash
#$ -o $JOB_NAME.o$JOB_ID
#$ -e $JOB_NAME.e$JOB_ID
#$ -q ivy
#$ -pe ivy 20
#$ -l h_vmem=1G
#$ -l h_rt=00:05:00
#$ -P hrothgar

./test

When it runs it still is showing the error even though when it is executed at the command prompt on the compute node with “bash test” or “./test” it doesn’t show the error. The UGE scheduler is not doing what it should be doing. Will talk to the systems people tomorrow.

Boston University uses SGE and we also ran into this problem. Our script_wrapper begins with:

set +o posix

This tells /bin/sh to behave like /bin/bash and allows process substitution to work. It also
looks like the module command isn’t defined. To deal with that you might also need to add:

. /etc/bashrc

What we actually do is add this:

. ~/.bashrc

which sources /etc/bashrc but also brings in some user settings (umask in particular).
hope this helps.

–Mike Dugan
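
For anyone wanting to confirm what set +o posix changes, here is a small sketch, assuming bash is on the PATH. It starts bash in POSIX mode (roughly how bash behaves when it is invoked as sh) and inspects the SHELLOPTS variable before and after. The check string and variable names are only illustrations.

```shell
# Hypothetical check: report whether a bash process is in POSIX mode
# by looking for "posix" in the colon-separated SHELLOPTS list.
check='case ":${SHELLOPTS}:" in *:posix:*) echo "posix on";; *) echo "posix off";; esac'

# bash started in POSIX mode.
as_posix="$(bash --posix -c "${check}")"

# Same start, but with POSIX mode switched off first, as in the script_wrapper.
after_reset="$(bash --posix -c "set +o posix; ${check}")"

echo "started with --posix: ${as_posix}"
echo "after set +o posix:   ${after_reset}"
```

This only helps when /bin/sh is actually bash; a different sh (dash, for example) has no posix option to turn off.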


Mike,

Thanks for responding to the thread!

Added the “set +o posix” and “. ~/.bashrc” to the script_wrapper. The Desktop App is now working.

v2:
  metadata:
    title: "Ivy Cluster"
  login:
    host: "ivy.hpcc.ttu.edu"
  job:
    adapter: "sge"
    cluster: "ivy"
    bin: "/export/uge/bin/lx-amd64"
    conf: "/tmp/test_sge.conf"
    sge_root: "/export/uge"
    libdrmaa_path: "/export/uge/lib/lx-amd64/libdrmaa.so"
  batch_connect:
    basic:
      script_wrapper: |
        set +o posix
        . ~/.bashrc
        module purge
        %s
    vnc:
      script_wrapper: |
        set +o posix
        . ~/.bashrc
        module purge
        export PATH="/opt/TurboVNC/bin:$PATH"
        export WEBSOCKIFY_CMD="/opt/websockify/run"
        %s

Still seeing the warning in the UGE error file.

/export/uge/default/spool/compute-19-10/job_scripts/745456: line 2: module: command not found

@dugan thank you so much for commenting!!!

@thomasbrTTU you still have the module purge command in your script wrapper. It appears you don’t use modules so you can remove that line.
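
If removing the line outright is not an option (say the same wrapper is shared with a cluster that does use modules), one alternative is to guard the call. This is only a sketch under that assumption, and the function name is hypothetical:

```shell
# Hypothetical helper: only call "module" if the shell actually defines it.
purge_modules_if_available() {
  if command -v module >/dev/null 2>&1; then
    module purge
  else
    echo "module command not available; skipping purge"
  fi
}

purge_modules_if_available
```

On a site without modules this prints the skip message instead of the "module: command not found" warning.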

I will remove it. Have more questions about Jupyter notebooks. Will start a new post. Jeff thanks for all of your help.