Interactive Desktop

Hello
I’m trying to enable Interactive Desktop.
For testing purposes, I have a cluster with one compute node and the linux_host adapter configured.
When I select ‘Launch’, the following error is returned, even though OOD_JOB_NAME_ILLEGAL_CHARS: “/” is set in /etc/ood/config/nginx_stage.yml:

Failed to submit session with the following error:
no implicit conversion of nil into String

  • If this job failed to submit because of an invalid job name please ask your administrator to configure OnDemand to set the environment variable OOD_JOB_NAME_ILLEGAL_CHARS.
  • The session data for this session can be accessed under the staged root directory.

How can I fix this?

I’ve moved one step closer.
Installing the Singularity container on the compute node removed the error above, but now the session only shows:
Your session is currently starting… Please be patient as this process can take a few minutes.
How do I get the session ready so that the ‘Launch remote desktop’ button becomes available?

Look in the app’s generated directory (the link is in the starting section). The output file should give some indication of where it is hanging.

Apps go from starting to running when there’s a connection.yml written out in the output directory (the staged directory where all the job related files are - there’s a link to it in your job’s card).
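As a rough illustration of that rule, here is a minimal Python sketch. It assumes the only signal is whether connection.yml exists in the staged directory; the function name session_state is made up for this example and this is not OnDemand's actual code.

```python
from pathlib import Path


def session_state(staged_dir):
    """Return 'running' once connection.yml exists in staged_dir, else 'starting'."""
    # Simplified assumption: only the presence of the file is checked,
    # not its contents or the state of the underlying job.
    if (Path(staged_dir) / "connection.yml").is_file():
        return "running"
    return "starting"
```

Pointing it at the staged directory linked from the job's card should tell you which side of the transition the session is stuck on.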

We should set this file up and write it automatically. I’d make sure that you can write files to that location (deep in your home directory). We’ve seen issues with overlay file systems where files written inside the container are not synced to the actual filesystem.
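One way to test for that overlay problem is to write a marker file from inside the container and then look for it from outside. The sketch below is only an assumption about how you might do that; the helper names write_marker and marker_visible are hypothetical and not part of OnDemand.

```python
import os
import uuid
from pathlib import Path


def write_marker(directory):
    """Create a uniquely named marker file in directory and return its path."""
    marker = Path(directory) / f"ood_write_test_{uuid.uuid4().hex}"
    with open(marker, "w") as handle:
        handle.write("written from inside the job environment\n")
        handle.flush()
        os.fsync(handle.fileno())  # ask the OS to push the data to the backing filesystem
    return marker


def marker_visible(marker):
    """Return True if the marker file can be seen from the current environment."""
    return Path(marker).is_file()
```

Run write_marker against the staged directory from inside the container on the compute node, then call marker_visible with the returned path from outside the container. If it comes back False, the write probably landed only in the container's overlay.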

You can see the troubleshooting guide here on how to re-run these jobs manually from a shell session instead of through OOD.

https://osc.github.io/ood-documentation/latest/installation/resource-manager/linuxhost.html#troubleshooting

In the output directory, there is neither a connection.yml nor an output file.

$ sudo ls -al /home/<myUserName>/ondemand/data/sys/dashboard/batch_connect/sys/bc_desktop/chamber2mate/output/b5a9dff8-6853-4609-bb77-96e0cd20af06/
total 24
drwxr-xr-x 3 <myUserName> <myUserName>  149 May  4 05:19 .
drwxr-xr-x 9 <myUserName> <myUserName>  314 May  4 05:19 ..
-rwxr-xr-x 1 <myUserName> <myUserName>  100 May  4 05:19 before.sh
drwxr-xr-x 2 <myUserName> <myUserName>   66 Apr 27 07:36 desktops
-rw-r--r-- 1 <myUserName> <myUserName> 6943 May  4 05:19 job_script_content.sh
-rw-r--r-- 1 <myUserName> <myUserName>  389 May  4 05:19 job_script_options.json
-rwxr-xr-x 1 <myUserName> <myUserName>  540 May  4 05:19 script.sh
-rw-r--r-- 1 <myUserName> <myUserName>   27 May  4 05:19 user_defined_context.json

and /var/log/ondemand-nginx/<myUserName>/error.log repeatedly includes the lines below:

App 3149 output: [2021-05-04 05:48:22 +0000 ] INFO “method=GET path=/pun/sys/dashboard/batch_connect/sessions.js format=js controller=BatchConnect::SessionsController action=index status=200 duration=713.12 view=13.53”
App 3149 output: [2021-05-04 05:48:33 +0000 ] INFO “execve = [{}, “ssh”, “-t”, “-o”, “BatchMode=yes”, “-o”, “UserKnownHostsFile=/dev/null”, “-o”, “StrictHostKeyChecking=no”, “@<submit_host=ssh_hosts>”, “tmux”, “list-panes”, “-aF”, “\\#\\{session_name\\}\\\u001F\\#\\{session_created\\}\\\u001F\\#\\{pane_pid\\}”]”

It seems like the job is not starting at all via your resource manager. Assuming you’re using one, is there any info in your slurmctld logs (if using Slurm)?

Yeah, all the files you have there are written on the OOD host. The connection.yml file is written during job execution on the destination host.

I would turn debug on for the linux host adapter, which will keep the shell scripts that launch this job in that directory (or your HOME, depending on the version). Try to execute the tmux and singularity shell scripts manually, as described in the troubleshooting guide, adding echos and print statements to get a better picture of what’s going on.

Again, I’d imagine it’s an issue with the singularity container - that you’re writing those files, but they’re only being written to the overlay within the container and not the actual filesystem.

In fact, this tmux session may still be running, so you may be able to just connect into it and the singularity container and see whether that is the case.
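To find the session to connect to, you could list the tmux panes with the same format string that shows up in the error.log above (session name, creation time and pane PID, separated by the \u001F unit separator). This is a sketch under assumptions: the function names are made up, the ssh options are trimmed down from the logged command, and the host is whatever you have configured as your submit host.

```python
import shlex
import subprocess

UNIT_SEP = "\x1f"  # field separator seen in the logged format string
PANE_FORMAT = UNIT_SEP.join(["#{session_name}", "#{session_created}", "#{pane_pid}"])


def parse_panes(output):
    """Turn `tmux list-panes -aF` output into a list of dicts, one per pane."""
    panes = []
    for line in output.split("\n"):
        fields = line.split(UNIT_SEP)
        if len(fields) != 3:
            continue  # skip blank or unexpected lines
        name, created, pid = fields
        panes.append({"session": name, "created": int(created), "pid": int(pid)})
    return panes


def list_panes(host):
    """Run tmux on host over ssh and return the parsed panes (needs ssh access)."""
    # The format is quoted so the remote shell does not treat '#' as a comment.
    cmd = ["ssh", "-o", "BatchMode=yes", host,
           "tmux", "list-panes", "-aF", shlex.quote(PANE_FORMAT)]
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return parse_panes(result.stdout)
```

With a session name in hand, something like `tmux attach-session -t <session_name>` on that host should drop you into the running session.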
