Linux host adapter errors

I am trying to set up the Linux host adapter for our dedicated interactive nodes.

We set up host-based authentication, and I can ssh from the OnDemand server to the interactive node fine without a password, e.g. with the command that OOD uses:
ssh -t -o "BatchMode=yes" -o "UserKnownHostsFile=/dev/null" -o "StrictHostKeyChecking=no" u0101881@frisco1.chpc.utah.edu

I also created the clusters.d config file and the bc_desktop config file, and I think they are in reasonable shape. I'm not quite sure about the clusters.d file, where I also added a batch_connect section so that I can add the script_wrapper pieces (e.g. the PATH to TurboVNC, WEBSOCKIFY_CMD, …), but I think that should be correct, compared to the scheduler-based batch setup.

Now, when I push the button to start the interactive desktop, I get, in the OOD window:
Pseudo-terminal will not be allocated because stdin is not a terminal.
Warning: Permanently added 'frisco1.chpc.utah.edu,155.101.26.201' (ECDSA) to the list of known hosts.
Illegal variable name.
Badly placed ()'s.
Unmatched ".

I presume the first two lines are just warnings, but the other three suggest that something is wrong with some shell script. I went as far as injecting #!/bin/bash into job_script_content.sh, with no effect, and there's no output.log, so I suspect this error comes from somewhere before any of the job scripts get executed. My default shell is tcsh, so I suspect a "#!/usr/bin/env bash" is missing somewhere.
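
One quick way to confirm which login shell ssh will hand a piped script to is to query the passwd database. This is a minimal sketch; `login_shell_of` is a hypothetical helper name, and it assumes `getent` is available (falling back to /etc/passwd otherwise).

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a user's login shell from the passwd database.
# getent also sees LDAP/SSSD accounts; /etc/passwd is only a fallback.
login_shell_of() {
  local user="$1"
  if command -v getent >/dev/null 2>&1; then
    getent passwd "$user" | cut -d: -f7
  else
    awk -F: -v u="$user" '$1 == u { print $7 }' /etc/passwd
  fi
}

# Local check for the current user:
login_shell_of "$(id -un)"

# Remote check (host name from this thread). The command is parsed by the
# remote login shell, so it uses backticks, which tcsh also understands:
#   ssh -o BatchMode=yes frisco1.chpc.utah.edu 'getent passwd `id -un` | cut -d: -f7'
```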

Any thoughts on that?

Thanks,
MC

I would say turn on the debug flag (in the cluster configuration) and look at the debug files; that'll drop 2 files in your home directory.

One is the outer wrapper (ending in _tmux), which is fed to the ssh command on stdin. The other is what gets executed within the Singularity container (ending in _sing).

My guess is you’re failing on the outer script. The shebang is #!/bin/bash so I guess we could be getting into trouble there (instead of using env).

But that's where I would start; this way you can reproduce the error in an interactive terminal, modifying the file to see what's wrong.
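
A minimal sketch for picking up the newest debug file so it can be replayed by hand, assuming GNU find (for -printf) and that the files land directly in the home directory with the _tmux/_sing suffixes described above. `latest_debug_file` is a hypothetical helper, not part of OnDemand.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the newest file in DIR (default $HOME) whose
# name ends in SUFFIX (e.g. _tmux or _sing). Requires GNU find for -printf.
latest_debug_file() {
  local suffix="$1" dir="${2:-$HOME}"
  # Emit "<mtime> <path>", sort by mtime, keep the newest, drop the mtime.
  find "$dir" -maxdepth 1 -type f -name "*${suffix}" -printf '%T@ %p\n' \
    | sort -n | tail -n 1 | cut -d' ' -f2-
}

# Usage on the host the adapter SSHes into:
#   f=$(latest_debug_file _tmux) && bash -n "$f"   # syntax-check under bash
#   tcsh < "$f"                                    # replay under the login shell
```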

Hi Jeff,

I have debug: true set, and I am not seeing any _tmux file in the temp directory. This is what I have:
[05064afe-db1d-4d5b-844b-0951775c0c9e]$ ls -l
total 24
-rwxr-xr-x 1 u0101881 chpc 100 May 1 22:52 before.sh
drwxr-xr-x 2 u0101881 chpc 68 Apr 24 22:35 desktops
-rw-r--r-- 1 u0101881 chpc 7289 May 1 22:52 job_script_content.sh
-rw-r--r-- 1 u0101881 chpc 488 May 1 22:52 job_script_options.json
-rwxr-xr-x 1 u0101881 chpc 658 May 1 22:52 script.sh
-rw-r--r-- 1 u0101881 chpc 55 May 1 22:52 user_defined_context.json

[5064afe-db1d-4d5b-844b-0951775c0c9e]$ ls -l desktops/
total 12
-rwxr-xr-x 1 u0101881 chpc 736 Apr 23 20:25 gnome.sh
-rwxr-xr-x 1 u0101881 chpc 1324 Apr 23 20:25 mate.sh
-rwxr-xr-x 1 u0101881 chpc 1544 Apr 23 20:25 xfce.sh

My colleague Brett, whose default shell is bash, got further: his output.log got generated but is empty. The error in his browser just said that the job entered a bad state.

I think something may be off in our configuration. I was not sure about a few things in your docs; I'll detail that in my next post.

For setting up the host adapter, the docs at https://osc.github.io/ood-documentation/release-1.7/installation/resource-manager/linuxhost.html#resource-manager-linuxhost don't mention creating a remote desktop config file. Please add a link from that page to this one: https://osc.github.io/ood-documentation/release-1.7/enable-desktops/modify-form-attributes.html#minimal-linuxhost-form.

In our case, we have custom locations for TurboVNC and Websockify, so our bc_desktop/frisco.yml looks like:

title: "Frisco Desktop"
cluster: "frisco"
submit: "linux_host"
attributes:
  bc_queue: null
  bc_account: null
  bc_num_slots: 1
  num_cores: none

while the clusters.d/frisco.yml looks like:

v2:
  metadata:
    title: "Frisco"
    url: "https://www.chpc.utah.edu/documentation/guides/frisco-nodes.php"
    hidden: false
  login:
    host: "frisco1.chpc.utah.edu"
  job:
    adapter: "linux_host"
    submit_host: "frisco1.chpc.utah.edu"  # This is the head for a login round robin
    ssh_hosts: # These are the actual login nodes
      - frisco1.chpc.utah.edu
      - frisco2.chpc.utah.edu
      - frisco3.chpc.utah.edu
      - frisco4.chpc.utah.edu
      - frisco5.chpc.utah.edu
      - frisco6.chpc.utah.edu
      - frisco7.chpc.utah.edu
      - frisco8.chpc.utah.edu
    site_timeout: 7200
    debug: true
    singularity_bin: /uufs/chpc.utah.edu/sys/installdir/singularity3/std/bin/singularity
    singularity_bindpath: /etc,/mnt,/media,/opt,/run,/srv,/usr,/var,/uufs,/scratch
    singularity_image: /opt/ood/linuxhost_adapter/centos7_lmod.sif
    # Enabling strict host checking may cause the adapter to fail if the user's known_hosts does not have all the roundrobin hosts
    strict_host_checking: false
    tmux_bin: /usr/bin/tmux
  batch_connect:
    basic:
      script_wrapper: |
        #!/bin/bash
        set -x
        if [ -z "$LMOD_VERSION" ]; then
          source /etc/profile.d/chpc.sh
        fi
        export XDG_RUNTIME_DIR=$(mktemp -d)
        %s
      set_host: "host=$(hostname -A | awk '{print $2}')"
    vnc:
      script_wrapper: |
        #!/bin/bash
        set -x
        export PATH="/uufs/chpc.utah.edu/sys/installdir/turbovnc/std/opt/TurboVNC/bin:$PATH"
        export WEBSOCKIFY_CMD="/uufs/chpc.utah.edu/sys/installdir/websockify/0.8.0/bin/websockify"
        export XDG_RUNTIME_DIR=$(mktemp -d)
        %s
      set_host: "host=$(hostname -A | awk '{print $2}')"
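
The set_host lines keep only the second name that `hostname -A` prints; if a node ever reports a single name, host ends up empty. A hedged sketch of a variant that falls back to the first name. `pick_host` is a hypothetical helper, and the sample host names are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `hostname -A` output on stdin and print the second
# name, falling back to the first when only one name is reported.
pick_host() {
  awk '{ print (NF >= 2 ? $2 : $1) }'
}

# Sample `hostname -A` outputs (made-up names; note the trailing space):
printf 'node-a.internal frisco1.example.edu \n' | pick_host   # second name
printf 'frisco1.example.edu \n' | pick_host                   # only name

# On a real node: host=$(hostname -A | pick_host)
```

Inlined into the YAML (untested), it might read: set_host: "host=$(hostname -A | awk '{print (NF >= 2 ? $2 : $1)}')"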

I am setting num_cores because our default bc_desktop/submit.yml.erb has it as:

batch_connect:
  template: vnc
script:
  native:
    <%- if num_cores != "none" -%>
    - "-n <%= num_cores %>"
    <%- end -%>

Please let me know how these config files look to you. I have a feeling that I am missing something, since I don't see any singularity call in any of the job scripts. And I can ssh to the container fine (and thanks to all the bind mounts it can see our whole software stack; that's a great idea that I'll keep in mind for similar projects in the future).

Also, is there any way to get more debug info by default? I only get the script files listed above and the Apache logs that I pasted earlier.

OK, unfortunately it seems to write those debug files on the login host, after it SSHes into it.

This is the very beginning of what it's trying to do (the full file being here), and it's failing even while trying to set these variables (that's the "Illegal variable name" from the $(); it looks like tcsh only accepts backticks, ``).

#!/bin/bash

singularity_tmp_file=$(mktemp -p "$HOME" --suffix '_sing')
tmux_tmp_file=$(mktemp -p "$HOME" --suffix "_tmux")

Even though it has a shebang line, that isn't honored when the script is sent over stdin.

We’re effectively doing something like this: cat test.sh | ssh user@host when it appears we should be doing cat test.sh | ssh user@host /bin/bash to force bash execution (or something similar).
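
To see why the shebang doesn't help, here is a local sketch that pipes a bash-only snippet, in the style of the wrapper's first lines, into an explicitly named interpreter. Over ssh the equivalent would be `ssh user@host /bin/bash -s < test.sh`; this illustrates the idea and is not the adapter's actual fix. `run_piped` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# A bash-only snippet in the style of the wrapper's first lines. When it is
# piped on stdin, "#!/bin/bash" is just a comment to whichever shell reads it.
bash_only_script='#!/bin/bash
tmp_file=$(mktemp --suffix "_tmux")
echo "suffix=${tmp_file##*_}"
rm -f "$tmp_file"'

# Hypothetical helper: feed the snippet on stdin to the given command,
# like `cat test.sh | ssh user@host COMMAND`.
run_piped() {
  printf '%s\n' "$bash_only_script" | "$@"
}

# Naming the interpreter makes bash parse it regardless of the login shell.
run_piped /bin/bash -s
```

On a machine with tcsh installed, `run_piped tcsh` should instead reproduce the "Illegal variable name." error from the first post, since tcsh has no $() syntax.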

I don't think you'll make it very far in tcsh at the moment, because the initial script we're trying to execute over ssh isn't tcsh-compliant; as you indicate, it can't even do the initial step of writing out the 2 files (through cat heredocs) and then executing them.

That said, I think your config looks OK and your colleague with bash may have luck. I’m looking into tcsh compliance now.

Just for historical context: the LHA didn't work with some shells, like tcsh. The issue was ultimately fixed. Thanks @mcuma for testing and bringing this bug to our attention!