OOD as a teaching tool

Hi all,

I’m coming from a more academic setting than research/HPC, but I’m still interested in this project. I maintain a cluster of 32 identical nodes that uses weighted round robin to distribute students across nodes: no scheduler, no exclusive access. Process cleanup is handled by automated scripts.

I would like to use OOD as a web portal for file access and ephemeral shells (both of these work great with very little effort). I would also like to offer long-lasting tmux and VNC/IDE sessions through OOD that students can re-attach to easily.

I’m starting with VNC sessions. The linuxhost adapter seems to be the best fit, but I’m having some issues. I created a simple xfce desktop app. The session is created fine, and I can see the tmux session running the Xvnc server on the node, but “My Interactive Sessions” immediately shows the session as “Completed” instead of active, so there is no “Launch” button. “Active jobs” shows the job as running, but “Time Used” is 44000+.

I verified that the session itself is running without errors, and I can even connect to the websocket with noVNC after one additional step. I looked through the source code to figure out the Launch URL: /pun/sys/dashboard/noVNC-1.1.0/vnc.html?.. This launches noVNC set to connect to the reverse proxy websocket. For me, noVNC times out until I manually load the reverse proxy URL (/rnode///websockify) in my browser, which of course gives a “method not allowed” error. After that, however, noVNC connects just fine.

Is anyone using OOD as a teaching tool instead of an HPC tool? Maybe I’m missing it, but I’m not seeing support for connecting to interactive shell sessions. Any suggestions for doing a “simple” tmux interactive session?

I’m just using a single node to simplify debugging. OOD server is CentOS 7, and cluster nodes are CentOS 8.

#clusters.d/birch.yml
---
v2:
  metadata:
    title: "birch"
  login:
    host: birch.local
  job:
    adapter: "linux_host"
    submit_host: "birch.local"
    ssh_hosts:
      - birch.local
    site_timeout: 7200
    debug: true
    singularity_bin: /usr/bin/singularity
    singularity_bindpath: /run,/etc,/home,/web,/opt,/usr,/var
    singularity_image: /home/shared/ood/centos8.simg
    strict_host_checking: false
    tmux_bin: /usr/bin/tmux
  batch_connect:
    basic:
      script_wrapper: |
        module purge
        %s
    vnc:
      script_wrapper: |
        module purge
        export PATH="/opt/TurboVNC/bin:$PATH"
        export WEBSOCKIFY_CMD="/opt/websockify/run"
        %s

#apps/bc_desktop/birch.yml
---
title: "Graphical Desktop"
cluster: "birch"
form:
  - desktop
  - bc_num_hours
attributes:
  bc_num_hours:
    value: 8
  desktop: "xfce"

Thanks!

Hi and welcome! Lots to unpack here, but quickly: yes, we totally think OOD is a great choice as a teaching tool. We ourselves have a separate deployment that has fewer apps than our regular deployment. Those apps have fewer options and launch very contained, specialized environments, which instructors use for classes at universities in Ohio.

So, to your issue then. That’s odd that it would complete immediately in your interactive sessions but actually launch and be available. So I’d say let’s triage that first, then maybe we can tackle the timeouts.

Your configuration looks OK, but I would ask whether birch.local is a single host or whether it can resolve in DNS to several. There is a troubleshooting section for this adapter that may provide some help. If birch.local can resolve to more than one host, you’ll need to specify all the hosts it can resolve to in v2.job.ssh_hosts.

There’s a link to the staged root in the card. If you follow that link to the file directory, you’ll see a connection.yml file. Is the correct host in this file?

I would also question DNS resolution. From the OOD server you should be able to run ssh myuser@birch.local for all the ssh_hosts you’ve configured.

https://osc.github.io/ood-documentation/master/installation/resource-manager/linuxhost.html#troubleshooting
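
The DNS question above can be checked from the OOD server with a few lines of Ruby. This is only a minimal sketch using Ruby's standard resolv library; addresses_for and resolves_to_multiple_hosts? are hypothetical helper names and are not part of OOD or ood_core.

```ruby
require 'resolv'

# Return the unique addresses a hostname resolves to.
# `resolver` is injectable so the logic can be exercised without real DNS.
def addresses_for(hostname, resolver: Resolv)
  resolver.getaddresses(hostname).map(&:to_s).uniq
end

# True when the name maps to more than one address, in which case each
# backing host may need its own entry under v2.job.ssh_hosts.
def resolves_to_multiple_hosts?(hostname, resolver: Resolv)
  addresses_for(hostname, resolver: resolver).size > 1
end

# Example (performs a real lookup, so run it on the OOD server):
#   puts addresses_for('birch.local').inspect
#   puts resolves_to_multiple_hosts?('birch.local')
```

Note that more than one address does not always mean more than one machine (a single host can have both an IPv4 and an IPv6 record), so treat a true result as a prompt to look closer rather than as proof.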

Jeff,

Yes, the hostname resolves to a single host, and yes, the hostname in connection.yml matches. No problem SSHing from the OOD host to the login node.

(Yes, the hostname is actually birch.rlogin, not birch.local.)

Based on the source code: ood_core/launcher.rb at c6c833d343931521428d2f2bfddb8ed2c3468843 · OSC/ood_core · GitHub

Here is what tmux outputs:

and contents of ~ondemand/data/sys/dashboard/batch_connect/db/aab258f9-72ac-46d3-8546-89ccbfadf740

{"id":"aab258f9-72ac-46d3-8546-89ccbfadf740","cluster_id":"birch","job_id":"launched-by-ondemand-a2879247-eb97-450c-a102-3d1eb180e5bf@birch.rlogin","created_at":1616701486,"token":"sys/bc_desktop/birch","title":"Graphical Desktop","view":null,"script_type":"vnc","cache_completed":true}

I did read through the troubleshooting, but the session itself seems to be running just fine
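
For anyone wanting to inspect such a record without reading raw JSON, here is a small sketch that pulls out the fields relevant to this problem. The record text is copied from this post; summarize_session_record is a hypothetical helper, and reading cache_completed as "the dashboard has cached this session as completed" is an assumption based on the field name only.

```ruby
require 'json'

# Parse one batch_connect db record (a single JSON object) and keep only the
# fields that are interesting when a session shows up as "Completed".
def summarize_session_record(json_text)
  record = JSON.parse(json_text)
  {
    job_id: record['job_id'],
    cluster_id: record['cluster_id'],
    script_type: record['script_type'],
    cache_completed: record['cache_completed']
  }
end

record_text = <<~'JSON'
  {"id":"aab258f9-72ac-46d3-8546-89ccbfadf740","cluster_id":"birch","job_id":"launched-by-ondemand-a2879247-eb97-450c-a102-3d1eb180e5bf@birch.rlogin","created_at":1616701486,"token":"sys/bc_desktop/birch","title":"Graphical Desktop","view":null,"script_type":"vnc","cache_completed":true}
JSON

summary = summarize_session_record(record_text)
summary.each { |key, value| puts "#{key}: #{value}" }
```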

Thanks for your time!

OK, there has to be some sort of parsing error. As you indicate, that’s similar to what we use for tmux output, but you use commas as a separator and we seem to use a non-printable control character ("\x1F", the ASCII unit separator).

I wonder what your LANG is and whether that’s affecting the output. Mine is en_US.UTF-8 and it seems to work OK. I’m wondering if the tmux separator is somehow throwing the parsing off so that it can’t find the id and therefore assumes that the job is complete.
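
To make that hypothesis concrete, here is a rough sketch of how a separator mismatch could make a running session look missing. It is not the ood_core implementation: parse_line and the sample session name are made up for illustration, and the "<session_name>@<host>" id shape is taken from the job_id in the db record earlier in the thread.

```ruby
UNIT_SEPARATOR = "\x1F"
KEYS = [:session_name, :session_created, :session_pid].freeze

# Split one line of tmux output into a hash and build a "<name>@<host>" id.
def parse_line(line, host, separator: UNIT_SEPARATOR)
  KEYS.zip(line.split(separator)).to_h.tap do |fields|
    fields[:id] = "#{fields[:session_name]}@#{host}"
  end
end

fields      = ['launched-by-ondemand-abc', '1616701486', '3048321'] # hypothetical values
expected_id = 'launched-by-ondemand-abc@birch.rlogin'

# Output that really contains the unit separator parses into three fields.
good = parse_line(fields.join(UNIT_SEPARATOR), 'birch.rlogin')

# Output that arrives with a different separator stays in one piece, so the
# session name swallows the whole line and the id no longer matches.
bad = parse_line(fields.join(','), 'birch.rlogin')

puts "good id matches: #{good[:id] == expected_id}"
puts "bad id matches:  #{bad[:id] == expected_id}"
puts "bad session_created: #{bad[:session_created].inspect}"
```

If the id built from the parsed line never equals the stored job id, a lookup by id finds nothing, which would be consistent with the session being reported as completed.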

declare -x LANG="en_US.UTF-8" on both OOD server and the node(s).

Sorry for the confusion. I couldn’t tell from the source code what separator was used, so I just ran the command with a comma for readability. Any suggestions on how to track down the parsing error? I am not a Ruby expert and have trouble following all of the source. Here are some sanitized logs from the user’s PUN. It looks like there was a redirect, but I don’t see any errors logged.

App 25868 output: [2021-03-26 11:10:22 -0400 ]  INFO "method=GET path=/pun/sys/dashboard/batch_connect/sessions.js format=js controller=BatchConnect::SessionsController action=index status=200 duration=3.92 view=0.28"
App 25868 output: [2021-03-26 11:10:28 -0400 ]  INFO "execve = [\"git\", \"describe\", \"--always\", \"--tags\"]"
App 25868 output: [2021-03-26 11:10:28 -0400 ]  INFO "method=GET path=/pun/sys/dashboard/batch_connect/sys/bc_desktop/birch/session_contexts/new format=html controller=BatchConnect::SessionContextsController action=new status=200 duration=38.99 view=16.68"
App 25868 output: [2021-03-26 11:10:33 -0400 ]  INFO "execve = [{}, \"ssh\", \"-t\", \"-o\", \"BatchMode=yes\", \"-o\", \"UserKnownHostsFile=/dev/null\", \"-o\", \"StrictHostKeyChecking=no\", \"peontest@birch.rlogin\", \"/usr/bin/env\", \"bash\"]"
App 25868 output: [2021-03-26 11:10:34 -0400 ]  INFO "method=POST path=/pun/sys/dashboard/batch_connect/sys/bc_desktop/birch/session_contexts format=html controller=BatchConnect::SessionContextsController action=create status=302 duration=441.49 view=0.00 location=https://ondemand.host/pun/sys/dashboard/batch_connect/sessions"
App 25868 output: [2021-03-26 11:10:34 -0400 ]  INFO "execve = [{}, \"ssh\", \"-t\", \"-o\", \"BatchMode=yes\", \"-o\", \"UserKnownHostsFile=/dev/null\", \"-o\", \"StrictHostKeyChecking=no\", \"peontest@birch.rlogin\", \"tmux\", \"list-panes\", \"-aF\", \"\\\\#\\\\{session_name\\\\}\\\\\\u001F\\\\#\\\\{session_created\\\\}\\\\\\u001F\\\\#\\\\{pane_pid\\\\}\"]"
App 25868 output: [2021-03-26 11:10:34 -0400 ]  INFO "method=GET path=/pun/sys/dashboard/batch_connect/sessions format=html controller=BatchConnect::SessionsController action=index status=200 duration=259.86 view=7.72"
App 25868 output: [2021-03-26 11:10:46 -0400 ]  INFO "method=GET path=/pun/sys/dashboard/batch_connect/sessions.js format=js controller=BatchConnect::SessionsController action=index status=200 duration=8.69 view=4.05"

I tried recreating the tmux list-panes command as much as I can figure out from the log on the OOD server. Running as the user on the OOD server:

[peontest@ondemand ~]$ ssh -t -oBatchMode=yes -oUserKnownHostsFile=/dev/null -oStrictHostKeyChecking=no peontest@birch.rlogin tmux list-panes -aF \"\#\{session_name\}\u001F\#\{session_created\}\u001F\#\{pane_pid\}\"
Warning: Permanently added 'birch.rlogin,192.168.5.100' (ECDSA) to the list of known hosts.
launched-by-ondemand-5cfeb79f-76a6-453c-b92f-158fa90b56f3u001F1616771434u001F3048321
Connection to birch.rlogin closed.

Thanks
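
One detail worth noting about the manual reproduction above: the output contains the literal text u001F rather than a control character, because an unquoted \u in the shell is just an escaped letter u. The short sketch below contrasts the two in Ruby; the sample line is the one printed above, and the variable names are only for illustration.

```ruby
control = "\u001F" # double-quoted Ruby string: one control character
literal = 'u001F'  # the five plain characters the shell passed along

puts control.bytes.inspect # the single byte 0x1F (decimal 31)
puts control == "\x1F"     # same character as the "\x1F" spelling
puts literal.length

line = 'launched-by-ondemand-5cfeb79f-76a6-453c-b92f-158fa90b56f3u001F1616771434u001F3048321'

# Splitting on the real control character finds nothing to split on...
by_control = line.split(control)
# ...while splitting on the literal text recovers the three fields.
by_literal = line.split(literal)

puts by_control.size
puts by_literal.inspect
```

So this particular reproduction shows that tmux and ssh are reachable, but it does not exercise the same separator the dashboard sends.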

No issue at all with the separator. And thank you for replicating the ssh command. With that output, I believe we’ve found a bug that I’m able to replicate in test cases. It seems it doesn’t like that Warning message and fails to deal with it properly.

You may be able to reconfigure this so that your known hosts file is not /dev/null. The host will then already be known before the ssh command runs, and you won’t get the warning:
strict_host_checking: true

Well actually the test case I wrote was the issue. And that warning is in stderr not stdout, so I’m not entirely sure that is the issue.
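
The stderr/stdout distinction can be seen in isolation with Open3.capture3, which returns the two streams separately. In this sketch a child Ruby process stands in for ssh (the message text is only a placeholder), so nothing here touches a real host.

```ruby
require 'open3'
require 'rbconfig'

# The child writes one line to each stream, standing in for ssh printing
# its "Permanently added ..." warning on stderr and tmux output on stdout.
child = 'STDOUT.puts "session-line"; STDERR.puts "Warning: Permanently added host"'

stdout, stderr, status = Open3.capture3(RbConfig.ruby, '-e', child)

puts "exit ok: #{status.success?}"
puts "stdout:  #{stdout.inspect}"
puts "stderr:  #{stderr.inspect}"

# Only stdout would be split into session lines, so the warning text
# never reaches the parser.
lines = stdout.split("\n")
puts "lines to parse: #{lines.inspect}"
```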

I’ll write a small ruby file for you to replicate and try to put debugging information into.

I hit enter too early, so I hope you see this update.

Here’s a small script we can use to debug a little bit. You may have to gem install shellwords and gem install open3. Then you can just ruby delme.rb to execute it.

Note the spots you’ll have to change from my username and the host I’m checking on. I’ve marked them with CHANGEME.

puts is the function to print output. I’ve added one puts statement there printing the raw line we should be seeing. Of course, \x1F doesn’t actually show up in the output, but you can modify that and test to see if different separators behave differently.

#!/usr/bin/env ruby

require 'shellwords'
require 'open3'

UNIT_SEPARATOR = "\x1F"

def strict_host_checking
  false
end

def username
  # CHANGEME
  'johrstrom'
end

def hostname
  # CHANGEME
  'pitzer-login01.hpc.osc.edu'
end

def session_name_label
  'launched-by-ondemand'
end

def call(cmd, *args, env: {}, stdin: "")
  args  = args.map(&:to_s)
  env = env.to_h
  o, e, s = Open3.capture3(env, cmd, *args, stdin_data: stdin.to_s)
  s.success? ? o : raise(StandardError, e)
end

def ssh_cmd(destination_host, cmd)
  if strict_host_checking
    [
      'ssh', '-t',
      '-o', 'BatchMode=yes',
      "#{username}@#{destination_host}"
    ].concat(cmd)
  else
    [
      'ssh', '-t',
      '-o', 'BatchMode=yes',
      '-o', 'UserKnownHostsFile=/dev/null',
      '-o', 'StrictHostKeyChecking=no',
      "#{username}@#{destination_host}"
    ].concat(cmd)
  end
end

def list_remote_tmux_session(destination_host)
  # Note that the tmux variable substitution looks like Ruby string sub,
  # these must either be single quoted strings or Ruby-string escaped as well
  format_str = Shellwords.escape(['#{session_name}', '#{session_created}', '#{pane_pid}'].join(UNIT_SEPARATOR))
  keys = [:session_name, :session_created, :session_pid]
  cmd = ssh_cmd(destination_host, ['tmux', 'list-panes', '-aF', format_str])

  call(*cmd).split("\n").map do |line|
    puts "splitting line '#{line}'"
    Hash[keys.zip(line.split(UNIT_SEPARATOR))].tap do |session_hash|
      session_hash[:destination_host] = destination_host
      session_hash[:id] = "#{session_hash[:session_name]}@#{destination_host}"
    end
  end.select do |session_hash|
    session_hash[:session_name].start_with?(session_name_label)
  end
end

sessions = list_remote_tmux_session hostname

sessions.each do |session|
  puts "found session #{session}"
end

Here is the output I got, which appears to be working:

$ ruby delme.rb
ssh command ["ssh", "-t", "-o", "BatchMode=yes", "-o", "UserKnownHostsFile=/dev/null", "-o", "StrictHostKeyChecking=no", "peontest@birch.rlogin", "tmux", "list-panes", "-aF", "\\#\\{session_name\\}\\\u001F\\#\\{session_created\\}\\\u001F\\#\\{pane_pid\\}"]
splitting line 'launched-by-ondemand-ff4e2080-bea3-4da1-ab19-3ff2ce377bab1616781871503005'
found session {:session_name=>"launched-by-ondemand-ff4e2080-bea3-4da1-ab19-3ff2ce377bab", :session_created=>"1616781871", :session_pid=>"503005", :destination_host=>"birch.rlogin", :id=>"launched-by-ondemand-ff4e2080-bea3-4da1-ab19-3ff2ce377bab@birch.rlogin"}

I added puts "ssh command #{cmd}" to see the ssh command being run.

So I have done some more testing. I spun up a couple of VMs to act as test compute nodes, one CentOS 7 and the other CentOS 8, and installed the minimum needed. On the CentOS 7 “compute” node, sessions work as I would expect. On the CentOS 8 node, I see the same behavior as before: the session immediately comes back as “completed.” I have not been able to figure out why yet. So far, the only difference I have found is that CentOS 8 runs tmux 2.7, which changes the error message given when there is no tmux session to “error connecting to /tmp//tmux-<uid>/default (No such file or directory)”. I don’t think this is what is causing the problem, though. I modified the code to ignore this error as well and still see the same behavior.

Chris

Thanks, I’d guess there’s something to that! I’ll check if I can get a higher version of tmux to test with here. If you can hack away and add puts statements all over the place and maybe see where it’s being dropped that’d be great.

OK, I built tmux 2.7 and was able to replicate this issue. Apparently it doesn’t like that field separator and prints underscores instead. So even though we ask for one separator, tmux prints another, and the output can’t be parsed. I’m submitting a patch shortly, but you should be able to edit this line and use a comma instead. I need to sort out what will be most reliable, but offhand I don’t believe bash (or any other shell) interprets commas as anything important.
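
As an illustration of the comma workaround, here is a sketch of a parser that also fails loudly when the separator was not honored, instead of quietly producing a session that never matches. This is not the actual patch: parse_tmux_line, SeparatorError and both sample lines are hypothetical, and the underscored line only imitates the behavior described above.

```ruby
class SeparatorError < StandardError; end

FIELD_KEYS = [:session_name, :session_created, :session_pid].freeze

# Split a line on the expected separator and insist on exactly three fields.
def parse_tmux_line(line, separator: ',')
  fields = line.split(separator, -1)
  unless fields.size == FIELD_KEYS.size
    raise SeparatorError,
          "expected #{FIELD_KEYS.size} fields, got #{fields.size} in #{line.inspect}"
  end
  FIELD_KEYS.zip(fields).to_h
end

comma_line      = 'launched-by-ondemand-abc,1616781871,503005'
underscore_line = 'launched-by-ondemand-abc_1616781871_503005'

parsed = parse_tmux_line(comma_line)
puts parsed.inspect

begin
  parse_tmux_line(underscore_line)
rescue SeparatorError => e
  puts "could not parse: #{e.message}"
end
```

A comma works here only because the session names in this thread contain letters, digits and hyphens; a name containing a comma would break the field count.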

Jeff,

Thanks for taking time to look into this. I can confirm that changing the separator to “,” fixes the issue I’m seeing with launching sessions on CentOS 8. I had created this simple ruby script for debugging, but I still didn’t see that issue:

require 'ood_core'
require 'ood_core/job/adapters/linux_host'

local = OodCore::Job::Adapters::LinuxHost::Launcher.new(
                  debug: true,
                  singularity_bin: '/usr/bin/singularity',
                  singularity_image: '/home/shared/ood/centos8.simg',
                  singularity_bindpath: '/etc,/home,/opt,/run,/srv,/var,/usr',
                  ssh_hosts: ['test1.rlogin'],
                  submit_host: 'test1.rlogin',
                  strict_host_checking: false,
                  tmux_bin: '/usr/bin/tmux'
                )
local.instance_variable_set(:@username, 'peontest')
script = OodCore::Job::Script.new(
                  content: 'nc -l -p 6001',
                  wall_time: 300,
                  output_path: '/tmp/output.log',
                  error_path: '/tmp/output.log',
                )
id = local.start_remote_session(script)
puts "Created session #{id}"
sessions = local.list_remote_sessions
puts "Sessions #{sessions}"

There is one more thing that I’ll need to patch, which is the new language/text for the given error messages - so that may cause some issues for you in the interim.
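
One way to tolerate differently worded "no tmux server" errors is to match against a list of known phrasings. The sketch below is an assumption-laden illustration, not the ood_core patch: only the "error connecting to" wording comes from this thread, the other two patterns are my assumption about what other tmux releases print, and no_tmux_server? is a hypothetical helper.

```ruby
NO_SERVER_PATTERNS = [
  /failed to connect to server/, # assumed wording from older tmux releases
  /no server running on/,        # assumed wording from other tmux releases
  /error connecting to .+ \(No such file or directory\)/ # tmux 2.7, as quoted in this thread
].freeze

# True when the stderr text looks like "tmux has no server (no sessions)",
# which should be treated as an empty session list rather than a failure.
def no_tmux_server?(stderr_text)
  NO_SERVER_PATTERNS.any? { |pattern| stderr_text.match?(pattern) }
end

# The uid 1000 in this path is a made-up example value.
puts no_tmux_server?('error connecting to /tmp//tmux-1000/default (No such file or directory)')
puts no_tmux_server?('Permission denied (publickey).')
```

Anything that does not match one of the patterns should still be raised, so that real ssh or tmux failures are not hidden.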

Jeff,

Thanks for all your help. The VNC timeout issue was due to an auth redirect (redirects back to https instead of wss). I was able to get around this by changing the authentication scope. I now have VNC sessions working with linuxhost adapter. My only other question is about tmux sessions, but I will create a new thread for that.