OOD server offload

My colleague attended the OOD BoF at SC19, where it was mentioned that you have a beta of OOD job offload to a standalone Linux server ready. We would like to try this on our systems. Would you be able to get us set up with that?

Thanks,
MC

Martin,

We’re calling the new feature the LinuxHost adapter. We just got a final thumbs up from our head systems architect today. Docs are in progress (GitHub), and we’ll have an RPM available tomorrow.

P.S. I just noticed that GitHub’s RST renderer doesn’t display our code sections properly; the raw file is viewable here: https://raw.githubusercontent.com/OSC/ood-documentation/linuxhostadapter/source/installation/resource-manager/linuxhost.rst

See https://osc.github.io/ood-documentation-test/linuxhostadapter/installation/resource-manager/linuxhost.html for a test build of the documentation in progress.

Martin,

The OnDemand latest repo now has a 1.7.4 build, which is the first build where the new adapter is available. You can try the latest by using https://yum.osc.edu/ondemand/latest/ondemand-release-web-latest-1-6.noarch.rpm. 1.7.4 has not been tested in prod at OSC, so we are not yet recommending it for production elsewhere, but if you want to try it out you are welcome to. Please let us know how it goes.

Hi guys, I have not had the courage to install the new OOD yet (to be safe I want to do it on a new VM, so I need to coordinate with our VM guys). But we have read the LinuxHost Adapter documentation and have a few questions about how we’d plug it into our existing infrastructure.

First, we are using a homegrown tool called Arbiter (https://dl.acm.org/citation.cfm?id=3333043) to limit abuse of interactive nodes, which uses user cgroups. To avoid conflicts with that tool, we are wondering if it would be possible to tell the LinuxHost Adapter not to enforce any limits, and leave that to third-party tools like Arbiter.

Second, we are wondering if the Adapter keeps any records of jobs running on the interactive node - in order to prevent too many jobs from running on a single node and OOMing it. We can enforce cgroup per user limits through OOD or Arbiter, but what if too many users get offloaded to a single server?

Thanks,
MC

Hi Martin,

The LinuxHost adapter does not enforce any resource limits other than a ‘wallclock’ limit which is simply a timeout command (link to source). We encourage user limits, but do not require them. OnDemand itself does not know anything about user resource limits.
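
As a rough illustration of what a wallclock limit of this kind amounts to, here is a minimal Ruby sketch. The helper name `with_wallclock_limit` is hypothetical and this is not the adapter's actual code; it only assumes that the GNU coreutils `timeout` command is what ends up in front of the user's command.

```ruby
# Hypothetical helper, not part of ood_core: prefix a command with
# `timeout` so that it is killed once the wallclock limit expires.
def with_wallclock_limit(command, seconds)
  # Integer() rejects non-numeric input instead of silently passing it on.
  ["timeout", "#{Integer(seconds)}s", *command]
end

limited = with_wallclock_limit(["sleep", "600"], 3600)
puts limited.join(" ")
```

Nothing here touches cgroups or any other resource limit, which is consistent with leaving those to a separate tool such as Arbiter.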

There is no feature in the adapter to manage the total number of jobs run on the interactive node; indeed this adapter is unusual in that when it lists all jobs it will only ever list $USER's jobs because of the way tmux works.

With the disclaimer that I am not a sysadmin or cgroups expert, I believe that there is a way to limit all resource utilization by the adapter: wrap the tmux or singularity executable that the adapter uses and target that executable for limits. See https://osc.github.io/ood-documentation/develop/installation/resource-manager/linuxhost.html#approach-2-libcgroup-cgroups.

P.S. I should also say that the linked section of documentation was written by a sysadmin: @tdockendorf.

Hi Morgan,

thanks, sounds good. That makes more sense now.

One other question. Is it possible to have more than one submit_host? And/or to select what particular ssh_host to use?

The reason for this question is that we have 8 interactive-only nodes, named frisco[1-8].chpc.utah.edu, that don’t face any cluster and which we use for interactive work. Each of these machines has different hardware specs, so users commonly request a specific machine (e.g. frisco8, which has more CPUs/memory if they need them).

Thanks,
MC

I didn’t anticipate that as a feature; I have an idea about how I could make the second part work.

Right now you could target nodes individually by giving them each a different cluster config, but that’s not an ideal solution because the Batch Connect framework does not yet* support changing the target cluster.

* I am not sure when we will support changing clusters from Batch Connect forms.
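
The one-cluster-config-per-node workaround could be sketched roughly as follows. This is only an illustration: the helper `linux_host_cluster_config` is hypothetical, the key names are taken from the LinuxHost adapter documentation linked earlier in the thread, and other settings that the documentation describes (tmux and Singularity paths, timeouts and so on) are deliberately left out.

```ruby
require "yaml"

# Hypothetical helper: build a minimal cluster config for one node.
# Only a subset of the documented LinuxHost adapter keys is shown.
def linux_host_cluster_config(host)
  {
    "v2" => {
      "metadata" => { "title" => host.split(".").first },
      "login" => { "host" => host },
      "job" => {
        "adapter" => "linux_host",
        "submit_host" => host,
        "ssh_hosts" => [host]
      }
    }
  }
end

# The frisco[1-8] hostnames come from the question above.
hosts = (1..8).map { |n| "frisco#{n}.chpc.utah.edu" }

# Map each would-be file name (e.g. frisco8.yml) to its config hash.
configs = hosts.to_h do |host|
  ["#{host.split('.').first}.yml", linux_host_cluster_config(host)]
end

configs.each do |filename, config|
  puts "# #{filename}"
  puts config.to_yaml
end
```

Each generated document would be saved as its own file under /etc/ood/config/clusters.d/, and every node would then show up as a separate cluster.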

Sounds good, I’ll mess with it once we get that VM set up - which may take a little while since we want to make it a part of a large project.

In either case, we would really love to have the choice of different target clusters in Batch Connect; we’re waiting for that to make the Interactive Apps interface simpler for the 6+ clusters we have. If you could prioritize this, that would be great.

Thanks,
MC

Hi Morgan,

one more question. How do you handle the authentication on the ssh_host going from the OOD server? It looks like it’s ssh keys, but, we’re not sure from the docs. And, if so, are they user or host based?

Thanks,
MC

Martin,

The adapter does indeed rely on SSH keys. At OSC we have host-based keys set up. I have opened an issue against our docs to clarify that.

I also put together a pull request to add an override for submit_host to the adapter. What it does is allow you to set a “native” attribute in a Batch Connect app’s submit.yml.erb and that would then send the job to a specific host. The overridden host must be in the ssh_hosts list.

If you want to demo that, a sudoer should create /etc/ood/config/dashboard/initializers/add_submit_host_override.rb with the following content:

require "ood_core"
require "ood_core/job/adapters/linux_host/launcher"

# Reopen the launcher so that a job can pick its own submit host.
class OodCore::Job::Adapters::LinuxHost::Launcher
  # @param script [OodCore::Job::Script] The script object defining the work
  # @return [String] The session id in the form "session_name@hostname"
  def start_remote_session(script)
    # Prefer native['submit_host_override'] when set; otherwise use submit_host
    override = script.native && script.native['submit_host_override']
    cmd = ssh_cmd(override || submit_host)

    session_name = unique_session_name
    output = call(*cmd, stdin: wrapped_script(script, session_name))
    hostname = output.strip

    "#{session_name}@#{hostname}"
  end
end

Assuming that your form control is named target_submit_host, your Batch Connect app’s submit.yml.erb would then add the native attribute like so:

batch_connect:
  # ...
script:
  # ...
  native:
    submit_host_override: <%= target_submit_host %>
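
To make the selection rule concrete, here is a stand-alone sketch of how an override could be resolved against the ssh_hosts list. The function `resolve_submit_host` is hypothetical and is not the adapter's code; in particular, raising an error for a host outside ssh_hosts is an assumption made for illustration, since the thread only says that the overridden host must be in that list.

```ruby
# Hypothetical stand-alone version of the rule described above.
def resolve_submit_host(native, default_host, ssh_hosts)
  # `native` may be nil (or not a Hash) when the app sets nothing.
  override = native.is_a?(Hash) ? native["submit_host_override"] : nil
  return default_host if override.nil? || override.to_s.empty?

  # Assumption: reject hosts that are not listed in ssh_hosts.
  unless ssh_hosts.include?(override)
    raise ArgumentError, "#{override} is not listed in ssh_hosts"
  end

  override
end

ssh_hosts = (1..8).map { |n| "frisco#{n}.chpc.utah.edu" }
default_host = "frisco1.chpc.utah.edu"

# What the rendered submit.yml.erb might hand over for a user who picked frisco8.
native = { "submit_host_override" => "frisco8.chpc.utah.edu" }
puts resolve_submit_host(native, default_host, ssh_hosts)
```

When the form field is left empty or no native attribute is present, the sketch falls back to the configured submit_host.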

Sounds good, thanks. I am planning to roll a new OOD server with the host adapter after the Holidays so I’ll be in touch once we have that. Happy Holidays.

This feature is part of the core library now, so, just FYI, you no longer need that initializer.

You can directly do something like this.

batch_connect:
  # ...
script:
  # ...
  native:
    submit_host_override: <%= target_submit_host %>

Jeff,

Should this submit_host_override be working in v1.8.12?

I’m trying it with the below in my submit.yml.erb and I’m getting “no implicit conversion of Hash into Array”.

batch_connect:
  # We use the basic web server template for generating the job script
  #
  # @note Do not change this unless you know what you are doing!
  template: "basic"

script:
  native:
    submit_host_override: "n0003.testbed0"

Thanks for any tips.
–Krishna.

This answer was for the linux_host adapter type. What kind of cluster are you trying to submit this with?

Yeah, I’m using the ‘slurm’ adapter in my cluster, so never mind. I incorrectly inferred that you might be saying that submit_host_override would work for the slurm adapter too. I will go ahead and start playing with the linux_host adapter in my deployment now.
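
For anyone who hits the same message: one plausible reading is that the adapter in use expects native to be a list of extra command-line arguments rather than a hash. That is an assumption about the cause, not something confirmed in this thread, but the sketch below shows how Ruby produces that kind of error when a Hash is appended to an Array. The flag name is made up for the example.

```ruby
# Assumption: the adapter builds an argument list and appends `native` to it.
args = ["--example-flag"]  # made-up flag, for illustration only
native = { "submit_host_override" => "n0003.testbed0" }

begin
  # Array#concat needs an Array; a Hash cannot be implicitly converted.
  args.concat(native)
rescue TypeError => e
  error_message = e.message
end

puts error_message
```

With the linux_host adapter the hash form of native is what the override expects, as in the submit.yml.erb examples above.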