Launch app on a specific host, skipping the scheduler

Hi All,

I have a cluster with various fileservers that are sometimes used for compute instead of the general compute nodes in Slurm, and I would like to be able to launch Jupyter notebooks on them. I suppose I could put them in Slurm, but given the “linux_host” adapter, I was wondering if there is a way to do it without adding these fileservers to Slurm?

Yes, I think the linux_host adapter may suit this use case, because it’s meant for login nodes, or at least nodes that are not part of the general compute infrastructure.

It does depend on a few things like tmux and singularity on the destination server, so just keep that in mind.

@jeff.ohrstrom The “cluster” setting in an app’s form.yml is immutable from the form page, right? i.e., I cannot dynamically change its value with a form field/JavaScript?

No, you have a couple of options here in 1.8+.

Adding additional clusters to the cluster attribute will give you a dropdown, which you can interact with through JavaScript.

https://osc.github.io/ood-documentation/latest/app-development/interactive/form.html#configuring-which-cluster-to-submit-to
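
For example, a minimal form.yml along those lines might look like this (the cluster names are placeholders for whatever configs live in your clusters.d directory):

# form.yml (sketch; cluster names are placeholders)
cluster:
  - "my_slurm_cluster"
  - "fs01"
form:
  - bc_num_hours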

There’s a little complexity here though, given that different adapters expect different fields. That is, if you were to set up, say, a Slurm cluster A and a linux_host cluster B, they would want very different things in the native field of the submit.yml.erb. You can use the sketch below as a reference on how to toggle these. At one point we had a Slurm cluster and a Torque cluster at the same time, so we accessed this information through OodAppkit.clusters[cluster].job_config[:adapter] and submitted different native args based on that.
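
A minimal submit.yml.erb sketch of that toggle, assuming the selected cluster comes through as the cluster form value and that only the Slurm side needs native arguments (the job_name and --nodes flag are just example values):

<%-
  # Look up which adapter backs the cluster the user selected in the form
  adapter = OodAppkit.clusters[cluster].job_config[:adapter]
-%>
batch_connect:
  template: "basic"
script:
  job_name: "ondemand-jupyter"
<%- if adapter.to_s == "slurm" -%>
  # Slurm wants scheduler flags in native; the linux_host adapter does not
  native:
    - "--nodes=1"
<%- end -%>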


I think this is exactly what I needed. The key was that support for simply adding multiple clusters to the cluster attribute in the form.yml was already built in!

Thanks for the submit.yml.erb snippet as well; that is how I will handle the native bits when submitting to Slurm, and pass nothing when sending to the linux_host adapter!

@jeff.ohrstrom I am getting pretty far with this, but am now stuck: the apps launch correctly (I can see jupyter running as my user on the target host), and I can manually enter the URL so the reverse proxy works, but the cards in the portal go right to Completed with no Connect button.

I am guessing this is because the connector is failing to communicate with the process, but I cannot figure out what is blocking it. Any ideas?

Here’s a troubleshooting section. I’ve noticed similar behaviour, and it’s covered there under “it just exits immediately”. There are steps to debug, but as an off-the-top guess, I’d scrutinize the submit_host and the ssh_hosts. ssh_hosts should include any hostname the submit_host can resolve to in DNS.

https://osc.github.io/ood-documentation/latest/installation/resource-manager/linuxhost.html#troubleshooting
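
As a rough sketch, the relevant parts of a linux_host cluster config might look like this (hostnames are placeholders; the important bit is that ssh_hosts covers every name the node might resolve to or report itself as):

# /etc/ood/config/clusters.d/fs01.yml (sketch)
v2:
  metadata:
    title: "Fileserver fs01"
  login:
    host: "fs01.example.edu"
  job:
    adapter: "linux_host"
    submit_host: "fs01.example.edu"
    # any hostname the submit_host may resolve to
    ssh_hosts:
      - "fs01.example.edu"
      - "fs01"
    strict_host_checking: false
    tmux_bin: "/usr/bin/tmux"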

Hi @jeff.ohrstrom

Do I need to run an app intended for a linux host in a specific container, like the wiki states, by adding a singularity_container: /usr/local/modules/netbeans/netbeans_2019.sif line to the native override in the app’s submit.yml.erb?

No, I think we run a base centos:7 image and just mount in everything we need. We really use the container for process management more than anything else.

So you could either use a basic image and mount in what you need (like we do for code-server), or have a specific image that holds what you need and fewer bind mounts. Totally up to you.
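
In the cluster config, that approach corresponds to something like this (image and bind paths are placeholders):

# under v2: job: in the linux_host cluster config (sketch)
    singularity_bin: "/usr/bin/singularity"
    # a basic image, with the host paths Jupyter needs bind-mounted in
    singularity_image: "/opt/ood/linuxhost_adapter/centos_7.6.sif"
    singularity_bindpath: "/etc,/media,/mnt,/opt,/run,/srv,/usr,/var,/users"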

Ok cool.

Still trying to figure out why these jobs go straight to Completed even though the Singularity container and its internal process are running fine. I can even manually enter the URI and get to the JupyterLab instance that is running on the node.

I can’t quite find where in the code the state is determined. I see that ood_core/status.rb at master · OSC/ood_core · GitHub handles the state info for other parts to query, but I don’t see the logic that does the actual test.

Do you happen to know off the top of your head what is being tested/queried on the node to determine state? I’m guessing it’s checking the PID of the singularity command?

Thanks!

@jeff.ohrstrom I just noticed that the tmp.XXXXX_tmux script is being generated with timeout 0s ....., which I am guessing is why the job completes instantly. Do you know where this is being set? I notice that in the documentation (Configure LinuxHost Adapter (beta) — Open OnDemand 1.8.12 documentation) the example has a very large timeout set.

It looks like that timeout gets populated from the “site_timeout” setting in the cluster config yml, based on ood_core/launcher.rb at fc8c05badb329817a04437f1736f09d1519a239d · OSC/ood_core · GitHub

Any idea what might be hard coding this to “0s”?

Looks like site_timeout is defaulting to 0, which is wrong; I’ve filed a bug for it. You should set site_timeout to something (7200, as in the example config, is 2 hours).
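
That is, under v2: job: in the linux_host cluster config:

    # seconds; 7200 = 2 hours, as in the documented example
    site_timeout: 7200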

Yeah, that’s what I have it set to, just like the example. I wonder if something in my config is causing it to ignore that variable in the yml.

OK - there could be a second bug here. Are you submitting the job with any walltime? You may need to do that.

I have not set anything for a wall time so no. Should I just set a “walltime: x” in the script → native section of the submit.yml?

You should be able to use bc_num_hours.

Or if you have walltime in the form, you can use it like this:

script:
  wall_time: <%= walltime %>
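
Note that wall_time is in seconds, so the sketch below assumes a walltime form field given in hours and converts it; adjust if your field is already in seconds:

script:
  wall_time: <%= walltime.to_i * 3600 %>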

Can confirm this fixes the 0s timeout. Now onto debugging why tmux is crashing instantly with no output haha

Hmm, I might have spoken too soon. I had an issue in my config that prevented Singularity from starting. The timeout is still fixed, but the job still goes right to Completed, even though everything appears to start and run on the target host.

Ok, so at this point the app goes to Completed immediately, but if I manually change the URL to the proxy path pointing at the target host, I can get there…

What version of OnDemand are you on, and what are the wrong and right URLs?