Multiple clusters from a single app

mcuma · June 4, 2020, 9:50pm

I’ll blame Matthew and Brandon from Idaho National Lab for this, but I am asking the OOD developers for help.

The issue is, INL has successfully modified their Desktop and Jupyter apps to include multiple clusters in a single app by adding an extra parameter to the app’s form.yml (or the desktop’s cluster.yml), like:
pbs_cluster:
  widget: select
  label: "Cluster"
  options:
    - ["Cluster1", "cluster1"]
    - ["Cluster2", "cluster2"]
…

and then feeding this pbs_cluster variable to the submit.yml.erb:
- "-q"
- "<%= pbs_queue %>@<%= pbs_cluster %>"

We use SLURM, so this PBS solution does not work for us. We do, however, use a single slurmdbd for all our clusters, so we can cross-submit jobs with the -M flag (sbatch -M cluster1 …).

So, I added into our setup a generic cluster that uses SLURM binaries that work across all our clusters, and use -M in the submit.yml.erb to direct the job to a specific cluster:
script:
  native:
    - "-M"
    - "<%= slurm_cluster %>"
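
To see what the fragment above turns into, here is a minimal Ruby sketch that renders the same template with the standard-library ERB and parses the result as YAML. The `render_native_args` helper and the value `cluster1` are illustrative assumptions; OnDemand does its own rendering with the submitted form values.

```ruby
require "erb"
require "yaml"

# Render a submit.yml.erb-style template with the given form values and
# return the list found under script.native.
# Hypothetical helper for illustration, not part of OnDemand.
def render_native_args(template, form_values)
  rendered = ERB.new(template).result_with_hash(form_values)
  YAML.safe_load(rendered).dig("script", "native")
end

template = <<~ERB_TEMPLATE
  script:
    native:
      - "-M"
      - "<%= slurm_cluster %>"
ERB_TEMPLATE

# slurm_cluster stands in for the value selected in the form.
p render_native_args(template, slurm_cluster: "cluster1")
# => ["-M", "cluster1"]
```

The two list entries end up as separate arguments to the submit command, which is why "-M" and the cluster name are given on separate lines.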

In the process, I discovered that OOD hard-codes the -M flag on line 279 of gems/ood_core-0.11.3/lib/ood_core/job/adapters/slurm.rb:
args += ["-M", cluster] if cluster

I tried commenting out the flag (since I feed it in through the submit.yml.erb). That does submit the job, and the app (desktop) starts on the compute node correctly, but OOD Interactive Sessions does not know about this job. I suspect this is because OOD queries SLURM behind the scenes about the job status, and since I removed the -M from the SLURM adapter, commands like squeue don’t get the appropriate cluster name.
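
The suspected effect can be illustrated with a simplified model. This is not ood_core’s actual code: `scheduler_command` is a hypothetical helper in which only the "-M" line mirrors the line quoted above, the job id is made up, and the assumption is that the native arguments from submit.yml.erb reach only the submit command.

```ruby
# Simplified, hypothetical model of how an adapter might build scheduler
# command lines. Only the "-M" line mirrors the ood_core line quoted above.
def scheduler_command(binary, cluster: nil, extra: [])
  args = [binary]
  args += ["-M", cluster] if cluster
  args + extra
end

# Unmodified adapter with the cluster configured as "cluster1".
p scheduler_command("sbatch", cluster: "cluster1")
# => ["sbatch", "-M", "cluster1"]

# Adapter flag removed; -M arrives through script.native instead.
p scheduler_command("sbatch", extra: ["-M", "cluster1"])
# => ["sbatch", "-M", "cluster1"]

# Status queries do not get the native arguments, so no -M is passed.
p scheduler_command("squeue", extra: ["-j", "12345"])
# => ["squeue", "-j", "12345"]
```

In this model the submission still reaches the right cluster, but the status query runs against the default cluster, which would match the session being invisible to OOD.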

Perhaps there could be a simple fix for this (rather than waiting for a future OOD release that should allow it), which is why I am asking for feedback.

I guess the simplest way would be to set the cluster variable in slurm.rb to the slurm_cluster variable that I define in the submit.yml.erb, but I don’t know the complexities of how all these things interact behind the scenes.

I appreciate any thoughts on this.

MC

jeff.ohrstrom · June 5, 2020, 9:32pm

I’m actually surprised this works for PBS. Maybe it works for PBS only because one binary can submit to and query multiple clusters? So the cluster_id in their database file (~/ondemand/data/sys/dashboard/batch_connect/db/) is incorrect, but qstat continues to search for that job id on other clusters, finds it, and returns the data to OOD? (I’m only guessing as to why it works for them.)

So, if you’ve configured your application as cluster: vulcan but were actually able to submit to the cluster romulus, you would need to somehow enable the binary configured in /etc/ood/config/clusters.d/vulcan.yml to query both the vulcan and romulus clusters, because OOD will use that cluster config to run squeue.

But when would it query both clusters? It seems like you could create a wrapper script that can interact with both clusters and pass it an environment variable that tells it which cluster to interact with (or which binary to use). In this wrapper script you could catch the -M option, modify it as you see fit, and use the appropriate binary.
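
A minimal sketch of such a wrapper, written in Ruby for illustration. Everything specific here is an assumption: the `OOD_TARGET_CLUSTER` variable name, the binary directories, and the choice to let an explicit -M win over the environment variable. It only handles the two-argument `-M <cluster>` form.

```ruby
# Build the real command line for a scheduler tool (sbatch, squeue, ...).
# env is a Hash-like object such as ENV; bin_dirs maps cluster names to the
# directory holding that cluster's binaries (placeholder paths).
def wrapper_command(tool, argv, env, bin_dirs)
  args = argv.dup
  cluster = env["OOD_TARGET_CLUSTER"] # hypothetical variable name
  if (i = args.index("-M")) && args[i + 1]
    # Catch "-M <cluster>", use it to pick the binary, and strip it.
    cluster = args[i + 1]
    args.slice!(i, 2)
  end
  bin_dir = bin_dirs.fetch(cluster) do
    raise ArgumentError, "unknown or missing cluster: #{cluster.inspect}"
  end
  [File.join(bin_dir, tool)] + args
end

bin_dirs = {
  "vulcan"  => "/opt/slurm-vulcan/bin",
  "romulus" => "/opt/slurm-romulus/bin"
}

p wrapper_command("squeue", ["-M", "romulus", "-j", "42"], {}, bin_dirs)
# => ["/opt/slurm-romulus/bin/squeue", "-j", "42"]
```

A real wrapper would finish by replacing itself with the chosen binary, for example `exec(*wrapper_command("squeue", ARGV, ENV, bin_dirs))`.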

These are my initial thoughts on it. I’m not sure how easy this endeavor would be, but I do know we’re adding this functionality to the next release, so it isn’t too far away.

dugan · June 8, 2020, 10:09am

We’re supporting multiple clusters with our apps but our use case is different. Our main
cluster is very busy, with many queues and a complex usage policy. A scheduler run
typically takes 30 - 60 seconds and occasionally much longer (say when someone
submits 10,000 jobs that fail immediately.) We have provisioned dedicated resources to
support a certain class of interactive ood jobs but if they went through the main
scheduler they would experience unpleasantly long startup times.

We created a second cluster to support this class of jobs. But we don’t want the users to
have to think about it. They should just be able to request whatever resources they want
and their job should be sent to the appropriate place automatically. So the app forms
just reference the main scc cluster. The redirection to the ood cluster happens in the
wrapper scripts. We use SGE. The SGE qsub command has an option to query if a job
request can be started immediately on a cluster. So the qsub wrapper asks the ood
cluster if it can run the job and submits it there if the answer is yes, otherwise it submits
to the scc cluster. In order for OnDemand to track the jobs in the ood cluster the qstat
wrapper checks to see if a job with the right job_id and user exists in the ood cluster. If
so, the qstat request goes there, otherwise it goes to the scc cluster. The qdel wrapper
is similar.
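
The routing rules described above can be sketched as follows. This is an illustration in Ruby, not the actual wrapper scripts: the two callables stand in for the SGE queries (the "can this start immediately" check and the lookup by job id and user), and the job data is made up.

```ruby
# Decide where a new job goes: the dedicated "ood" cluster if it can start
# the job immediately, otherwise the main "scc" cluster.
def submit_target(job_request, can_start_now)
  can_start_now.call("ood", job_request) ? "ood" : "scc"
end

# Decide which cluster a qstat or qdel request is sent to: "ood" if a job
# with this id and user exists there, otherwise "scc".
def status_target(job_id, user, job_exists)
  job_exists.call("ood", job_id, user) ? "ood" : "scc"
end

# Stand-ins for the real scheduler queries (hypothetical data).
ood_is_idle = ->(_cluster, _request) { true }
ood_jobs = [[501, "alice"]]
job_exists = ->(_cluster, job_id, user) { ood_jobs.include?([job_id, user]) }

p submit_target({ cores: 1 }, ood_is_idle)  # => "ood"
p status_target(501, "alice", job_exists)   # => "ood"
p status_target(777, "alice", job_exists)   # => "scc"
```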

This has been working without issue since last September but in a few months the job
ids in the scc cluster will roll over and then catch up with the job ids in the ood cluster.
So it will be possible for the same user to have jobs with the same job id in both clusters
and OnDemand won’t be able to track the one in the scc cluster. I think the probability of
this happening is low but we’ll see. I hope the future multi-cluster support will allow me
to eliminate this possibility. I would just need to be able to modify the cluster specified in
the form after the form is submitted but before the actual job submission occurs.
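
The collision can be made concrete with a small Ruby sketch. The job tables and the `lookup_cluster` helper are hypothetical; the rule is the one described for the qstat wrapper, which prefers the ood cluster whenever it has a job with the given id and user.

```ruby
# Hypothetical job tables: cluster name => list of [job_id, user] pairs.
jobs = {
  "ood" => [[1001, "alice"]],
  "scc" => [[1001, "alice"], [1002, "alice"]]
}

# Prefer the ood cluster whenever it has a matching job id and user.
def lookup_cluster(jobs, job_id, user)
  jobs.fetch("ood").include?([job_id, user]) ? "ood" : "scc"
end

p lookup_cluster(jobs, 1002, "alice")  # => "scc"
p lookup_cluster(jobs, 1001, "alice")  # => "ood"
```

With id 1001 present in both clusters for the same user, every query for 1001 is answered by the ood cluster, so the scc job with that id cannot be tracked.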

mcuma · June 8, 2020, 6:54pm

Thanks Jeff and Mike,

I think we’re stuck with SLURM because of the hard-coded “-M cluster” option. Since the multi-cluster support is not too far out, I’ll wait until it’s officially supported, and I am looking forward to that.

MC

dugan · August 25, 2020, 12:35pm

Hi,

I just wanted to report that I have upgraded to 1.8 and the new multi-cluster support works perfectly for my use case. Thanks!

jeff.ohrstrom · August 25, 2020, 12:45pm

Thanks! But please let us know what use cases you need to accommodate.

Here’s one of ours as an example: different modules are available on different clusters. So we have to hide/show select options for version (the module versions) depending on which cluster is chosen.

So we’ve written some JavaScript to handle this, but we’d like to share it as helpers so other folks can easily do the same. We just have to get a sense of what we need to cover (or at least what we could provide that would cover a lot).

dugan · August 25, 2020, 1:23pm

Jeff, I explained my use case above on Jun 8. The ability to choose the cluster based on the
form input is all I need. The kludge I had previously implemented is now done with a couple of lines of ruby. Thanks.
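
The actual lines are not shown in the thread. As a guess at the general shape only, a rule that picks a cluster from form values could look like the following Ruby sketch, where the field names and thresholds are invented for illustration and only the cluster names "ood" and "scc" come from the earlier post.

```ruby
# Hypothetical rule: small, short interactive requests go to the dedicated
# "ood" cluster, everything else goes to the main "scc" cluster.
# The field names and thresholds are invented for illustration.
def choose_cluster(num_cores:, hours:)
  num_cores <= 4 && hours <= 12 ? "ood" : "scc"
end

p choose_cluster(num_cores: 2, hours: 4)   # => "ood"
p choose_cluster(num_cores: 28, hours: 4)  # => "scc"
```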

mcuma · August 26, 2020, 10:20pm

Hi Jeff, I just got OOD 1.8 installed on our test instance and am going through the apps to modify them to support multiple clusters.

It would be nice to make the form.yml dynamic so that it would, e.g., display different attribute help for a different cluster. Is that possible, e.g. with the JavaScript that you mention?

If it is, it’d be nice to have an example. I am still wrapping my head around which pieces around the form.yml are used to render the job parameters webpage. Perhaps documenting the workflow of what happens when the webpage gets generated would help in understanding the process and the pieces that contribute to it.

Thanks,
MC

jeff.ohrstrom · April 29, 2021, 3:24pm

Hey, sorry for the delay. You can see our Jupyter deployment for example JavaScript showing how we toggle the CUDA option or which nodes are available.

github.com

OSC/bc_osc_jupyter/blob/master/form.js

'use strict'

/**
 * Clamp between two numbers
 *
 * @param      {number}  min     The minimum
 * @param      {number}  max     The maximum
 * @param      {number}  val     The value to clamp
 */
function clamp(min, max, val) {
  return Math.min(max, Math.max(min, val));
}

/**
 * Simple helper to return the capitalized version of the
 * current select cluster (i.e., Owens and Pitzer).
 */
function current_cluster_capitalized(){
  var cluster = $('#batch_connect_session_context_cluster').val();
  return capitalize_words(cluster);

This file has been truncated.

You can watch this ticket and/or comment on it for this feature. But we would like to add this into the core distribution so it becomes easier for admins to enable this type of interactivity.

github.com/OSC/ondemand

Add javascript helpers for commonly used patterns in batch connect apps

opened 03:23PM - 19 Jan 21 UTC

johrstrom

area/installation component/batch_connect

I know we've talked about this, but I don't know if there's any ticket for the actual feature. Looks like we worked on #513 but decided against it, though I'm not sure if we decided against the approach in the PR or the idea altogether.

We have a similar approach in our own BC apps, so maybe our jquery javascript approach may be mature enough.

I think the result could be something like setting for node types.

```yaml
options:
  - [
      "any", "any",
      data-min-ppn-for-cluster-owens: 1,
      data-max-ppn-for-cluster-owens: 28,
      data-min-ppn-for-cluster-pitzer: 1,
      data-max-ppn-for-cluster-pitzer: 48,
      data-option-for-cluster-owens: true,
      data-option-for-cluster-pitzer: true
    ]
```

`data-option-for-<ATTRIBUTE>-<ATTRIBUTE VALUE>` - would register a handler and show/disable the given option or field when the attribute changed to the specific value. We do this now in our batch connect apps.

`data-[min,max]-<CORES ATTRIBUTE>-for-<CHANGE ATTRIBUTE>-<ATTRIBUTE VALUE>` - would register a handler to change the mins and maxes of `CORES ATTRIBUTE` when `CHANGE ATTRIBUTE` is set to `CHANGE ATTRIBUTE VALUE`.

These are common patterns we see all the time, needing to hide something based off of a selected value and changing mins and maxes.

Is this approach generic enough to pick up other use cases? maybe not, but maybe we don't need to worry about _every_ use case just yet. _This_ may provide a lot of value for folks.
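
To make the proposed naming scheme concrete, here is a small Ruby sketch that splits such attribute names into their parts. The regular expressions are one illustrative reading of the scheme in the issue, not code from OnDemand, and they assume the attribute value itself contains no hyphen.

```ruby
# Split a proposed data-attribute name into its parts.
# Returns nil for names that follow neither pattern.
def parse_data_attribute(name)
  if (m = name.match(/\Adata-(min|max)-(.+)-for-(.+)-([^-]+)\z/))
    # data-[min,max]-<CORES ATTRIBUTE>-for-<CHANGE ATTRIBUTE>-<ATTRIBUTE VALUE>
    { kind: m[1].to_sym, target: m[2], when_attribute: m[3], when_value: m[4] }
  elsif (m = name.match(/\Adata-option-for-(.+)-([^-]+)\z/))
    # data-option-for-<ATTRIBUTE>-<ATTRIBUTE VALUE>
    { kind: :option, when_attribute: m[1], when_value: m[2] }
  end
end

p parse_data_attribute("data-max-ppn-for-cluster-pitzer")
p parse_data_attribute("data-option-for-cluster-owens")
```

The first call yields kind `:max`, target `ppn`, attribute `cluster`, and value `pitzer`; the second yields kind `:option` for attribute `cluster` and value `owens`.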

efranz · April 29, 2021, 3:29pm

@mcuma please reach out to Jeff and me directly via email about this. We could help you with the JavaScript that is needed in the short term, but after the OOD 2.0 stable release we would like to prioritize extending the form.yml DSL to better support these types of cases, so understanding your use case specifics may help us build the right extension.

mcuma · April 29, 2021, 4:04pm

I did not make any progress on this. The multiple-cluster setting works OK, and users are more or less used to not using the advanced settings on clusters where they don’t apply.

That said, let me look at 2.0 when it gets released next week and then write back on what dynamism would be nice to have from our standpoint.
