I’ll blame Matthew and Brandon from Idaho National Lab for this, but I am asking the OOD developers for help.
The issue is that INL has successfully modified their Desktop and Jupyter apps to support multiple clusters in a single app by adding an extra parameter to the app’s form.yml (or the desktop’s cluster.yml), like
- ["Cluster1", "cluster1"]
- ["Cluster2", "cluster2"]
and then feeding this pbs_cluster variable to the submit.yml.erb:
- "<%= pbs_queue %>@<%= pbs_cluster %>"
We use SLURM, so this PBS solution does not work for us. However, we do use a single slurmdbd for all our clusters, so we can cross-submit jobs with the -M flag (sbatch -M cluster1 …).
So I added a generic cluster to our setup that uses SLURM binaries that work across all our clusters, and I use -M in the submit.yml.erb to direct the job to a specific cluster:
- "<%= slurm_cluster %>"
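To make the intent concrete, here is a minimal Ruby sketch of how such a template could render into sbatch arguments. The script/native nesting and the separate "-M" entry are my assumptions about the surrounding file (only the slurm_cluster line is quoted above), and native_args is a hypothetical helper, not part of OOD.

```ruby
require "erb"
require "yaml"

# Hypothetical submit.yml.erb contents. The "script: native:" nesting and the
# "-M" entry are assumptions; only the slurm_cluster line comes from the post.
TEMPLATE = <<~YAML
  ---
  script:
    native:
      - "-M"
      - "<%= slurm_cluster %>"
YAML

# Render the template for a given cluster name and return the native args list.
def native_args(slurm_cluster)
  rendered = ERB.new(TEMPLATE).result_with_hash(slurm_cluster: slurm_cluster)
  YAML.safe_load(rendered).dig("script", "native")
end

p native_args("cluster1")
```

With the form value "cluster1", the rendered native list would be passed along as extra sbatch arguments, i.e. the equivalent of sbatch -M cluster1 ….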
In the process, I discovered that OOD hard-codes the -M flag on line 279 of gems/ood_core-0.11.3/lib/ood_core/job/adapters/slurm.rb:
args += ["-M", cluster] if cluster
I tried commenting out that line (since I feed the flag in through the submit.yml.erb). That does submit the job, and the app (desktop) starts correctly on the compute node, but OOD Interactive Sessions does not know about the job. I suspect this is because OOD queries SLURM behind the scenes for the job status, and since I removed the -M from the SLURM adapter, commands like squeue no longer get the appropriate cluster name.
Perhaps there is a simple fix for this that we could use (rather than waiting for a future OOD release that should allow this), which is why I am asking for feedback.
I guess the simplest way would be to set the cluster variable in slurm.rb to the slurm_cluster variable that I define in the submit.yml.erb, but I don’t know the complexities of how all these things interact behind the scenes.
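Here is a toy Ruby model of what I suspect is happening. It is not the ood_core code: ToySlurm, cluster_args, submit_cmd, and status_cmd are hypothetical names, and I am assuming the adapter derives the -M flag for both submission and status queries from the same cluster setting.

```ruby
# Toy model of an adapter that adds "-M <cluster>" to every SLURM command.
# All names here are hypothetical; this is not how ood_core is structured.
class ToySlurm
  def initialize(cluster: nil)
    @cluster = cluster
  end

  # Mirrors the idea of: args += ["-M", cluster] if cluster
  def cluster_args
    @cluster ? ["-M", @cluster] : []
  end

  # Submission: adapter-level flags first, then whatever submit.yml.erb supplied.
  def submit_cmd(native: [])
    ["sbatch"] + cluster_args + native
  end

  # Status query: only adapter-level flags are available here.
  def status_cmd(job_id)
    ["squeue"] + cluster_args + ["-j", job_id.to_s]
  end
end

# With the adapter-level cluster removed, -M arrives only via the native args.
adapter = ToySlurm.new(cluster: nil)
p adapter.submit_cmd(native: ["-M", "cluster1"])
p adapter.status_cmd(42)
```

In this model the submit command still carries -M cluster1 because it comes from the form, but the status command has no cluster at all, so squeue would look on the default cluster and not find the job.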
I appreciate any thoughts on this.