LSF multi-cluster environment deleting panel

You can't use generic.yml. You will need three different configurations, one for each of the three sites. This is the only way to avoid collisions and fix the issue of jobs being killed on another site. Your first point is solvable; you just need site-specific configs. Multi-cluster is just too specific to LSF, and OOD needs to support schedulers in a generic way, which is why the filename is directly the cluster_id. That way apps like the dashboard can interact with schedulers without actually knowing their implementation.

For your second point, you won't need new apps, just app-specific configurations for each site. You may need a new submit.yml for a given app anyhow to specify the -clusters option. Just drop a new submit.yml (and form.yml) at /etc/ood/config/apps/my_app/submit.yml. You can see an example of this here, where we use the same underlying RStudio installation (with its default form.yml) for all our sites, but simply provide a different form & submit for this site specifically. With this you can override form.yml, submit.yml (and their .erb extensions) and form.js for each site instead of creating specific apps.
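
For example, a site-specific override on the rka site might look like this (just a sketch; my_app and the hard-coded cluster name are placeholders for your own):

# /etc/ood/config/apps/my_app/submit.yml on the rka site
---
script:
  native:
    - "-clusters"
    - "rka"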

Do you want to run and view rka jobs from the rid site? That increases the configs. In that case you'd need all 3 cluster configs at each site and separate apps (at least until 1.8; then you'd only need all 3 cluster configs and your apps could choose the cluster). I don't suspect this is what you want, but just in case, there it is.

The example you shared looks interesting.
The login node (where OOD runs) knows which cluster it belongs to (via an env variable). Would it be possible to access an env variable from Ruby so that the "cluster" value in the form.yml is set dynamically?
In your example, change from:

---
cluster: "owens"

to:

---
cluster: "<%= ENV['CLUSTER_NAME'] %>"

This would allow us to use a single file for all sites in our configuration.
Regarding the value in the submit.yml, I would like the user to select it in the form (web page) and then access the value from the file, like:

# An RKA app's submit.yml
script:
  native:
    - "-clusters"
    - "<%= cluster_name %>"

This should be feasible as well.
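
On the form side, I imagine something like this would feed that value in (just a sketch; the attribute name cluster_name and the widget are my own choices):

# The app's form.yml, letting the user pick the cluster
form:
  - cluster_name
attributes:
  cluster_name:
    widget: select
    label: "Cluster"
    options:
      - ["rka", "rka"]
      - ["rid", "rid"]
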
We can have multiple configurations on all sites, like:

/etc/ood/config/clusters.d/rka.yml
/etc/ood/config/clusters.d/rid.yml

No problem here either.
When a user runs an app from "rka", then "cluster_id" will be set to "rka", since this is the value that OOD will see in the form.yml (dynamically set).

We can try all of this and see if we run into issues, but my question is always the same:
If I start an app from "rka" and cluster_id is set to "rka", when I connect to "rid" and check the active sessions, what is OOD going to pass to the -m option of bjobs? Will it use "rka" from the "cluster_id" value?
If the answer is yes, I guess this setup should work for us.
If the answer is no, what does OOD pass to the -m option? This is still not clear to me: since we will have multiple cluster configuration files, how does OOD decide which one to use?

Whatever the v2.job.cluster of rka.yml is. If this attribute isn't populated, it won't pass any -m option. If rka.yml doesn't exist on rid, it won't do anything; it'll leave the file and the job alone. Without that file, it literally doesn't know how to interact with the rka cluster (it could be SLURM or PBSPro and so on).

The -m option is generated from the configuration file's v2.job.cluster parameter. If the file rka.yml has v2.job.cluster: rka, then jobs generated with cluster_id = rka (remember, cluster_id is the filename) will be queried with -m rka. If v2.job.cluster isn't populated, it doesn't use the -m option.
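
For reference, a minimal rka.yml for the LSF adapter might look something like this (a sketch; the title, host and LSF paths are placeholders for your own):

# /etc/ood/config/clusters.d/rka.yml  (the filename is the cluster_id)
---
v2:
  metadata:
    title: "RKA"
  login:
    host: "rka-login.example.com"
  job:
    adapter: "lsf"
    cluster: "rka"          # becomes the -m argument to bjobs
    bindir: "/path/to/lsf/bin"
    libdir: "/path/to/lsf/lib"
    envdir: "/path/to/lsf/conf"
    serverdir: "/path/to/lsf/etc"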

Here is the flow of finding a job and querying for it, so please refer back to it if needed: When OOD finds a job with cluster_id = rka (on any site) it will attempt to create an adapter for this cluster_id (rka). It looks for a file called rka.yml and tries to create the library. If it can't find the file, it will do nothing (nothing! it won't delete the job, won't query the scheduler; the flow just stops). If it does find the file rka.yml, it'll read the configuration and use it. Since this is LSF, it looks for v2.job.cluster to be populated. If it is, it'll use it as the -m argument when running bjobs. If not, it won't pass any -m argument.

As to using ENV variables, that's probably better as it's a bit simpler. If CLUSTER_NAME is already set you may be able to pass it through; here are docs on that. But PUNs start up with limited environments and they're started by root, so they don't really pick up the users' bash profile.

A simpler way may be to drop an OOD_CLUSTER_NAME=rka entry in /etc/ood/config/apps/dashboard/env and access it like this:

---
cluster: "<%= ENV['OOD_CLUSTER_NAME'] -%>"

Note, you can probably just reuse the env variable in the submit.yml.erb.

script:
  native:
    - "-clusters"
    - "<%= ENV['OOD_CLUSTER_NAME'] -%>"

I probably didn't express myself well.
We have 2 options here for the cluster configuration:

  1. Have one configuration per site:

    rka has:
    /etc/ood/config/clusters.d/rka.yml
    rid has:
    /etc/ood/config/clusters.d/rid.yml

  2. All configurations available on each site:

    Each site has both files:
    /etc/ood/config/clusters.d/rka.yml
    /etc/ood/config/clusters.d/rid.yml

If those are not the options we have, just stop me here :slight_smile:

Now, in case of option 1 there is just one file, so it is clear that this is the file where OOD will look for the "cluster" name.
This means that if I start an interactive app from "rka" and then connect to "rid", the session will be cleaned up (because rid will run bjobs -m rid and will not find the job started on rka).
Is this the behaviour?
In case of option 2, I will have two configuration files. When I start an interactive app from rka, it will use the cluster that I configured in form.yml and set cluster_id="rka".
But when I connect to rid, it will also have two configuration files (just like rka), so how does OOD decide whether to use rka.yml or rid.yml to get the "cluster" name?
That was my doubt.
As far as I understand so far, "cluster_id" is just set but never used with the -m option. The -m option comes only from the yml file in /etc/ood/config/clusters.d/, right?
I hope I got something right after all this discussion and that I managed to explain my doubts.

You’re fine. My explanation was a bit wrong, so I went ahead and tested option 1 and confirmed the behavior.

Here is the flow of finding a job and querying for it, so please refer back to it if needed:

When OOD finds a job with cluster_id = rka (on any site: rid, rka, my cluster at OSC, wherever) it will attempt to create an adapter for this cluster_id (rka). It looks for a file called rka.yml (because the cluster_id was rka. cluster_id is the filename; the filename is the cluster_id. This is true both when you create the job and when you go back to query for it.) and tries (tries!) to create the library.

(option 1)
If it can't find the cluster configuration file (say you've logged into rid and it can't find the rka configuration), it'll get confused and create a panel for this job in an "Undetermined State". It has a delete button, but it won't work and it says to contact support. OOD can't delete the job because it doesn't know how to: on rid it has no idea how to interact with the rka cluster, whether it's SLURM or Torque or whatever.

(option 2).
If it does find the file rka.yml it’ll read the configuration and use it.

(option 2 - bad)
Since this is LSF, it looks for v2.job.cluster to be populated. If it's not populated, it won't use the -m option. This is problematic because it can successfully execute the bjobs command and LSF says "that job doesn't exist" (because you end up querying rid for an rka job), so it deletes it.

(option 2 - good)
Since this is LSF, it looks for v2.job.cluster to be populated. If it is, it'll use it as the -m argument when running bjobs. If the rka.yml file has v2.job.cluster: "rka" it will submit a bjobs command with -m rka. This means you'll be able to view RKA jobs on RID.
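
Concretely, for option 2 both sites carry both files and each file names its own cluster (a sketch showing only the relevant keys):

# /etc/ood/config/clusters.d/rka.yml  (present on both rka and rid)
---
v2:
  job:
    adapter: "lsf"
    cluster: "rka"

# /etc/ood/config/clusters.d/rid.yml  (present on both rka and rid)
---
v2:
  job:
    adapter: "lsf"
    cluster: "rid"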

So option 2 is the one we have to go for: populate v2.job.cluster in the configuration yml and use "<%= ENV['OOD_CLUSTER_NAME'] %>" to tell OOD which cluster to run a job on (and set the "cluster_id" value accordingly).
This means that from any cluster we can decide where to run a job using the ENV variable, and from any cluster we will query the right cluster (and get the same job ID), so the session will not disappear anymore! That's great!!!
One last question: does this work in any OOD version, or do we need v1.8?
We are on version 1.6.x and we are going to move to v1.8 anyway, but I was wondering if we should apply this configuration (option 2) already now or wait and do it with the next release.
By the way, I guess we are done.
One last comment: I think it would be better to change the title of the thread to add "in multi-cluster environment" (if possible); this may be more useful in case anyone else runs into a similar situation in the future.
Thanks a lot for all your help!

This will work on your current version 1.6. Glad we got through it!