Multiple clusters (multiple Slurm schedulers) with separate LDAPs

Hi All,

I am looking to see if anyone has thought about this, or has suggestions as to the feasibility of this scenario.
Please let me know if this has been discussed in another thread that I have not come across yet!

Current cluster setup:

Slurm-scheduled cluster with a local LDAP server for user mapping. /home is NFS-mounted on all head nodes and compute nodes, as are the necessary Slurm configuration bits. OOD is as it should be: effectively another head node, with the same mounts as the rest of the head nodes and compute nodes. The only login method is an ssh key. All is working well.

Background on current OOD Authentication (this is a little bit hacked):

Because we have no password authentication, I have modified the Keycloak login page so that the only option is to authenticate with an SSO provider. In this case, the user is directed to either Globus or CILogon for authentication.

Once they authenticate with the SSO, Keycloak attempts to map the primary email from the SSO to the email associated with a username in the cluster’s LDAP (which is read-only to Keycloak). In most cases this is successful: the email that has been authenticated by the SSO is the same as the one that the user’s account in the cluster’s LDAP is set up with, so Keycloak can map the username, which it then provides to PAM on OOD. PAM is happy, and the user is logged in.

For those who noticed I did not explain what happens when the SSO email does not match the email in LDAP…

Answer: As a fallback, Keycloak makes a local account in Keycloak with the SSO-provided email as a username and sends that to PAM on OOD, which of course does not map to a real username, and authentication breaks… This can be a separate thread if there is interest or if I get a lot of dirty looks for this.

Essentially I am using Keycloak for nothing but its identity federation. I know what you’re thinking, but let’s not dwell on this, as it is outside the context of my actual question.

Background on second cluster:

Effectively identical in scheduling and authentication to the current cluster with OOD.

This second cluster has no shared components with the first cluster. Meaning:

  • It has its own Slurm scheduler
  • It has its own nfs mounted /home and slurm configs
  • It has its own LDAP…oh dear…

Correct me if I am wrong:
  • Multiple Slurm schedulers would not be an issue; just make sure the configs for each are accessible (mount the necessary NFS exports) and make another /etc/ood/config/clusters.d/ YAML file for cluster 2.
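
To make that concrete, a second cluster file might look something like the sketch below. This is only my assumption of how it could be laid out: the file name cluster2.yml, the login host, and the bin and conf paths are all placeholders, not values from either real cluster. The conf path would point at cluster 2's slurm.conf as it is mounted on the OOD host.

```yaml
# /etc/ood/config/clusters.d/cluster2.yml (hypothetical file name)
---
v2:
  metadata:
    title: "Cluster 2"
  login:
    host: "login.cluster2.example.edu"       # placeholder login host
  job:
    adapter: "slurm"
    bin: "/opt/slurm/bin"                     # placeholder: Slurm client binaries on the OOD host
    conf: "/mnt/cluster2/slurm/slurm.conf"    # placeholder: cluster 2's slurm.conf, NFS-mounted
```

The existing cluster 1 file would stay as it is; each file under clusters.d describes one cluster, so the two schedulers are kept apart by their separate conf paths.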

Actual Question/What I see as potential (or definite) issues:

  • Multiple /home directories. How might this work with the File Manager? Mount them with different names according to the cluster?
  • Multiple LDAPs. What if users have access to one cluster but not the other (i.e. are in one LDAP but not the other)? What if users have different usernames between the clusters…UIDs not matching? …oh my
  • If those are not game breaking, where would I start with Keycloak?

Thank you in advance for any advice! Even if it is “You are crazy just spin up a separate OOD for cluster 2…”

-Morgan

You could have the home directories from multiple clusters mounted, such as, for a user efranz, /fs/cluster1/efranz and /fs/cluster2/efranz. But then would $HOME point to the first, the second, or a third home directory, /home/efranz, that exists only on the web node?

The Files app’s tree is rooted at $HOME, so the tree won’t be useful while you are navigating the other home directory. This is a usability issue we want to address in a new version of the Files app. You could still make favorite links to both home directories in the Dashboard.

The Job Composer app wants to know the dataroot in which to store its sqlite3 database and job files. This is typically set to a subdirectory under the home directory. The problem with two separate /homes is that if the dataroot is set to a subdirectory under /fs/cluster1/efranz/, then when you go to submit a job to cluster2 the files will be copied to cluster1's home and will probably not be available on cluster2. One workaround would be to set the Job Composer's dataroot, for job data, to an NFS-mounted directory that happens to be shared between both clusters, such as a scratch space directory.

What if users have access to one cluster but not the other

If this could be determined by supplemental group membership, you could chgrp each cluster's config file to that cluster's supplemental group and chmod 640 the file, so that a user outside the group cannot read the config for a cluster they do not have access to.

There may be another way to add cluster authorization in OnDemand if this is not sufficient.

What if users have different usernames between the clusters…uid’s not matching

At this point you would need a separate OOD install for each cluster, since the UIDs of the processes on the web node need to be the same as on each cluster; otherwise the web node processes end up creating files with the wrong UID.

Basically, if you imagine that the host running OnDemand is a login node users can ssh to, you can probably answer all of your questions by thinking about it that way. If a user ssh-es to that host, creates a job script and associated input files, and uses the Slurm client libraries to submit jobs to cluster1 and cluster2, how does the user accomplish their work, addressing the questions you raise? If there is a user workflow that works, there is probably an OnDemand configuration that will also work (or if not, an OnDemand bug or usability issue to fix).

We do not currently support file-permission-based ACLs for the clusters. Adding it isn’t a heavy lift (https://github.com/OSC/ood_core/pull/152), but because it requires patching a library it should wait until a new release of OnDemand.

The current group-based access control for clusters is explicitly specified in the cluster config YAML. File-permission-based ACLs are not yet available. As an example, here is how OSC limits access to our oldest cluster, Ruby, to only members of the ruby group:

---
v2:
  metadata:
    title: "Ruby"
    url: "https://www.osc.edu/supercomputing/computing/ruby"
    hidden: false
  acls:
    - adapter: "group"
      groups:
        - "ruby"  # <- your groups go here
      type: "whitelist"
  login:
    host: "ruby.osc.edu"

Thank you efranz and rodgers.355!

With your thoughts, I think OOD could support this within reason. The /home directory is an issue, but the real deal-breaker is the multiple LDAPs.

Thank you for your input!

Morgan