OOD portal with Slurm as a resource manager/two clusters

All,

Sorry to be slow to understand. We have two clusters that use Slurm as the resource manager, and we are working on adding the cluster configuration files now. I have imported the munge key from the first cluster onto the OnDemand portal server (running on a separate VM), verified it, and started the munge daemon.

First of all, since we are running two separate clusters, it looks like tying in the second one will require two munge keys and two separate munge daemons, each started with its own --key-file and --socket options (based on the Slurm documentation).

Following that, there is something I’m unclear on, and I want to make sure I understand how this is set up before I dive into anything else. Do I need to copy the exact same slurm.conf file from the cluster I am trying to tie in to /etc/slurm/slurm.conf on the OOD portal node? I have the Slurm binaries installed via rpmbuild on the OOD portal node. Other than that, completing the cluster config files, and being able to ssh into the login node for each cluster, that’s as far as I have gotten.

Not to make it more confusing, but there are two separate approaches you could use. In both, you’d still need two OOD cluster configurations.

In the first, you specify bin_overrides in the cluster config and point them at an ssh wrapper. The wrapper shells into the login node of the appropriate cluster and executes the Slurm commands there. This way you don’t have to worry so much (actually not at all) about Slurm and munge configuration on the web portal’s node. There’s a description of how to do that in this topic. You can also search for "ssh wrapper" on this site, because it’s come up before.

Note that users have to be able to ssh from the web node to the login node without being prompted for this to work.
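
For illustration, here is a minimal sketch of what such a wrapper could look like. It is not the wrapper from the linked topic: the login host name, the remote binary path, and the function names are all hypothetical, and you would typically keep one small script per overridden command (sbatch, squeue, and so on) per cluster.

```shell
#!/bin/sh
# Hypothetical values; replace them with your own login node and binary path.
LOGIN_HOST="${LOGIN_HOST:-login.cluster1.example.edu}"
REMOTE_BIN="${REMOTE_BIN:-/usr/bin/sbatch}"

# Single-quote one argument so the remote shell sees it unchanged.
quote_arg() {
  printf "'%s'" "$(printf '%s' "$1" | sed "s/'/'\\\\''/g")"
}

# Print the command line that should run on the login node.
remote_command() {
  printf '%s' "$REMOTE_BIN"
  for arg in "$@"; do
    printf ' '
    quote_arg "$arg"
  done
  printf '\n'
}

# Run the command over ssh. BatchMode=yes makes ssh fail instead of
# prompting, which matches the passwordless requirement noted above.
# ssh forwards stdin, so a job script piped to sbatch still arrives.
run_remote() {
  ssh -o BatchMode=yes "$LOGIN_HOST" "$(remote_command "$@")"
}

# When installed as the wrapper script, the last line would be:
#   run_remote "$@"
```

The bin_overrides entry for sbatch in that cluster's config would then point at wherever you install this script. Quoting every argument is a deliberate choice here, since job names and paths with spaces would otherwise be split by the remote shell.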

The second approach is what you’re thinking of and describing: one set of Slurm binaries, two munge daemons, and two slurm.conf files, each using its own daemon. I think you should be able to copy the slurm.conf from each cluster to the portal node and only have to modify AuthInfo in one or both of the configs. Since the daemons listen on different sockets, the configs will likely have to reflect that. Starting the daemons manually with CLI arguments would be very fragile. Using systemd would be a lot more robust, but you’d have to put work into ensuring each systemd unit (one per cluster daemon) is isolated from the other and always starts with the right configuration. It looks like there’s a SLURM_CONF environment variable you can use to select the config file.
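
As a rough sketch of how the pieces of this second approach fit together, the snippet below selects a per-cluster slurm.conf through Slurm's documented SLURM_CONF environment variable. The directory layout, cluster names, socket paths, and helper names are assumptions made up for the example, not something your site or OOD provides.

```shell
#!/bin/sh
# Sketch only: every path and cluster name below is hypothetical.
#
# Each cluster gets its own munge daemon, for example:
#   munged --key-file=/etc/munge/cluster2.key \
#          --socket=/run/munge-cluster2/munge.socket.2
# (each daemon will also want its own --pid-file, --log-file and --seed-file)
# and its own copy of slurm.conf that points at that socket:
#   AuthInfo=socket=/run/munge-cluster2/munge.socket.2

# Map a cluster name to its slurm.conf copy on the portal node.
slurm_conf_for() {
  printf '/etc/slurm/%s/slurm.conf\n' "$1"
}

# Run one Slurm client command with SLURM_CONF set for the chosen cluster.
with_cluster() {
  cluster="$1"
  shift
  SLURM_CONF="$(slurm_conf_for "$cluster")" "$@"
}

# Example: with_cluster cluster2 squeue -u "$USER"
```

Running something like `with_cluster cluster1 sinfo` and `with_cluster cluster2 sinfo` by hand is a quick way to check that each config reaches the right daemon before wiring anything into OOD.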

Having thought about this a little, the first approach seems a lot easier. The second option is probably viable, but you’d have to do it with automation. I imagine doing it by hand would be very hard and fragile, and in the end would cause a lot of pain.

Hope that helps!

Jeff, very helpful, thank you. That cleared things up for us. We have job submission working for each cluster with the ssh wrapper for now. It’s a great feature, and we look forward to customizing it.