Multiple SLURM clusters and OnDemand

Hi everyone,

We are a PBS shop and are starting to play around with SLURM on our clusters. We have everything working fine with one cluster, but now that we are adding another one I’m not sure what we need to do to get it to work. We have the two clusters set up independently, so each has its own slurmdbd, slurmctld, etc. We set up the clusters.d files to point to the appropriate slurm.conf files, but when we go to submit an interactive app it says it has an unrecognized cluster id.

Do we need to set up SLURM in a multi-cluster configuration in order to get this to work? Is there another way to set this up without that? We noticed a few other tickets talking about this, but they were not exactly our issue. This is most likely due to our inexperience with SLURM, so any help would be appreciated.

Please let me know what files/information would be helpful in answering this question and I’ll get it added.



@msgambati-INL This should be a quick fix! Can you share your cluster configuration file for your second cluster in clusters.d/?

Here you go:

# /etc/ood/config/clusters.d/[cluster_1].yml
v2:
  metadata:
    title: "[cluster_1]"
    url: "http://[hostname_1]/hardware/[cluster_1]"
    hidden: false
  login:
    host: "[hostname_2]"
  job:
    adapter: "slurm"
    host: "[hostname_3]"
    bin: "/opt/slurm/bin"
    conf: "/opt/slurm/etc/slurm_[cluster_1].conf"
  acls:
  - adapter: "group"
    groups:
      - "[user_1]"
      - "[user_2]"
      - "[user_3]"
      - "[user_4]"
    type: "whitelist"
  batch_connect:
    basic:
      script_wrapper: |
        module purge
    vnc:
      script_wrapper: |
        module purge
        module use /apps/system/modulefiles
        module load ood_vnc

You may need v2.job.cluster in the clusters.d file. I think with that and the separate config files (that you have configured there) it may work.

When you’re in a shell session on that machine do you need to use the -M flag? That’s what v2.job.cluster will provide. I guess that’s my next question, what command args do you have to provide on the machine to submit jobs to either cluster?

When we add v2.job.cluster in the clusters.d file, we get the following error when submitting to the cluster via the interactive form:

sbatch: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:localhost:6819: Connection refused
sbatch: error: Sending PersistInit msg: Connection refused
sbatch: error: Sending PersistInit msg: Connection refused
sbatch: error: DBD_GET_CLUSTERS failure: Connection refused
sbatch: error: Problem talking to database
sbatch: error: There is a problem talking to the database: Connection refused.  Only local cluster communication is available, remove --cluster from your command line or contact your admin to resolve the problem.
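
The log shows sbatch trying to reach slurmdbd at localhost:6819 once --cluster is on the command line, so one thing worth checking is which accounting host the slurm.conf seen from the OnDemand server actually names. A minimal sketch of that check, assuming the usual "AccountingStorageHost=<host>" spelling in slurm.conf; accounting_host is a hypothetical helper, not a Slurm command, and the path you pass it is whatever conf file the cluster definition points at.

```shell
#!/bin/sh
# Hypothetical helper: print the AccountingStorageHost value set in a
# slurm.conf file, or "(unset)" when the file has no such line.
# Assumes the conventional "AccountingStorageHost=<host>" spelling.
accounting_host() {
  conf_file="$1"
  # Keep only the value; if the key appears more than once, take the last one.
  host="$(sed -n 's/^[[:space:]]*AccountingStorageHost[[:space:]]*=[[:space:]]*\([^[:space:]#]*\).*$/\1/p' "$conf_file" | tail -n 1)"
  if [ -n "$host" ]; then
    printf '%s\n' "$host"
  else
    printf '(unset)\n'
  fi
}
```

Run it as accounting_host followed by the conf path for each cluster. If the result is "(unset)" or a host the OnDemand server cannot reach, that would be consistent with the "Connection refused" lines above, though other causes (firewall, slurmdbd not running) are just as possible.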

If I’m on the cluster, I don’t need to use the -M flag. For example, while on a login node for the system we’re trying to submit to, the following command successfully submits an interactive job: srun -A <code> -n 16 -G 1 --time=0-03:00 --pty bash -i. If I try that on the OnDemand server via the command line, I get sent to the first SLURM cluster.

Using the -M flag from the command line on the OnDemand server, I get the error srun: error: Application launch failed: Communication connection failure.

Edit: The -M flag on the command line via the OnDemand server does work when I change the default slurm.conf to the correct slurm.conf for that cluster. If the default slurm.conf is the other cluster’s configuration, it doesn’t work, which makes sense. I’m not sure if there’s a flag in srun or sbatch to specify a config file.
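
On the config file question: rather than a command-line flag, the Slurm client commands read the SLURM_CONF environment variable for the location of the configuration file, so it can be set per command instead of swapping the default slurm.conf. A minimal sketch, assuming per-cluster files named like the conf path in the cluster file earlier in the thread; with_cluster, slurm_conf_for, and the cluster name cluster_b are hypothetical.

```shell
#!/bin/sh
# Hypothetical mapping from a cluster name to its slurm.conf path.
# The directory and naming pattern are assumptions modeled on the thread.
slurm_conf_for() {
  printf '/opt/slurm/etc/slurm_%s.conf\n' "$1"
}

# Run one command with SLURM_CONF pointing at the named cluster's file.
# For an external program such as sbatch or srun, the variable applies
# to that single invocation only.
with_cluster() {
  cluster_name="$1"
  shift
  SLURM_CONF="$(slurm_conf_for "$cluster_name")" "$@"
}
```

For example, with_cluster cluster_b sbatch job.sh would submit using cluster_b's configuration without touching the default slurm.conf on the OnDemand server.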