Error After Slurm Upgrade


#1

We just upgraded our Slurm back end to 17.02.11 and broke the OnDemand front end(s). Interactive Apps gives the error “ERROR: OodCore::JobAdapterError - sbatch: error: slurm_receive_msg: Zero Bytes were transmitted or received”.

Active Jobs doesn’t give an error but also doesn’t display any results.

This happens on our original production non-RPM-based install as well as on an RPM install (OOD 1.4) on a dev server. Oddly enough, another 1.4 RPM install works just fine. Any ideas?

The Slurm clients on all the submit nodes are still at 15.x, and everything worked before the back-end upgrade.


#2

“Oddly enough another 1.4 rpm install works just fine”

Which 1.4 RPM version failed and which one succeeded?


#3

Before the Slurm upgrade everything worked, but now:

  1. Failed: Original production server running OOD 1.2.
  2. Failed: Dev server running an OOD 1.4 RPM install.
    ondemand-1.4.10-1.el6.x86_64
    ondemand-release-web-1.4-1.el6.noarch
  3. Working: Sequestered production server running an OOD 1.4 RPM install.
    ondemand-release-web-1.4-1.el7.noarch
    ondemand-1.4.10-1.el7.x86_64

For 2 and 3, maybe it is because #2 is running RHEL 6 while #3 is running RHEL 7. If I can get the RHEL 6 box working, we may very well upgrade #1 and finally move it to an RPM install.

Let me know if this helps.


#4

I can replicate this outside of OOD with -M or --clusters, i.e. any command that specifies a cluster. What file can I edit in OOD to turn off the -M option? That would (hopefully) get us back up and running while I look at the larger Slurm issue. I could just be grasping at straws with this idea, but we shall see.
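
For the record, this is roughly what I mean by reproducing it outside of OOD; the cluster name here is just a placeholder:

```sh
# On a submit node with the 15.x client, any Slurm command that names a cluster
# fails against the upgraded 17.02 controller, e.g.:
sbatch -M mycluster --wrap "hostname"
# sbatch: error: slurm_receive_msg: Zero Bytes were transmitted or received

# Same story for query commands that target a cluster:
squeue --clusters=mycluster
```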


#5

In the cluster config, under the job section, if you omit cluster: then -M will no longer be used in the commands.
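
For example, a minimal cluster config sketch with the cluster: line left out (file name, title, host, and bin path below are placeholders, not your actual values):

```yaml
# /etc/ood/config/clusters.d/my_cluster.yml  (placeholder name)
---
v2:
  metadata:
    title: "My Cluster"
  login:
    host: "login.example.edu"
  job:
    adapter: "slurm"
    bin: "/usr/bin"
    # cluster: "my_cluster"   # omit (or comment out) this line so OOD stops adding -M/--clusters
```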


#6

Thanks tremendously for your help, that was it! Everything is working now.


#7

You are welcome! Is there something I could change to https://osc.github.io/ood-documentation/master/installation/resource-manager/slurm.html to better call attention to this?


#8

I would just specifically say something like “remove the cluster: entry on non-multi-cluster setups, otherwise it may cause errors depending on the version of OOD and the scheduler”. In my case, everything (fortunately) worked this entire time, up until today.