Error After Slurm Upgrade


#1

We just upgraded our Slurm back end to 17.02.11 and broke the OnDemand front end(s). Interactive Apps gives the error “ERROR: OodCore::JobAdapterError - sbatch: error: slurm_receive_msg: Zero Bytes were transmitted or received”.

Active Jobs doesn’t give an error but also doesn’t display any results.

This happens on our original production non-RPM-based install as well as on an RPM install (OOD 1.4) on a dev server. Oddly enough, another 1.4 RPM install works just fine. Any ideas?

The Slurm clients on all the submit nodes are still at 15.x, and everything worked before the back-end upgrade.


#2

“Oddly enough another 1.4 rpm install works just fine”

Which 1.4 RPM version failed and which one succeeded?


#3

Before the Slurm upgrade everything worked, but now:

  1. Failed: Original production server running OOD 1.2.
  2. Failed: Dev server running an OOD 1.4 RPM install.
    ondemand-1.4.10-1.el6.x86_64
    ondemand-release-web-1.4-1.el6.noarch
  3. Working: Sequestered production server running an OOD 1.4 RPM install.
    ondemand-release-web-1.4-1.el7.noarch
    ondemand-1.4.10-1.el7.x86_64

For 2 and 3, maybe it is because #2 is running RHEL 6 while #3 is running RHEL 7. If I can get the RHEL 6 box working, we may very well upgrade #1 and finally move it to an RPM install.

Let me know if this helps.


#4

I can replicate this outside of OOD with -M or --clusters, i.e. any command that specifies a cluster. What file can I edit in OOD to turn off the -M option? That would (hopefully) get us back up and running while I look at the larger Slurm issue. I could just be grasping at straws with this idea, but we shall see.
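
For the record, this is roughly what I mean by reproducing it outside of OOD; the cluster name here is just a placeholder:

```sh
# On a submit node with the 15.x client, any Slurm command that names a cluster
# fails against the upgraded 17.02 controller, e.g.:
sbatch -M mycluster --wrap "hostname"
# sbatch: error: slurm_receive_msg: Zero Bytes were transmitted or received

# Same story for query commands that target a cluster:
squeue --clusters=mycluster
```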


#5

In the cluster config, under the job section, if you omit cluster: then -M will no longer be used in the commands.
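
For example, a minimal cluster config sketch with the cluster: line left out (file name, title, host, and bin path below are placeholders, not your actual values):

```yaml
# /etc/ood/config/clusters.d/my_cluster.yml  (placeholder name)
---
v2:
  metadata:
    title: "My Cluster"
  login:
    host: "login.example.edu"
  job:
    adapter: "slurm"
    bin: "/usr/bin"
    # cluster: "my_cluster"   # omit (or comment out) this line so OOD stops adding -M/--clusters
```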


#6

Thanks tremendously for your help, that was it! Everything is working now.


#7

You are welcome! Is there something I could change to https://osc.github.io/ood-documentation/master/installation/resource-manager/slurm.html to better call attention to this?


#8

I would just specifically say something like “remove the cluster: entry on non-multi-cluster setups, otherwise it may cause errors depending on the version of OOD and the scheduler”. In my case, everything (fortunately) worked this entire time, up until today.