OnDemand with Slurm-based systems, sbatch?

So we are having an issue with submitting jobs from the Job Composer. Not all jobs, but a very select few. When we sbatch a job script from the command line it works fine, but from the Job Composer it fails with an odd error:
slurmstepd: error: execve(): magma: No such file or directory
srun: error: cn10: task 0: Exited with exit code 2

Magma is on the path because of a module load in the job script. The question I have is: does OnDemand use sbatch, or some other method, to submit the script?


Hello,

When using the Slurm adapter, OnDemand does indeed call sbatch.

May I see the job script you are trying to run?

Hello,

We've been able to narrow the issue down to srun not having access to the invoking environment. That is, running /bin/env within the batch file yields very different results from srun /bin/env; most notably, the PATH is empty when using srun.

After submitting this script with the Job Composer, srun’ing env will print only SLURM_* and a handful of related variables, without any PATH at all.

#!/bin/bash

echo "** Using /bin/bash..."
echo "** module load perl..."
module load perl

echo ""; echo "** srun /usr/bin/which perl..."
srun /usr/bin/which perl

echo ""; echo "** /bin/env..."
/bin/env

echo ""; echo ""; echo "** srun /bin/env..."
srun /bin/env

There are a lot of moving parts, and it took us quite a while debugging in various ways to get this going.

Here are some hints. First, in our environment, we try to pass the entire user environment through by default. The behavior of how this works changed some in a recent SLURM release. Currently, we have this as part of ‘submit.yml.erb’:

script:
# does-not-work --> job_environment: "ALL"
native:
    - "--export=ALL"

and this as part of ‘before.sh.erb’:

unset XDG_DATA_DIRS
unset XDG_RUNTIME_DIR
unset SBATCH_EXPORT
unset MAIL
unset PYTHONPATH
unset PYTHONUNBUFFERED

export LOGNAME=$(whoami)
export USERNAME=$LOGNAME
# mysteriously fails? (perm denied?)
# export XDG_RUNTIME_DIR=/run/user/$(id -u)

Probably more hacking as well, but this is what I see immediately.

Hope that helps.

Remember that in YAML formatting, types and indentation are key. Maybe those didn't work because of that?

script:
   job_environment:
      # job environment is a map of 'key: value'
      FOO: "bar"
      LOGNAME: $USER
   # native is under script
   native:
      - "--export=ALL"

Thank you both, michaelkarlcoleman and johrstrom; we’ll look into these suggestions.

I actually just stopped by to add a few additional pieces of information to this puzzle:

First, some background: this issue started for us after a maintenance day on which we upgraded both OOD and Slurm at the same time (Slurm 19.05.3-2; OOD 1.6.20/1.35.3).

According to the srun docs/manpage:

--export=<environment variables [ALL] | NONE>
Identify which environment variables are propagated to the launched application. By default, all are propagated.

Nonetheless, even though that implies this shouldn't be necessary, the environment variables do come back when --export=ALL is specified explicitly on each srun invocation ("srun --export=ALL ..."). (Additionally, all of this holds true whether or not there's an "#SBATCH --export=all" at the top of the script.)
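To make the shape of the workaround concrete, the test script ends up looking roughly like this (just a sketch based on the snippets above; perl is simply the module we happened to test with):

#!/bin/bash
#SBATCH --export=all   # present or absent, this directive alone did not restore the environment

module load perl

# without --export=ALL the launched task sees almost no environment (empty PATH)
srun /usr/bin/which perl

# explicitly re-exporting on each srun invocation is what brings the environment back
srun --export=ALL /usr/bin/which perl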

(I've not thoroughly picked apart our yaml/configs yet, since they worked with the previous version, but I did validate their YAML.)

OnDemand version: v1.6.20 | Dashboard version: v1.35.3
Slurm version: slurm 19.05.3-2
 ** module load perl ...

 ** srun /usr/bin/which perl ...
/usr/bin/which: no perl in ((null))
srun: error: cn63: task 0: Exited with exit code 1

 ** srun --export=ALL /usr/bin/which perl ...
/packages/perl/5.28.1/bin/perl

 ** /bin/env | grep -vc SLURM ...
 ** /bin/env | grep ^PATH ...
35
PATH=/packages/perl/5.28.1/bin:/packages/git/2.16.3/bin:<SNIP!>

 ** srun --export=ALL /bin/env | grep -vc SLURM ...
 ** srun --export=ALL /bin/env | grep ^PATH ...
40
PATH=/packages/perl/5.28.1/bin:/packages/git/2.16.3/bin:<SNIP!>

 ** srun /bin/env | grep -vc SLURM ...
 ** srun /bin/env | grep ^PATH ...
6

We do not have a submit.yml.erb or before.sh.erb set of files. Would it be possible to include the submit section in /etc/ood/config/clusters.d/my_cluster.yml?

I don't think there is a way to easily modify the Job Composer's submission arguments, though that does sound like a good idea.

If users are able to successfully submit the job using sbatch from the command line on a login node, but the same sbatch is failing from the web node, there is another approach you could take.

You can provide a wrapper script for sbatch that will ssh to the login node and execute sbatch there. Here is an example wrapper script and the associated overrides: https://github.com/puneet336/OOD-1.5_wrappers/tree/master/openondemand/1.5/wrappers/slurm/bin
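For reference, the shape of such a wrapper is roughly this (only a rough sketch; the login host name login01 and the ssh options are assumptions for illustration, and arguments containing spaces may need additional quoting):

#!/bin/bash
# sbatch wrapper sketch: run sbatch on a login node instead of the web node.
# Assumes passwordless (key- or host-based) SSH from the web node to "login01".
# ssh forwards stdin, so a job script piped to this wrapper still reaches the
# remote sbatch, and "$@" passes the original sbatch arguments through.
exec ssh -q -o BatchMode=yes login01 sbatch "$@"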

See https://osc.github.io/ood-documentation/master/installation/resource-manager/slurm.html. You would deploy your wrapper script for sbatch on the web node, for example as /usr/local/bin/sbatch_ssh_wrapper, and then modify the cluster config to use this for sbatch:

 job:
   adapter: "slurm"
   cluster: "my_cluster"
   bin: "/path/to/slurm/bin"
   conf: "/path/to/slurm.conf"
+  bin_overrides:
+    sbatch: "/usr/local/bin/sbatch_ssh_wrapper"

If you do this, it affects how all of OnDemand submits to that particular cluster, not just the Job Composer. Here is a relevant Discourse discussion: Question About Passing Environment Variables for PBS Job

Our sbatch script runs fine on the web node when sbatched from a terminal window (CLI) launched by OOD. It just has environment issues when submitted from the Job Composer. This job that I am using for testing used to run just fine from the Job Composer; something has changed with the latest release.