Slurm job fails in OOD but works at CLI

cupdike · May 27, 2020, 7:12pm

OnDemand version: v1.6.22 | Dashboard version: v1.35.3

My sbatch job works fine on CLI:

#!/bin/bash
#SBATCH -N 2
#SBATCH --ntasks-per-node=1 # Num MPI process per node
#SBATCH --output=out.%j.txt
#SBATCH --error=err.%j.txt
#SBATCH --gres=gpu:1

NCCL_SOCKET_IFNAME=eth0 \
    mpirun \
    -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
    -x NCCL_SOCKET_IFNAME \
    --mca btl tcp,self \
    --mca pml ob1 \
    --mca btl_tcp_if_include "eth0" \
    --mca mpi_show_mca_params all \
    singularity exec /mnt/shared/images/singularity/nc-tensorflow.sif python /home/updikca1/Horovod.MNIST.orig.py

But it fails when launched from the UI (sys/myjobs/workflows). The error is:

--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--------------------------------------------------------------------------

How can I go about debugging this?

jeff.ohrstrom · May 28, 2020, 3:41pm

strace maybe? When you say you can run it from the CLI, is that on the same host? Are you submitting from the server that runs OOD, or from some other login host? That could be your discrepancy: there may actually be a networking difference between the two.

It could also be the environment. OOD defaults to SBATCH_EXPORT=NONE so that the job loads a brand new environment. I would add an env statement to the script to see if there’s something you’re missing (LD_LIBRARY_PATH, for example).
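
A minimal sketch of that env check, assuming it is added to the job script just above the mpirun line. The env.<jobid>.txt file name is made up for illustration; SLURM_JOB_ID is set by Slurm inside a job, and the snippet falls back to a placeholder when it is run elsewhere.

```bash
# Debugging sketch: record the environment the job actually sees.
# The output file name is an arbitrary choice for illustration.
env_dump="env.${SLURM_JOB_ID:-nojobid}.txt"
env | sort > "$env_dump"

# Show the two variables that the mpirun line forwards with -x.
grep -E '^(PATH|LD_LIBRARY_PATH)=' "$env_dump" || echo "PATH/LD_LIBRARY_PATH not set"
```

Submitting the script once from the CLI and once from OOD, then running diff on the two env files, should show which variables differ between the two launches.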

jeff.ohrstrom · September 23, 2020, 8:17pm

@cupdike were you able to figure out and/or resolve this error?

I’ll bet it’s due to OOD defaulting to SBATCH_EXPORT=NONE; I believe that has issues with parallel jobs. I wonder whether adding an #SBATCH --export=ALL directive to the script would override this?
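
For reference, a sketch of where that directive would sit in the header of the script above. Whether it takes effect is something to test, not a confirmed fix: the sbatch man page says that input environment variables such as SBATCH_EXPORT override options set inside the batch script, so the OOD default may still win. The echo line is an addition for illustration only, to make the result visible in the job output.

```bash
#!/bin/bash
#SBATCH -N 2
#SBATCH --ntasks-per-node=1
#SBATCH --export=ALL   # proposed: ask sbatch to propagate the submitting environment

# Report what the job sees, so a CLI run and an OOD run can be compared.
ld_path_report="LD_LIBRARY_PATH=${LD_LIBRARY_PATH:-<unset>}"
echo "$ld_path_report"
```

If the OOD run still reports an empty LD_LIBRARY_PATH with the directive in place, the environment variable is most likely taking precedence and the setting would need to change on the OOD side instead.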
