Unable to remove/delete jobs in Job Composer with incorrect status

Under some circumstances, jobs that are no longer queued/running still appear in the job composer with the wrong status. I would like to remove these listings from the job composer, but I can’t find any way to do so. Clicking the “Delete” button attempts to stop the job running on the backend resource. However, since the job no longer exists on the backend, the request to cancel the job fails. So, I see this error on the portal:

Job failed to be destroyed: ["An error occurred when trying to stop jobs for simulation 12: "]

and this error is reported by the pun:

App 5652 output: [2020-05-04 11:42:51 -0700 ] ERROR “During update_status! call on job with pbsid 32872040 and id 1 a PBS::Error was thrown:”
App 7630 output: [2020-05-04 11:42:52 -0700 ] ERROR “An error occurred when trying to stop jobs for simulation 12:”

I have noticed this error a couple of different times. I have some jobs in “queued” and “running” states that I am unable to remove. I am able to reproduce the error with a “queued” job by following these steps:

  1. Submit a job through the job composer
  2. Before the job starts running on the backend resource, log in to the backend resource and cancel the job.
  3. The job in the job composer reports that it is “queued” (despite having been cancelled) and it cannot be deleted.

I was using OOD version 1.6.20 and the backend resource was a Slurm cluster for all these tests.

Is there something that I am missing? Is there some way to remove these jobs?

Seems like there are two bugs here.

  1. Job Composer not allowing you to remove a job stuck in a queued/running state
  2. ood_core adapter throwing an exception when trying to stop or check the status of a job that has already left the queue

What version of Slurm are you using? Besides the job exiting the queue before the Job Composer tries to access it, is there anything else special about the jobs that cause this problem?

Thanks for the response. The cluster is running Slurm version 14.11.11. There shouldn’t be anything special about the job. Here is the script that is getting submitted:

#!/bin/bash
# JOB HEADERS HERE
#SBATCH --time=05:00

echo "Hello World"

One custom part of my setup is that I followed the guide from the docs to set up bin_overrides. The override scripts execute the Slurm commands over ssh.

Could you share the bin_override scripts you are using?

Also, what does the output of squeue look like for a job id that was just cancelled, versus a job id that is completely invalid? Can you confirm that the bin_override squeue script produces the same output?

On a system using Slurm 19.05.5 I see this:

$ squeue -j 888888888
slurm_load_jobs error: Invalid job id specified
$ squeue -j 8031955
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
$
  • 8031955 is a valid id of a job I just cancelled
  • 888888888 is an invalid job id
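
To make the difference between those two outputs concrete, here is a minimal sketch of how a script could tell them apart. The function name `classify_squeue_output` and its three labels are hypothetical, and the matching is based only on the 19.05.5 output shown above; other Slurm versions may word these messages differently.

```python
def classify_squeue_output(text):
    """Classify the text printed by `squeue -j <id>` (hypothetical helper).

    Returns one of:
      'unknown_id'   - squeue reported an invalid job id
      'not_in_queue' - only the header row (or nothing) was printed
      'in_queue'     - at least one job row follows the header
    """
    if "Invalid job id" in text:
        return "unknown_id"

    # Drop blank lines so a trailing newline does not count as a job row.
    lines = [line for line in text.splitlines() if line.strip()]

    # Assumption: the default squeue format is used, so the header starts with JOBID.
    if lines and lines[0].split()[0] == "JOBID":
        lines = lines[1:]

    return "in_queue" if lines else "not_in_queue"
```

With the outputs above, the 888888888 case would be labelled 'unknown_id' and the 8031955 case 'not_in_queue'.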

Is there a particular reason you are still on SLURM 14.x? That version has been EOL for many years.

I was working on some other configuration that changed how the remote user was mapped to a system user, and now I am unable to replicate the error that I was seeing before. Previously, the system user on the OOD node had a different username than the user that the Slurm commands were being run as on the Slurm cluster (I am using ssh to run the Slurm commands). My guess is that some of the bin_override scripts were not handling this correctly. However, I’ve now matched usernames on both systems and things seem to be working.

Now, the job status gets properly updated when the job is cancelled and I am able to delete the jobs from the job composer. Thanks for taking the time to help me out!

Although the problem is resolved now, here is the output from my squeue:

[tbpetersen@comet-ln3 ~]$ squeue -j 33172502
slurm_load_jobs error: Invalid job id specified
[tbpetersen@comet-ln3 ~]$ squeue -j 88888888888
squeue: error: Invalid job id: 88888888888

where 33172502 was a recently cancelled job and 88888888888 is an invalid job id.

Here is the bin_override for squeue:

#!/usr/bin/python

from getpass import getuser
from select import select
from sh import ssh, ErrorReturnCode  # pip3 install sh
import os
import re
import sys
import syslog

SUBMISSION_NODE = 'comet.sdsc.edu'
USER = os.environ['USER']


def run_remote_bin(remote_bin_path, *argv):
  output = None

  try:
    result = ssh(
      SUBMISSION_NODE,
      '-q',
      '-oBatchMode=yes',  # ensure that SSH does not hang waiting for a password that will never be sent
      remote_bin_path,  # the real squeue on the remote
      *argv,  # any arguments that squeue should get
      _err_to_out=True  # merge stdout and stderr
    )

    output = result.stdout.decode('utf-8')
    syslog.syslog(syslog.LOG_INFO, output)
  except ErrorReturnCode as e:
    output = e.stdout.decode('utf-8')
    syslog.syslog(syslog.LOG_INFO, output)
    print(output)
    sys.exit(e.exit_code)

  return output

def filter_args(args):
  new_args = list(filter(lambda arg: arg != '--noconvert', args))
  return new_args

def main():
  output = run_remote_bin(
    '/bin/squeue',
    filter_args(sys.argv[1:])
  )

  print(output)

if __name__ == '__main__':
  main()

and for scancel:

#!/usr/bin/python

from getpass import getuser
from select import select
from sh import ssh, ErrorReturnCode  # pip3 install sh
import os
import re
import sys
import syslog


SUBMISSION_NODE = 'comet.sdsc.edu'
USER = os.environ['USER']


def run_remote_bin(remote_bin_path, *argv):
  output = None

  try:
    result = ssh(
      SUBMISSION_NODE,
      '-oBatchMode=yes',  # ensure that SSH does not hang waiting for a password that will never be sent
      remote_bin_path,  # the real scancel on the remote
      *argv,  # any arguments that scancel should get
      _err_to_out=True  # merge stdout and stderr
    )

    output = result.stdout.decode('utf-8')
    syslog.syslog(syslog.LOG_INFO, output)
  except ErrorReturnCode as e:
    output = e.stdout.decode('utf-8')
    syslog.syslog(syslog.LOG_INFO, output)
    print(output)
    sys.exit(e.exit_code)

  return output


def main():
  output = run_remote_bin(
    '/bin/scancel',
    sys.argv[1:]
  )

  print(output)

if __name__ == '__main__':
  main()

Thanks!

No reason in particular. I’m not involved in the maintenance of the cluster, so I can’t say for certain, but I would suspect it has to do with compatibility.

I started seeing this error again, so I don’t think that user mapping was causing the problem.

I think the problem was actually coming from the squeue bin_override script. When the script gets called on a job that does not exist, an error is thrown and the script ends up exiting with an error code. Once I changed the script to exit with a return code of 0 when an error was caught, all of the jobs that had been finished/cancelled and were stuck in a “queued” or “running” state got updated to be “completed”.

For clarity, I changed
sys.exit(e.exit_code)
to
sys.exit(0)

I’m not sure exactly how OOD uses the return code of the bin_override scripts (specifically the one for squeue), but I think that the example needs to either have a return code of 0 when an error is caught or have some additional logic added to handle the case when squeue is called on a job that no longer exists. Any thoughts on the problem/suggested solution?
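
As a rough idea of what the "additional logic" option could look like, here is a minimal sketch that only swallows the error when squeue says the job id is invalid, and passes every other failure through unchanged. The names `JOB_GONE_MARKER` and `exit_code_for` are made up for illustration, the marker string is copied from the squeue output I posted above for a cancelled job, and I have not confirmed that this matches how OOD expects the script to behave.

```python
# Text squeue printed on my cluster for a recently cancelled job.
# Assumption: other Slurm versions may use a different message.
JOB_GONE_MARKER = "Invalid job id specified"


def exit_code_for(remote_exit_code, output):
    """Pick the exit code the squeue override should return (hypothetical helper).

    A non-zero exit caused by a job that has already left the queue is
    reported as success; any other failure keeps its original exit code.
    """
    if remote_exit_code != 0 and JOB_GONE_MARKER in output:
        return 0
    return remote_exit_code


# In the override script, the except branch would then end with something like:
#   sys.exit(exit_code_for(e.exit_code, output))
```

Compared with a blanket sys.exit(0), this would still surface unrelated failures such as ssh connection problems.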