Active Jobs not displaying any jobs in the queue

Open OnDemand 1.8.18
CentOS 7
Slurm

When I try to use the Active Jobs page, I am only presented with a single row where the jobs should be displayed; it shows null for the name, user, and queue fields, and none of the jobs in the queue are shown.

Contents of the cluster config file, /etc/ood/config/clusters.d/clustername.yml:

v2:
  metadata:
    title: "Cluster Title"
    url: "https://hostname.fqdn"
    hidden: false
  login:
    host: "hostname.cluster.fqdn"
  job:
    adapter: "slurm"
    cluster: "hostname"
    bin: "/bin"
    conf: "/etc/slurm/slurm.conf"
    copy_environment: true
  batch_connect:
    basic:
      script_wrapper: |
        module purge
        %s
      set_host: "host=$(hostname -A | awk '{print $1}')"

Where are the logs that I can examine for what is causing this failure or what am I missing from my config to have this work correctly?

-Saj-

Some more information to add: checking the Active Jobs page again today, I get the following block of error messages in the browser:

====================================================
No job details available.
/var/www/ood/apps/sys/activejobs/app/models/jobstatusdata.rb:124:in `initialize'
/var/www/ood/apps/sys/activejobs/app/models/jobstatusdata.rb:124:in `new'
/var/www/ood/apps/sys/activejobs/app/models/jobstatusdata.rb:124:in `extended_data_slurm'
/var/www/ood/apps/sys/activejobs/app/models/jobstatusdata.rb:44:in `initialize'
/var/www/ood/apps/sys/activejobs/app/controllers/jobs_controller.rb:78:in `new'
/var/www/ood/apps/sys/activejobs/app/controllers/jobs_controller.rb:78:in `get_job'
/var/www/ood/apps/sys/activejobs/app/controllers/jobs_controller.rb:31:in `block (2 levels) in json'
/opt/ood/ondemand/root/usr/share/gems/2.5/ondemand/1.8.18/gems/actionpack-5.2.4.4/lib/action_controller/metal/mime_responds.rb:203:in `respond_to'
/var/www/ood/apps/sys/activejobs/app/controllers/jobs_controller.rb:24:in `json'
/opt/ood/ondemand/root/usr/share/gems/2.5/ondemand/1.8.18/gems/actionpack-5.2.4.4/lib/action_controller/metal/basic_implicit_render.rb:6:in `send_action'
/opt/ood/ondemand/root/usr/share/gems/2.5/ondemand/1.8.18/gems/actionpack-5.2.4.4/lib/abstract_controller/base.rb:194:in `process_action'
/opt/ood/ondemand/root/usr/share/gems/2.5/ondemand/1.8.18/gems/actionpack-5.2.4.4/lib/action_controller/metal/rendering.rb:30:in `process_action'
/opt/ood/ondemand/root/usr/share/gems/2.5/ondemand/1.8.18/gems/actionpack-5.2.4.4/lib/abstract_controller/callbacks.rb:42:in `block in process_action'
/opt/ood/ondemand/root/usr/share/gems/2.5/ondemand/1.8.18/gems/activesupport-5.2.4.4/lib/active_support/callbacks.rb:132:in `run_callbacks'
/opt/ood/ondemand/root/usr/share/gems/2.5/ondemand/1.8.18/gems/actionpack-5.2.4.4/lib/abstract_controller/callbacks.rb:41:in `process_action'
/opt/ood/ondemand/root/usr/share/gems/2.5/ondemand/1.8.18/gems/actionpack-5.2.4.4/lib/action_controller/metal/rescue.rb:22:in `process_action'
/opt/ood/ondemand/root/usr/share/gems/2.5/ondemand/1.8.18/gems/actionpack-5.2.4.4/lib/action_controller/metal/instrumentation.rb:34:in `block in process_action'
/opt/ood/ondemand/root/usr/share/gems/2.5/ondemand/1.8.18/gems/activesupport-5.2.4.4/lib/active_support/notifications.rb:168:in `block in instrument'
/opt/ood/ondemand/root/usr/share/gems/2.5/ondemand/1.8.18/gems/activesupport-5.2.4.4/lib/active_support/notifications/instrumenter.rb:23:in `instrument'
/opt/ood/ondemand/root/usr/share/gems/2.5/ondemand/1.8.18/gems/activesupport-5.2.4.4/lib/active_support/notifications.rb:168:in `instrument'
/opt/ood/ondemand/root/usr/share/gems/2.5/ondemand/1.8.18/gems/actionpack-5.2.4.4/lib/action_controller/metal/instrumentation.rb:32:in `process_action'
/opt/ood/ondemand/root/usr/share/gems/2.5/ondemand/1.8.18/gems/actionpack-5.2.4.4/lib/action_controller/metal/params_wrapper.rb:256:in `process_action'
/opt/ood/ondemand/root/usr/share/gems/2.5/ondemand/1.8.18/gems/actionpack-5.2.4.4/lib/abstract_controller/base.rb:134:in `process'
/opt/ood/ondemand/root/usr/share/gems/2.5/ondemand/1.8.18/gems/actionview-5.2.4.4/lib/action_view/rendering.rb:32:in `process'
/opt/ood/ondemand/root/usr/share/gems/2.5/ondemand/1.8.18/gems/actionpack-5.2.4.4/lib/action_controller/metal/live.rb:255:in `block (2 levels) in process'
/opt/ood/ondemand/root/usr/share/gems/2.5/ondemand/1.8.18/gems/activesupport-5.2.4.4/lib/active_support/dependencies/interlock.rb:42:in `block in running'
/opt/ood/ondemand/root/usr/share/gems/2.5/ondemand/1.8.18/gems/activesupport-5.2.4.4/lib/active_support/concurrency/share_lock.rb:162:in `sharing'
/opt/ood/ondemand/root/usr/share/gems/2.5/ondemand/1.8.18/gems/activesupport-5.2.4.4/lib/active_support/dependencies/interlock.rb:41:in `running'
/opt/ood/ondemand/root/usr/share/gems/2.5/ondemand/1.8.18/gems/actionpack-5.2.4.4/lib/action_controller/metal/live.rb:247:in `block in process'
/opt/ood/ondemand/root/usr/share/gems/2.5/ondemand/1.8.18/gems/actionpack-5.2.4.4/lib/action_controller/metal/live.rb:291:in `block in new_controller_thread'

What version of Slurm do you have?

You seem to be hitting an exception on this line (jobstatusdata.rb:124). I'm guessing it's because it cannot find the job's working directory in the Slurm output.
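
If you want to sanity-check that field from the command line, something like this should show whether Slurm is reporting a working directory for your jobs (just a rough sketch; %i is the job ID column, %Z is the working-directory column, and the -u filter is only there to keep the output short):

/usr/bin/squeue -u $USER -o "%i %Z" | head

If the directory column comes back empty or as (null) for your jobs, that would fit the exception above.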

On the server running the OnDemand software I am running Slurm 18.08; the cluster itself runs Slurm 20.02. I am able to run the various Slurm commands from the command line and they work as expected.

What does the output from this command look like when you run it from the OOD server itself? Note that I have "-M owens" here; you'll need to replace owens with your own cluster name. And with any copying and pasting you may have to tweak the quotes and so on.

/usr/bin/squeue \
  --all \
  --states=all \
  --noconvert \
  -o \
   "\\u001E%a\\u001F%A\\u001F%B\\u001F%c\\u001F%C\\u001F%d\\u001F%D\\u001F%e\\u001F%E\\u001F%f\\u001F%F\\u001F%g\\u001F%G\\u001F%h\\u001F%H\\u001F%i\\u001F%I\\u001F%j\\u001F%J\\u001F%k\\u001F%K\\u001F%l\\u001F%L\\u001F%m\\u001F%M\\u001F%n\\u001F%N\\u001F%o\\u001F%O\\u001F%q\\u001F%P\\u001F%Q\\u001F%r\\u001F%S\\u001F%t\\u001F%T\\u001F%u\\u001F%U\\u001F%v\\u001F%V\\u001F%w\\u001F%W\\u001F%x\\u001F%X\\u001F%y\\u001F%Y\\u001F%z\\u001F%Z\\u001F%b" \
  "-M" \
  "owens"

I get a lot of records, and they're admittedly not easy to read because we use an unusual field separator, but if you look closely you should see that \u001E is the record separator, so it comes first, followed by a series of <some characters>\u001F.

\u001Epzs0714\u001F12612121\u001Fo0808\u001F1\u001F1\u001F0\u001F1\u001F2021-01-12T11:52:11\u001F(null)\u001F(null)\u001F12612121\u001FPZS0714\u001F5515\u001FOK\u001F*\u001F12612121\u001F*\u001Fondemand/sys/dashboard/sys/bc_desktop/vdi\u001F*\u001Fstdout=/users/PZS0714/johrstrom/ondemand/data/sys/dashboard/batch_connect/sys/bc_desktop/vdi/output/a1f4cdcd-a3b0-4125-a80a-1755d024d83b/output.log\u001FN/A\u001F1:00:00\u001F51:42\u001F4315M\u001F8:18\u001F\u001Fo0808\u001F(null)\u001F0\u001Fowens-default\u001Fquick\u001F1000500241\u001FNone\u001F2021-01-12T10:52:11\u001FR\u001FRUNNING\u001Fjohrstrom\u001F30961\u001F(null)\u001F2021-01-12T10:52:09\u001F(null)\u001F(null)\u001F\u001FN/A\u001F0\u001F(null)\u001F*:*:*\u001F/users/PZS0714/johrstrom/ondemand/data/sys/dashboard/batch_connect/sys/bc_desktop/vdi/output/a1f4cdcd-a3b0-4125-a80a-1755d024d83b\u001FN/A
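
If you want to eyeball records like that more easily, you can swap the separators for something readable. This is only a readability trick and assumes the separators come through as the literal text \u001E / \u001F, as they do above; I've also trimmed the format string down to a handful of fields for the example:

/usr/bin/squeue --all --states=all --noconvert \
  -o "\u001E%a\u001F%A\u001F%j\u001F%T\u001F%Z" \
  | sed -e 's/\\u001F/ | /g' -e 's/\\u001E//g' \
  | head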

I would also search your /var/log/ondemand-nginx/<user>/error.log for squeue and see if there are any errors near that entry.
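
Something along these lines should pull out any squeue-related entries with a bit of surrounding context (assuming the default per-user NGINX log location; run it as, or substitute, the user hitting the error):

grep -n -i -B 2 -A 5 'squeue' /var/log/ondemand-nginx/$USER/error.log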

Running the above command gives the error:

slurm_load_jobs error: Socket timed out on send/recv operation

Strangely enough, if I remove the "-M" clustername flag, it works.

Also there are no squeue errors in the /var/log/ondemand-nginx/<user>/error.log file

OK, I think I see. Your v2.job.cluster is set to the hostname? It should be the ClusterName field from your slurm.conf.

Here’s ours:

v2:
  job:
    adapter: "slurm"
    cluster: "pitzer"
    host: "pitzer-slurm01.ten.osc.edu"
    lib: "/usr/lib64"
    bin: "/usr/bin"
    conf: "/etc/slurm/slurm.conf"

with ClusterName=pitzer in our /etc/slurm/slurm.conf.
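
To double-check what ClusterName Slurm is actually using, either of these should work (the second asks the running controller directly):

grep -i '^ClusterName' /etc/slurm/slurm.conf
scontrol show config | grep -i clustername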

You can remove this cluster attribute, but I believe it’s an optimization to limit queries to the slurm controller if you have multiple clusters.
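
And if you do keep the cluster attribute and use -M, one quick way to see which cluster names the controller/slurmdbd actually knows about is something like this (it assumes accounting via slurmdbd is set up, which multi-cluster -M requires anyway):

sacctmgr -n show clusters format=Cluster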

Removing the cluster attribute did the trick. Currently only running one cluster.

That was actually a typo; the cluster attribute and the host attribute were not the same. I mistyped when I created the post.