Active Jobs not displaying any jobs in the queue

Open OnDemand 1.8.18
CentOS 7
Slurm

When I try to use the Active Jobs page, I am only presented with a single row where the jobs should be displayed; it shows null for the name, user, and queue fields, and none of the jobs in the queue are shown.

Contents of the cluster config file, /etc/ood/config/clusters.d/clustername.yml:

v2:
  metadata:
    title: "Cluster Title"
    url: "https://hostname.fqdn"
    hidden: false
  login:
    host: "hostname.cluster.fqdn"
  job:
    adapter: "slurm"
    cluster: "hostname"
    bin: "/bin"
    conf: "/etc/slurm/slurm.conf"
    copy_environment: true
  batch_connect:
    basic:
      script_wrapper: |
        module purge
        %s
      set_host: "host=$(hostname -A | awk '{print $1}')"

Where are the logs that I can examine for what is causing this failure or what am I missing from my config to have this work correctly?

-Saj-

Some more information to add: checking the Active Jobs page again today, I get the following block of error messages in the browser:

====================================================
No job details available.
/var/www/ood/apps/sys/activejobs/app/models/jobstatusdata.rb:124:in `initialize'
/var/www/ood/apps/sys/activejobs/app/models/jobstatusdata.rb:124:in `new'
/var/www/ood/apps/sys/activejobs/app/models/jobstatusdata.rb:124:in `extended_data_slurm'
/var/www/ood/apps/sys/activejobs/app/models/jobstatusdata.rb:44:in `initialize'
/var/www/ood/apps/sys/activejobs/app/controllers/jobs_controller.rb:78:in `new'
/var/www/ood/apps/sys/activejobs/app/controllers/jobs_controller.rb:78:in `get_job'
/var/www/ood/apps/sys/activejobs/app/controllers/jobs_controller.rb:31:in `block (2 levels) in json'
/opt/ood/ondemand/root/usr/share/gems/2.5/ondemand/1.8.18/gems/actionpack-5.2.4.4/lib/action_controller/metal/mime_responds.rb:203:in `respond_to'
/var/www/ood/apps/sys/activejobs/app/controllers/jobs_controller.rb:24:in `json'
/opt/ood/ondemand/root/usr/share/gems/2.5/ondemand/1.8.18/gems/actionpack-5.2.4.4/lib/action_controller/metal/basic_implicit_render.rb:6:in `send_action'
/opt/ood/ondemand/root/usr/share/gems/2.5/ondemand/1.8.18/gems/actionpack-5.2.4.4/lib/abstract_controller/base.rb:194:in `process_action'
/opt/ood/ondemand/root/usr/share/gems/2.5/ondemand/1.8.18/gems/actionpack-5.2.4.4/lib/action_controller/metal/rendering.rb:30:in `process_action'
/opt/ood/ondemand/root/usr/share/gems/2.5/ondemand/1.8.18/gems/actionpack-5.2.4.4/lib/abstract_controller/callbacks.rb:42:in `block in process_action'
/opt/ood/ondemand/root/usr/share/gems/2.5/ondemand/1.8.18/gems/activesupport-5.2.4.4/lib/active_support/callbacks.rb:132:in `run_callbacks'
/opt/ood/ondemand/root/usr/share/gems/2.5/ondemand/1.8.18/gems/actionpack-5.2.4.4/lib/abstract_controller/callbacks.rb:41:in `process_action'
/opt/ood/ondemand/root/usr/share/gems/2.5/ondemand/1.8.18/gems/actionpack-5.2.4.4/lib/action_controller/metal/rescue.rb:22:in `process_action'
/opt/ood/ondemand/root/usr/share/gems/2.5/ondemand/1.8.18/gems/actionpack-5.2.4.4/lib/action_controller/metal/instrumentation.rb:34:in `block in process_action'
/opt/ood/ondemand/root/usr/share/gems/2.5/ondemand/1.8.18/gems/activesupport-5.2.4.4/lib/active_support/notifications.rb:168:in `block in instrument'
/opt/ood/ondemand/root/usr/share/gems/2.5/ondemand/1.8.18/gems/activesupport-5.2.4.4/lib/active_support/notifications/instrumenter.rb:23:in `instrument'
/opt/ood/ondemand/root/usr/share/gems/2.5/ondemand/1.8.18/gems/activesupport-5.2.4.4/lib/active_support/notifications.rb:168:in `instrument'
/opt/ood/ondemand/root/usr/share/gems/2.5/ondemand/1.8.18/gems/actionpack-5.2.4.4/lib/action_controller/metal/instrumentation.rb:32:in `process_action'
/opt/ood/ondemand/root/usr/share/gems/2.5/ondemand/1.8.18/gems/actionpack-5.2.4.4/lib/action_controller/metal/params_wrapper.rb:256:in `process_action'
/opt/ood/ondemand/root/usr/share/gems/2.5/ondemand/1.8.18/gems/actionpack-5.2.4.4/lib/abstract_controller/base.rb:134:in `process'
/opt/ood/ondemand/root/usr/share/gems/2.5/ondemand/1.8.18/gems/actionview-5.2.4.4/lib/action_view/rendering.rb:32:in `process'
/opt/ood/ondemand/root/usr/share/gems/2.5/ondemand/1.8.18/gems/actionpack-5.2.4.4/lib/action_controller/metal/live.rb:255:in `block (2 levels) in process'
/opt/ood/ondemand/root/usr/share/gems/2.5/ondemand/1.8.18/gems/activesupport-5.2.4.4/lib/active_support/dependencies/interlock.rb:42:in `block in running'
/opt/ood/ondemand/root/usr/share/gems/2.5/ondemand/1.8.18/gems/activesupport-5.2.4.4/lib/active_support/concurrency/share_lock.rb:162:in `sharing'
/opt/ood/ondemand/root/usr/share/gems/2.5/ondemand/1.8.18/gems/activesupport-5.2.4.4/lib/active_support/dependencies/interlock.rb:41:in `running'
/opt/ood/ondemand/root/usr/share/gems/2.5/ondemand/1.8.18/gems/actionpack-5.2.4.4/lib/action_controller/metal/live.rb:247:in `block in process'
/opt/ood/ondemand/root/usr/share/gems/2.5/ondemand/1.8.18/gems/actionpack-5.2.4.4/lib/action_controller/metal/live.rb:291:in `block in new_controller_thread'

What version of Slurm do you have?

You seem to be hitting an exception on this line (jobstatusdata.rb:124). I'm guessing it's because it cannot find the job's working directory in the Slurm output.
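
If you want to sanity-check that field from the command line, something like this should show whether Slurm is reporting a working directory for your jobs (just a rough sketch; %i is the job ID column, %Z is the working-directory column, and the -u filter is only there to keep the output short):

/usr/bin/squeue -u $USER -o "%i %Z" | head

If the directory column comes back empty or as (null) for your jobs, that would fit the exception above.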

On the server running the OnDemand software I am running Slurm 18.08; the cluster itself runs Slurm 20.02. I am able to run the various Slurm commands from the command line and they work as expected.

What does the output from this command look like when you run it from the OOD server itself? Note that I have "-M owens" here; you'll need to replace owens with your own cluster name. And with any copying and pasting you may have to tweak the quotes and so on.

/usr/bin/squeue \
  --all \
  --states=all \
  --noconvert \
  -o \
   "\\u001E%a\\u001F%A\\u001F%B\\u001F%c\\u001F%C\\u001F%d\\u001F%D\\u001F%e\\u001F%E\\u001F%f\\u001F%F\\u001F%g\\u001F%G\\u001F%h\\u001F%H\\u001F%i\\u001F%I\\u001F%j\\u001F%J\\u001F%k\\u001F%K\\u001F%l\\u001F%L\\u001F%m\\u001F%M\\u001F%n\\u001F%N\\u001F%o\\u001F%O\\u001F%q\\u001F%P\\u001F%Q\\u001F%r\\u001F%S\\u001F%t\\u001F%T\\u001F%u\\u001F%U\\u001F%v\\u001F%V\\u001F%w\\u001F%W\\u001F%x\\u001F%X\\u001F%y\\u001F%Y\\u001F%z\\u001F%Z\\u001F%b" \
  "-M" \
  "owens"

I get a lot of records, and they're admittedly not easy to read because we use an unusual field separator, but if you look closely you should see that \u001E is the record separator, so it comes first, followed by a series of <some characters>\u001F.

\u001Epzs0714\u001F12612121\u001Fo0808\u001F1\u001F1\u001F0\u001F1\u001F2021-01-12T11:52:11\u001F(null)\u001F(null)\u001F12612121\u001FPZS0714\u001F5515\u001FOK\u001F*\u001F12612121\u001F*\u001Fondemand/sys/dashboard/sys/bc_desktop/vdi\u001F*\u001Fstdout=/users/PZS0714/johrstrom/ondemand/data/sys/dashboard/batch_connect/sys/bc_desktop/vdi/output/a1f4cdcd-a3b0-4125-a80a-1755d024d83b/output.log\u001FN/A\u001F1:00:00\u001F51:42\u001F4315M\u001F8:18\u001F\u001Fo0808\u001F(null)\u001F0\u001Fowens-default\u001Fquick\u001F1000500241\u001FNone\u001F2021-01-12T10:52:11\u001FR\u001FRUNNING\u001Fjohrstrom\u001F30961\u001F(null)\u001F2021-01-12T10:52:09\u001F(null)\u001F(null)\u001F\u001FN/A\u001F0\u001F(null)\u001F*:*:*\u001F/users/PZS0714/johrstrom/ondemand/data/sys/dashboard/batch_connect/sys/bc_desktop/vdi/output/a1f4cdcd-a3b0-4125-a80a-1755d024d83b\u001FN/A
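
If you want to eyeball records like that more easily, you can swap the separators for something readable. This is only a readability trick and assumes the separators come through as the literal text \u001E / \u001F, as they do above; I've also trimmed the format string down to a handful of fields for the example:

/usr/bin/squeue --all --states=all --noconvert \
  -o "\u001E%a\u001F%A\u001F%j\u001F%T\u001F%Z" \
  | sed -e 's/\\u001F/ | /g' -e 's/\\u001E//g' \
  | head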

I would also search your /var/log/ondemand-nginx/<user>/error.log for squeue and see if there are any errors near that entry.
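
Something along these lines should pull out any squeue-related entries with a bit of surrounding context (assuming the default per-user NGINX log location; run it as, or substitute, the user hitting the error):

grep -n -i -B 2 -A 5 'squeue' /var/log/ondemand-nginx/$USER/error.log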

Running the above command gives the error:

slurm_load_jobs error: Socket timed out on send/recv operation

Strangely enough, if I remove the "-M" clustername flag, it works.

Also there are no squeue errors in the /var/log/ondemand-nginx/<user>/error.log file

OK, I think I see. Your v2.job.cluster is set to the hostname? It should be the ClusterName field from your slurm.conf.

Here’s ours:

v2:
  job:
    adapter: "slurm"
    cluster: "pitzer"
    host: "pitzer-slurm01.ten.osc.edu"
    lib: "/usr/lib64"
    bin: "/usr/bin"
    conf: "/etc/slurm/slurm.conf"

with ClusterName=pitzer in our /etc/slurm/slurm.conf.
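
To double-check what ClusterName Slurm is actually using, either of these should work (the second asks the running controller directly):

grep -i '^ClusterName' /etc/slurm/slurm.conf
scontrol show config | grep -i clustername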

You can remove this cluster attribute, but I believe it’s an optimization to limit queries to the slurm controller if you have multiple clusters.
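
And if you do keep the cluster attribute and use -M, one quick way to see which cluster names the controller/slurmdbd actually knows about is something like this (it assumes accounting via slurmdbd is set up, which multi-cluster -M requires anyway):

sacctmgr -n show clusters format=Cluster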

Removing the cluster attribute did the trick. Currently only running one cluster.

That was actually a typo; the cluster attribute and the host attribute were not the same. I mistyped when I created the post.