Error on the 'active jobs' page, redo

I am running the latest ansible-ood and have configured the values in defaults/main.yml for our cluster. I also checked the documentation online, and I do have it defined in /opt/ood/config/clusters.d/<our_cluster>.yml for slurm. However, the only thing that comes up on the screen is: “No clusters found for cluster id: all”.
I have also submitted a job manually to slurm as the configured user, and it won’t show up on the page.
Are there any suggestions you can send over?

Yea, let’s see the configuration you’re using, either the produced file in clusters.d or the ansible configuration used to generate it. I assume all isn’t a cluster name you’re using?

From the ansible config:
## Manage cluster example, default undef
cluster:
  v2:
    metadata:
      title: "Apollo Cluster"
    login:
      host: "our hostname put here"
    job:
      adapter: slurm
      bin: /usr/local/slurm/bin

In the clusters.d file:

cluster:
  v2:
    metadata:
      title: "Apollo Test Cluster"
      url: "<hostname_here>"
      hidden: false
    login:
      host: "localhost"
    job:
      adapter: "slurm"
      cluster: "apollo"
      host: "node name here"
      bin: "/usr/local/slurm/bin"
      conf: "/usr/local/slurm/etc/slurm.conf"

I’m thinking this could be a slurm thing. You have some issues with your ansible configuration, which I’ve noted below, but slurm is likely the main issue because it looks like you’ve configured what’s in clusters.d manually and that looks fine.

What does the squeue --all command return when run in a shell? I’m thinking that’s where our issue lies, that slurm somehow doesn’t like this flag. Did you happen to change the SQUEUE_ALL environment variable?

On your ansible configuration: the ansible configuration is 1:1 with what gets written out, so the fact that the config is different from what exists on disk tells me something’s wrong there.

Also, your ansible config should be clusters (plural), and the next item in the map is the name of the cluster (and the name of the file created). Here’s a super simple example of 2 clusters, one on slurm and one on LSF.

clusters:
  titan:
    v2:
      job:
        adapter: slurm
  io:
    v2:
      job:
        adapter: lsf
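
To make the mapping explicit (a sketch, assuming default install paths), each key under clusters becomes its own file in clusters.d and the generated file starts at v2, so the titan entry above would come out roughly as:

```yaml
# clusters.d/titan.yml -- sketch of the file the titan entry generates
v2:
  job:
    adapter: slurm
```

io would get its own io.yml the same way, with the lsf adapter.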

I fixed the plural (added the s). Here is my squeue --all output:
squeue --all
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
13 batch sleep spock R 0:30 1 <hostname_here>

Is there something else I need to reset?

Sorry, you’re right. There is something wrong with the configuration you have in /etc/ood/config/clusters.d; specifically, what you have isn’t being read correctly. Be sure to restart your web server (from the Help dropdown at the top right) from time to time when you want things to be re-read.

If you still have issues after restarting, can you copy what you’ve configured in ansible here with the formatting kept intact? You keep formatting by wrapping it in a code fence and specifying yaml, like below.

```yaml
the_yaml: you want to write
```

For a fuller reference, here’s what my ansible configuration looks like for a new adapter I’m working on.

clusters:
  titan:
    v2:
      metadata:
        title: "{{ scheduler }}"
      login:
        host: localhost
      job:
        adapter: ccq
        cloud: 'gcp'
        scheduler: '{{ scheduler }}'
        image: "{{ cc_beta_compute_img }}"
      batch_connect:
        vnc:
          header: "#!/bin/bash"
          script_wrapper: "export PATH=$PATH:/opt/TurboVNC/bin\n%s"
          websockify_cmd: '/usr/bin/websockify'
        basic:
          header: "#!/bin/bash"

Here is what I have configured in ansible:

clusters:
  v2:
    metadata:
      title: "Apollo Test Cluster"
      url: "<hostname_here>"
      hidden: false
    login:
      host: "localhost"
    job:
      adapter: "slurm"
      cluster: "apollo"
      host: "<hostname_here>"
      bin: "/usr/local/slurm/bin"
      conf: "/usr/local/slurm/etc/slurm.conf"
    batch_connect:
      basic:
        script_wrapper: "module restore\n%s"
      vnc:
        script_wrapper: "module restore\nmodule load ondemand-vnc\n%s"

OK yea, you’re missing the middle key in clusters.<cluster name>.v2.

Like this config:

clusters:
  apollo: # this is the key here that specifies the filename
    v2:
      metadata:
        ...

That’ll write out apollo.yml with these contents:

v2:
  metadata:
    ...
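
Applied to what you posted, that means adding the cluster name as the middle key; roughly like this sketch, keeping all of your values as-is:

```yaml
clusters:
  apollo:          # the missing middle key -- this becomes the filename apollo.yml
    v2:
      metadata:
        title: "Apollo Test Cluster"
        url: "<hostname_here>"
        hidden: false
      login:
        host: "localhost"
      job:
        adapter: "slurm"
        cluster: "apollo"
        host: "<hostname_here>"
        bin: "/usr/local/slurm/bin"
        conf: "/usr/local/slurm/etc/slurm.conf"
      batch_connect:
        basic:
          script_wrapper: "module restore\n%s"
        vnc:
          script_wrapper: "module restore\nmodule load ondemand-vnc\n%s"
```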

We are online with slurm now! That worked!
Thanks