Error on the 'active jobs' page, redo

I am running the latest ansible-ood and have configured the values in defaults/main.yml for our cluster. I also checked the documentation online, and I do have it defined in /opt/ood/config/clusters.d/<our_cluster>.yml for slurm. However, the only thing that comes up on the screen is: “No clusters found for cluster id: all”.
I have also submitted a job manually to slurm as the configured user, and it won’t show up on the page.
Are there any suggestions you can send over?

Yea, let’s see the configuration you’re using, either the produced file in clusters.d or the ansible configuration used to generate it. I assume all isn’t a cluster name you’re using?

From the ansible config:
## Manage cluster example, default undef
cluster:
  v2:
    metadata:
      title: "Apollo Cluster"
    login:
      host: "our hostname put here"
    job:
      adapter: slurm
      bin: /usr/local/slurm/bin

In the clusters.d file:

cluster:
  v2:
    metadata:
      title: "Apollo Test Cluster"
      url: "<hostname_here>"
      hidden: false
    login:
      host: "localhost"
    job:
      adapter: "slurm"
      cluster: "apollo"
      host: "node name here"
      bin: "/usr/local/slurm/bin"
      conf: "/usr/local/slurm/etc/slurm.conf"

I’m thinking this could be a slurm thing. You have some issues with your ansible configuration, which I’ve noted below, but slurm is likely the main issue because it looks like you’ve configured what’s in clusters.d manually and that looks fine.

What does the squeue --all command return when run in a shell? I’m thinking that’s where our issue lies, that slurm somehow doesn’t like this flag. Did you happen to change the SQUEUE_ALL environment variable?

On your ansible configuration: the ansible configuration is 1:1 with what gets written out, so the fact that the config is different from what exists on disk tells me something’s wrong there.

Also, your ansible config should be clusters (plural), and the next item in the map is the name of the cluster (and the name of the file created). Here’s a super simple example of 2 clusters, one on slurm and one on LSF.

clusters:
  titan:
    v2:
      job:
        adapter: slurm
  io:
    v2:
      job:
        adapter: lsf
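
To make the mapping explicit (a sketch, assuming default install paths), each key under clusters becomes its own file in clusters.d and the generated file starts at v2, so the titan entry above would come out roughly as:

```yaml
# clusters.d/titan.yml -- sketch of the file the titan entry generates
v2:
  job:
    adapter: slurm
```

io would get its own io.yml the same way, with the lsf adapter.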

I fixed the plural (added the s). Here is my squeue --all output:
squeue --all
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
13 batch sleep spock R 0:30 1 <hostname_here>

Is there something else I need to reset?

Sorry, you’re right. There is something wrong with the configuration you have in /etc/ood/config/clusters.d; specifically, what you have isn’t being read correctly. Be sure to restart your web server (from the Help dropdown at the top right) from time to time when you want things to be re-read.

If you still have issues after restarting, can you copy what you’ve configured in ansible here with the formatting kept intact? You keep formatting by wrapping it in a code fence and specifying yaml, like below.

```yaml
the_yaml: you want to write
```

For a fuller reference, here’s what my ansible configuration looks like for a new adapter I’m working on.

clusters:
  titan:
    v2:
      metadata:
        title: "{{ scheduler }}"
      login:
        host: localhost
      job:
        adapter: ccq
        cloud: 'gcp'
        scheduler: '{{ scheduler }}'
        image: "{{ cc_beta_compute_img }}"
      batch_connect:
        vnc:
          header: "#!/bin/bash"
          script_wrapper: "export PATH=$PATH:/opt/TurboVNC/bin\n%s"
          websockify_cmd: '/usr/bin/websockify'
        basic:
          header: "#!/bin/bash"

Here is what I have configured in ansible:

clusters:
  v2:
    metadata:
      title: "Apollo Test Cluster"
      url: "<hostname_here>"
      hidden: false
    login:
      host: "localhost"
    job:
      adapter: "slurm"
      cluster: "apollo"
      host: "<hostname_here>"
      bin: "/usr/local/slurm/bin"
      conf: "/usr/local/slurm/etc/slurm.conf"
    batch_connect:
      basic:
        script_wrapper: "module restore\n%s"
      vnc:
        script_wrapper: "module restore\nmodule load ondemand-vnc\n%s"

OK yea, you’re missing the middle key in clusters.<cluster name>.v2.

Like this config:

clusters:
  apollo: # this is the key here that specifies the filename
    v2:
      metadata:
        ...

That’ll write out apollo.yml with these contents:

v2:
  metadata:
    ...
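
Applied to what you posted, that means adding the cluster name as the middle key; roughly like this sketch, keeping all of your values as-is:

```yaml
clusters:
  apollo:          # the missing middle key -- this becomes the filename apollo.yml
    v2:
      metadata:
        title: "Apollo Test Cluster"
        url: "<hostname_here>"
        hidden: false
      login:
        host: "localhost"
      job:
        adapter: "slurm"
        cluster: "apollo"
        host: "<hostname_here>"
        bin: "/usr/local/slurm/bin"
        conf: "/usr/local/slurm/etc/slurm.conf"
      batch_connect:
        basic:
          script_wrapper: "module restore\n%s"
        vnc:
          script_wrapper: "module restore\nmodule load ondemand-vnc\n%s"
```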

We are online with slurm now! That worked!
Thanks