Grafana integrated dashboards not displaying/configuration questions

Hi. I’m trying to get the Grafana integration working and I’m getting a little stuck.
I see that it requires the use of the OnDemand Clusters dashboard, so I have that installed and have been working to get that functional.

I have a Prometheus configuration set up, and that piece looks correct: the nodes are exporting data and Prometheus is storing it. I configured the prometheus.yml file to have this for each node:
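
Roughly along these lines (the node names and ports below are placeholders, not my exact config; node_exporter’s default port is 9100):

- job_name: node
  static_configs:
  - targets:
    # placeholder node names
    - node01.example.edu:9100
    - node02.example.edu:9100
  relabel_configs:
  # pull the short hostname out of __address__ into a host label
  - source_labels: [__address__]
    regex: "([^.]+)..*"
    replacement: "$1"
    target_label: host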

(It wasn’t documented that I needed to do this; I figured it out by looking at the variables in the Grafana dashboard.) Also, in the documentation the relabel_configs has [__address__] in quotes, and Prometheus (2.27.1) didn’t like that, but taking the quotes out made it work.

Now the CPU Load and Memory Usage graphs are loading, but CPU Usage has no data: it’s looking for node_cpu_load_system, which isn’t a metric being served up by Prometheus. It has node_cpu_seconds_total with the various modes, but not that particular metric. Is this an incompatibility with the version of node exporter? (I have version 1.1.2.)
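
For what it’s worth, the underlying data does seem to be there; a query along these lines (assuming the host label from my relabel config) gives per-node system CPU, just not under the name the dashboard wants:

avg by (host) (irate(node_cpu_seconds_total{mode="system"}[5m]))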

Also, for the Moab graphs: I’m using Slurm, and there’s no info in the documentation on what these graphs are looking for. I’m sure I can make it work for Slurm, but what Prometheus config or collection setup is in use, and what needs to be set to get valid data?

Lastly, the Active Jobs dashboard in OOD (2.0.8) does not show the integrated graphs. If I expand a job, I get a blank area where the graphs should be, along with the job info and a Detailed Metrics link. If I click that link, I get the Grafana page with the data.

I know that was a lot, but thanks for any help you can offer.

Also, before I go down the rabbit hole of installing the other exporters on my nodes (the cgroup and NVIDIA ones), is there a Prometheus config specific to them as well?

I’m interested in this info as well. I was never able to get these working in our setup and need to revisit it this summer.

@tdockendorf please advise. Do we use the node exporter and rules to make any new metrics?

I just pushed a new version of the OnDemand Clusters dashboard with SLURM dashboards instead of Moab, now using NVIDIA’s DCGM exporter. The SLURM exporter we use is a fork with major modifications; for the OnDemand Clusters dashboard the upstream repo may work, but I can’t guarantee that since we rely on the heavily modified fork.

This all works with Prometheus 2.26.0.

Here is an example config for the cgroup exporter that filters out process and Go metrics, since we run this on ~1400 compute nodes and don’t really care about those metrics:

- job_name: cgroup
  relabel_configs:
  # derive the short hostname from __address__ and store it in the host label
  - source_labels: [__address__]
    regex: "([^.]+)..*"
    replacement: "$1"
    target_label: host
  metric_relabel_configs:
  # drop Go runtime, process, and promhttp metrics on compute nodes
  - source_labels: [__name__,role]
    regex: "^(go|process|promhttp)_.*;compute"
    action: drop
  file_sd_configs:
  # scrape targets are generated into these files (see the Puppet example below)
  - files:
    - "/etc/prometheus/file_sd_config.d/cgroup_*.yaml"

We use Puppet to generate the actual scrape target configs; here is an example:

# cat /etc/prometheus/file_sd_config.d/cgroup_cgroup-p0001.yaml 
# this file is managed by puppet; changes will be overwritten
---
- targets:
  - p0001.ten.osc.edu:9306
  labels:
    host: p0001
    cluster: pitzer
    environment: production
    switch: eth-pitzer-rack03h2
    role: compute

If you put host in the scrape target labels like this, you don’t need the relabeling logic that generates the host label from __address__. Most of our exporters follow this pattern for generating the scrape configs. All our scrape intervals are 1 minute except GPFS, which is 3 minutes; with GPFS we also drop more things so we only keep exactly the metrics we care about:

- job_name: gpfs
  scrape_timeout: 2m
  scrape_interval: 3m
  relabel_configs:
  - source_labels: [__address__]
    regex: "([^.]+)..*"
    replacement: "$1"
    target_label: host
  metric_relabel_configs:
  - regex: "^(nodename)$"
    action: labeldrop
  - source_labels: [__name__,role]
    regex: gpfs_(mount|health|verbs)_status;compute
    action: drop
  - source_labels: [__name__,collector,role]
    regex: gpfs_exporter_(collect_error|collector_duration_seconds);(mmhealth|mount|verbs);compute
    action: drop
  - source_labels: [__name__,role]
    regex: "^(go|process|promhttp)_.*;compute"
    action: drop
  file_sd_configs:
  - files:
    - "/etc/prometheus/file_sd_config.d/gpfs_*.yaml"

For the CPU load we use recording rules to speed up loading, since our nodes have anywhere from 28 to 96 cores and per-core metrics take a long time to load when graphing whole clusters:

groups:
- name: node
  rules:
  - record: node:cpus:count
    expr: count by(host,cluster,role) (node_cpu_info)
  - record: node:cpu_load_user:avg5m
    expr: avg by (host,cluster,role)(irate(node_cpu_seconds_total{mode="user"}[5m]))
  - record: node:cpu_load_system:avg5m
    expr: avg by (host,cluster,role)(irate(node_cpu_seconds_total{mode="system"}[5m]))
  - record: node:cpu_load_iowait:avg5m
    expr: avg by (host,cluster,role)(irate(node_cpu_seconds_total{mode="iowait"}[5m]))
  - record: node:cpu_load_total:avg5m
    expr: 1 - avg by (host,cluster,role)(irate(node_cpu_seconds_total{mode="idle"}[5m]))
  - record: node:network_received_rate_bytes
    expr: irate(node_network_receive_bytes_total[5m])
  - record: node:network_transmit_rate_bytes
    expr: irate(node_network_transmit_bytes_total[5m])
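
In case it helps, recording rules like these just get loaded through rule_files in prometheus.yml (the path below is only an example); panels or ad-hoc queries can then use the recorded series, e.g. node:cpu_load_total:avg5m, instead of computing per-core rates at view time.

rule_files:
  - "/etc/prometheus/rules/*.yaml"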

Let us know if these things work for you.