Grafana integrated dashboards not displaying/configuration questions

Hi. I’m trying to get the Grafana integration working and I’m getting a little stuck.
I see that it requires the use of the OnDemand Clusters dashboard, so I have that installed and have been working to get that functional.

I have a Prometheus configuration set up, and that piece looks correct: the nodes are exporting data and Prometheus is storing it. I configured the prometheus.yml file to have this for each node:
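
Roughly along these lines (the node names and ports below are placeholders, not my exact config; node_exporter’s default port is 9100):

- job_name: node
  static_configs:
  - targets:
    # placeholder node names
    - node01.example.edu:9100
    - node02.example.edu:9100
  relabel_configs:
  # pull the short hostname out of __address__ into a host label
  - source_labels: [__address__]
    regex: "([^.]+)..*"
    replacement: "$1"
    target_label: host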

(It wasn’t documented that I needed to do this; I figured it out by looking at the variables in the Grafana dashboard.) Also, in the documentation the relabel_configs has [__address__] in quotes, and Prometheus (2.27.1) didn’t like that, but taking the quotes out made it work.

Now the CPU Load and Memory Usage graphs are loading, but CPU Usage has no data: it’s looking for node_cpu_load_system, which isn’t a metric being served up by Prometheus. It has node_cpu_seconds_total with the various modes, but not that particular metric. Is this an incompatibility with the version of node exporter? (I have version 1.1.2.)
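
For what it’s worth, the underlying data does seem to be there; a query along these lines (assuming the host label from my relabel config) gives per-node system CPU, just not under the name the dashboard wants:

avg by (host) (irate(node_cpu_seconds_total{mode="system"}[5m]))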

Also, for the Moab graphs: I’m using Slurm, and there’s no info in the documentation on what these graphs are looking for. I’m sure I can make it work for Slurm, but what Prometheus config or collection setup is in use, and what needs to be set to get valid data?

Lastly, the Active Jobs dashboard in OOD (2.0.8) does not show the integrated graphs. If I expand a job, I get a blank area where the graphs should be, along with the job info and a Detailed Metrics link. If I click that link, I get the Grafana page with the data.

I know that was a lot, but thanks for any help you can offer.

Also, before I go down the rabbit hole of installing the other exporters on my nodes (the cgroup and NVIDIA ones), is there a Prometheus config specific to them as well?

I’m interested in this info as well. I was never able to get these working in our setup and need to revisit it this summer.

@tdockendorf please advise. Do we use the node exporter and rules to make any new metrics?

I just pushed a new version of the OnDemand Clusters dashboard with SLURM dashboards instead of Moab, now using NVIDIA’s DCGM exporter. The SLURM exporter we use is a fork with major modifications; for the OnDemand Clusters dashboard the upstream repo may work, but I can’t guarantee that since we rely on the heavily modified fork.

This all works with Prometheus 2.26.0.

Here is an example config for the cgroup exporter that filters out process and Go metrics, since we run this on ~1400 compute nodes and don’t really care about those metrics:

- job_name: cgroup
  relabel_configs:
  # derive the short hostname from __address__ and store it in the host label
  - source_labels: [__address__]
    regex: "([^.]+)..*"
    replacement: "$1"
    target_label: host
  metric_relabel_configs:
  # drop Go runtime, process, and promhttp metrics on compute nodes
  - source_labels: [__name__,role]
    regex: "^(go|process|promhttp)_.*;compute"
    action: drop
  file_sd_configs:
  # scrape targets are generated into these files (see the Puppet example below)
  - files:
    - "/etc/prometheus/file_sd_config.d/cgroup_*.yaml"

We use Puppet to generate the actual scrape target configs; here is an example:

# cat /etc/prometheus/file_sd_config.d/cgroup_cgroup-p0001.yaml 
# this file is managed by puppet; changes will be overwritten
---
- targets:
  - p0001.ten.osc.edu:9306
  labels:
    host: p0001
    cluster: pitzer
    environment: production
    switch: eth-pitzer-rack03h2
    role: compute

If you put host in the scrape target labels like this, you don’t need the relabeling logic that generates the host label from __address__. Most of our exporters follow this pattern for generating the scrape configs. All our scrape intervals are 1 minute except GPFS, which is 3 minutes; with GPFS we also drop more things so we only keep exactly the metrics we care about:

- job_name: gpfs
  scrape_timeout: 2m
  scrape_interval: 3m
  relabel_configs:
  - source_labels: [__address__]
    regex: "([^.]+)..*"
    replacement: "$1"
    target_label: host
  metric_relabel_configs:
  - regex: "^(nodename)$"
    action: labeldrop
  - source_labels: [__name__,role]
    regex: gpfs_(mount|health|verbs)_status;compute
    action: drop
  - source_labels: [__name__,collector,role]
    regex: gpfs_exporter_(collect_error|collector_duration_seconds);(mmhealth|mount|verbs);compute
    action: drop
  - source_labels: [__name__,role]
    regex: "^(go|process|promhttp)_.*;compute"
    action: drop
  file_sd_configs:
  - files:
    - "/etc/prometheus/file_sd_config.d/gpfs_*.yaml"

For the CPU load we use recording rules to speed up loading, since our nodes have anywhere from 28 to 96 cores and per-core metrics take a long time to load when graphing whole clusters:

groups:
- name: node
  rules:
  - record: node:cpus:count
    expr: count by(host,cluster,role) (node_cpu_info)
  - record: node:cpu_load_user:avg5m
    expr: avg by (host,cluster,role)(irate(node_cpu_seconds_total{mode="user"}[5m]))
  - record: node:cpu_load_system:avg5m
    expr: avg by (host,cluster,role)(irate(node_cpu_seconds_total{mode="system"}[5m]))
  - record: node:cpu_load_iowait:avg5m
    expr: avg by (host,cluster,role)(irate(node_cpu_seconds_total{mode="iowait"}[5m]))
  - record: node:cpu_load_total:avg5m
    expr: 1 - avg by (host,cluster,role)(irate(node_cpu_seconds_total{mode="idle"}[5m]))
  - record: node:network_received_rate_bytes
    expr: irate(node_network_receive_bytes_total[5m])
  - record: node:network_transmit_rate_bytes
    expr: irate(node_network_transmit_bytes_total[5m])
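
In case it helps, recording rules like these just get loaded through rule_files in prometheus.yml (the path below is only an example); panels or ad-hoc queries can then use the recorded series, e.g. node:cpu_load_total:avg5m, instead of computing per-core rates at view time.

rule_files:
  - "/etc/prometheus/rules/*.yaml"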

Let us know if these things work for you.