Load testing OOD?

Hi,

Is anyone doing load or performance testing with something like JMeter or Selenium? I saw a few mentions earlier, but I didn’t see anyone actually using them. I’d like to test the number of simultaneous users we expect to see in the fall and find out whether there are any problems with our environment or our OOD setup.

For load and performance testing, the website below recommends using JMeter. There are also many resources online (YouTube) that teach JMeter basics.
https://www.browserstack.com/guide/jmeter-vs-selenium#:~:text=JMeter%20and%20Selenium%20are%20both,browser%20testing%20of%20a%20website.

@keenandr what authentication mechanism are you using for your OnDemand instance?

@efranz We’re using OIDC with Google.

@keenandr @efranz Did you figure out load testing with OIDC?

We’ve found success using Locust. This gist outlines a script for testing users visiting the dashboard page and refreshing every 50 to 200 seconds. This gist outlines a script for testing users logging into Jupyter notebooks and refreshing every 30 to 90 seconds. Let me know if you have any questions.
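For a rough sense of the request rate that refresh intervals like these produce, here is a back-of-the-envelope sketch. It assumes each simulated user waits a uniformly random time between requests (which is how Locust's between() wait time behaves) and it ignores response time. The function name and the figure of 500 users are made up for illustration.

```python
def expected_request_rate(num_users, min_wait_s, max_wait_s):
    """Approximate steady-state requests per second for num_users
    simulated users who each wait a uniformly random time between
    min_wait_s and max_wait_s seconds before their next request."""
    mean_wait_s = (min_wait_s + max_wait_s) / 2
    return num_users / mean_wait_s


if __name__ == "__main__":
    # 500 is a hypothetical user count, not a number from our tests.
    print(expected_request_rate(500, 50, 200))  # dashboard refreshes
    print(expected_request_rate(500, 30, 90))   # notebook refreshes
```

Plugging in the number of users you expect in the fall gives a quick sanity check on how much traffic the test should generate before you run it.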

Load Testing

On a related note, here are some of our findings from load testing. We were surprised by the high load we found on our hosts. Here is an example chart; comparable results are reproducible across tests:

For reference, this is an old host with 64 cores and 251G of memory. We use this script for monitoring the host during tests:

#!/usr/bin/env bash

while true
do
  cat /proc/loadavg | ts | tee -a loadavg.csv
  free -m | grep Mem | tr -s " " | cut -d " " -f2,3,4,5,6,7 | ts | tee -a memory.csv
  iostat | awk '/avg-cpu/ {getline; $1=$1; print}' | ts | tee -a iostat.csv
  top -b -n 1 | head | awk -F' ' 'FNR == 3 {print $2, $4, $6, $8, $10, $12, $14}' | ts | tee -a cpu_monitor.csv
  sleep 1
done
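To summarize the load average log afterwards, something like the following sketch could work. It assumes each logged line ends with the five /proc/loadavg fields and that ts only prepends a timestamp, so fields are counted from the end of the line. The function names and the sample line are made up for illustration.

```python
def parse_loadavg_line(line):
    """Return the (1 min, 5 min, 15 min) load averages from one logged line.

    /proc/loadavg has five whitespace-separated fields; the timestamp
    from ts comes first, so we index from the end of the line.
    """
    fields = line.split()
    one, five, fifteen = (float(value) for value in fields[-5:-2])
    return one, five, fifteen


def peak_one_minute_load(lines):
    """Highest 1 minute load average among the non-empty lines."""
    return max(parse_loadavg_line(line)[0] for line in lines if line.strip())


def load_per_core(load, cores=64):
    """Load average divided by core count (64 cores on the host above)."""
    return load / cores


if __name__ == "__main__":
    # Hypothetical sample line, not real data from our tests.
    sample = ["Aug 10 11:55:14 98.12 75.40 50.02 70/2345 64735"]
    peak = peak_one_minute_load(sample)
    print(peak, load_per_core(peak))
```

To run it on the real log, pass the lines of loadavg.csv, for example `peak_one_minute_load(open("loadavg.csv"))`.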

Has anyone else had problems with high load average like this? This is what our thought process has been thus far:

Analysis

Noticeable metrics

  • The user space %cpu usage stays below 10% on the login node across tests
  • The kernel space %cpu usage goes above 95% on the login node across tests
  • iostat %idle stays above 95% on the login node across tests
  • Storage node maintains a low load average across tests
  • iostat %idle stays above 95% on the storage node across tests

Potential bottlenecks:

  • Compute resources. This is probably not the case because %us stays low throughout the tests.
  • Login node filesystem. This is probably not the case because I/O on the login node stays low throughout the tests.
  • Storage node. This is probably not the case because load and I/O on the storage node stay low throughout the tests.
  • Kernel intensive operations on the login node. This could be the bottleneck.

Finding a Potential Culprit

Looking at pidstat -l 5 for more information gives the following snippet (with many more ps processes omitted, note that the following logs are from a different day than the above graph):

Timestamp UID PID %us %sy %guest %CPU CPU Command Args1 Args2
11:55:14 6000346 64735 0.4 43.8 0 44.2 34 ps -opid,ppid,%cpu,rss,vsize,pgid,uid,command -p49618
11:55:14 6000208 64736 0.8 47 0 47.8 33 ps -opid,ppid,%cpu,rss,vsize,pgid,uid,command -p22353
11:55:14 6000242 64737 0.4 39.6 0 40 37 ps -opid,ppid,%cpu,rss,vsize,pgid,uid,command -p30591
11:55:14 6000218 64738 0.2 42.6 0 42.8 27 ps -opid,ppid,%cpu,rss,vsize,pgid,uid,command -p24753
11:55:14 6000256 64739 0.8 55.8 0 56.6 22 ps -opid,ppid,%cpu,rss,vsize,pgid,uid,command -p34137
11:55:14 6000255 64743 1 54.4 0 55.4 36 ps -opid,ppid,%cpu,rss,vsize,pgid,uid,command -p33905
11:55:14 6000239 64745 0.8 47.2 0 48 38 ps -opid,ppid,%cpu,rss,vsize,pgid,uid,command -p29924
11:55:14 6000278 64746 0.6 47.8 0 48.4 41 ps -opid,ppid,%cpu,rss,vsize,pgid,uid,command -p47448
11:55:14 6000342 64748 0.6 36.2 0 36.8 3 ps -opid,ppid,%cpu,rss,vsize,pgid,uid,command -p46552
11:55:14 6000275 64749 0.8 56.4 0 57.2 44 ps -opid,ppid,%cpu,rss,vsize,pgid,uid,command -p47040
11:55:14 6000314 64750 0.4 43.4 0 43.8 40 ps -opid,ppid,%cpu,rss,vsize,pgid,uid,command -p13080
11:55:14 6000230 64751 0.8 59 0 59.8 51 ps -opid,ppid,%cpu,rss,vsize,pgid,uid,command -p27705
11:55:14 6000220 64752 0.6 50.4 0 51 50 ps -opid,ppid,%cpu,rss,vsize,pgid,uid,command -p25261
11:55:14 6000290 64753 1 56.4 0 57.4 10 ps -opid,ppid,%cpu,rss,vsize,pgid,uid,command -p5993
11:55:14 6000237 64754 0.4 46 0 46.4 60 ps -opid,ppid,%cpu,rss,vsize,pgid,uid,command -p29368
11:55:14 6000219 64755 1 50.2 0 51.2 23 ps -opid,ppid,%cpu,rss,vsize,pgid,uid,command -p25023
11:55:14 6000313 64756 0.6 58.2 0 58.8 46 ps -opid,ppid,%cpu,rss,vsize,pgid,uid,command -p13349
11:55:14 6000243 64759 0.8 46 0 46.8 25 ps -opid,ppid,%cpu,rss,vsize,pgid,uid,command -p30841
11:55:14 6000269 64760 0.4 38.4 0 38.8 15 ps -opid,ppid,%cpu,rss,vsize,pgid,uid,command -p45167
11:55:14 6000335 64761 0.4 42.2 0 42.6 7 ps -opid,ppid,%cpu,rss,vsize,pgid,uid,command -p43093
11:55:14 6000327 64762 0.4 39 0 39.4 14 ps -opid,ppid,%cpu,rss,vsize,pgid,uid,command -p40609
11:55:14 6000261 64763 0.4 37 0 37.4 34 ps -opid,ppid,%cpu,rss,vsize,pgid,uid,command -p35589
11:55:14 6000283 64764 0.8 55.4 0 56.2 40 ps -opid,ppid,%cpu,rss,vsize,pgid,uid,command -p48767
11:55:14 6000297 64766 0.6 39.2 0 39.8 0 ps -opid,ppid,%cpu,rss,vsize,pgid,uid,command -p8446
11:55:14 6000312 64767 0.4 44 0 44.4 52 ps -opid,ppid,%cpu,rss,vsize,pgid,uid,command -p12868

The sum of the %sy values for this pidstat output is 6727.6, which is enough to saturate roughly 67 cores (6727.6 / 100), more than the 64 on this host. There is one ps call for each user, and each is a child process of its user’s Passenger core. This could explain the high load average. Poking around, it seems that Passenger’s ProcessMetricsCollector.h is making these ps calls every few seconds.
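The core estimate is just the summed %sy divided by 100. Here is a minimal sketch of that arithmetic, assuming the column layout shown in the snippet above (Timestamp UID PID %us %sy %guest %CPU CPU Command Args). The function names are made up, and the sample holds only the first three lines of the snippet.

```python
def total_sys_cpu(pidstat_lines, command="ps"):
    """Sum the %sy column over all rows whose Command matches."""
    total = 0.0
    for line in pidstat_lines:
        fields = line.split()
        # Index 4 is %sy and index 8 is Command in the layout above.
        if len(fields) > 8 and fields[8] == command:
            total += float(fields[4])
    return total


def cores_saturated(total_percent):
    """One fully busy core corresponds to 100% CPU."""
    return total_percent / 100


SAMPLE = [
    "11:55:14 6000346 64735 0.4 43.8 0 44.2 34 ps -opid,ppid,%cpu,rss,vsize,pgid,uid,command -p49618",
    "11:55:14 6000208 64736 0.8 47 0 47.8 33 ps -opid,ppid,%cpu,rss,vsize,pgid,uid,command -p22353",
    "11:55:14 6000242 64737 0.4 39.6 0 40 37 ps -opid,ppid,%cpu,rss,vsize,pgid,uid,command -p30591",
]

if __name__ == "__main__":
    print(total_sys_cpu(SAMPLE))    # these three ps calls alone
    print(cores_saturated(6727.6))  # the full total quoted above
```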

To check the idea that ps was behind the high load average, we ran the following script:

#!/usr/bin/env bash
while true; do
    pkill -c ps
    sleep 0.5
done

Doing so brought the load average down from ~100 to ~30, which implies that the ps commands are related to the high load average. Unfortunately, we haven’t been able to find a way to tune the frequency of the ps commands.

Next Steps

  • Has anyone else had a similar issue, or does anyone have suggestions?
  • Try turning off Passenger process metrics collector in order to confirm or reject the idea that it is behind the high load average.
  • Find a way to reduce the frequency of the ps calls.

Notes

  • This test represents atypical behavior. As mentioned in the OOD docs, most users will visit the dashboard, start a job, and leave the dashboard. This test could reflect the load that we would expect to see with X users visiting the dashboard within 5 minutes.

I was only able to put two links in my previous post. Here are some more:

I fixed the link issue: everyone should now default to trust level 1, so this restriction is no longer there. The Discourse defaults were too restrictive.

I don’t have any concrete feedback yet on the load test except that:

  1. It is awesome
  2. I wonder how easy it would be to switch to basic auth for the load test
  3. Will definitely look into Passenger process metrics collector and see how it impacts performance

I will think more about all of this and follow up.


Great, sounds good! I should clarify that the load tests we developed are tailored to LTI authentication, not OIDC. That said, it shouldn’t be especially difficult to switch to basic auth or any other auth. The Locust framework uses the python-requests library to carry out load tests, so switching the test to another authentication system is only as difficult as authenticating with python-requests. When I was figuring out how to authenticate LTI with python-requests, I used the Firefox Network Monitor to watch the requests my browser sent when logging in and tried to replicate them.

Here is the part of the load testing script relevant to authentication:

# NOTE this is where you make the necessary requests to authenticate your user
# NOTE this step will be specific to your authentication system
# NOTE requires "import time" at the top of the script
@task
def authenticate_user(self):

    # lti verification
    self.logger.debug("Posting to lti parse")
    self.client.post("/lti/launch", data=self.lti_request_data)

    # waiting page
    self.logger.debug("Visiting the waiting page")
    self.client.get("/verify_account_request/verify_account.html")

    # poll for success status of account before continuing, performing the same task that javascript would have on the waiting page
    while True:
        status = self.client.get("/verify_account_request/status").json()
        self.logger.info(status)

        # the status endpoint reports the string "true" once the account is ready
        if status["status"] == "true":
            break

        time.sleep(10)

The waiting page is specific to our LTI authentication implementation, so you should be able to reduce it down to something like the following for basic auth:

# NOTE this is where you make the necessary requests to authenticate your user
# NOTE this step will be specific to your authentication system
# NOTE requires "from requests.auth import HTTPBasicAuth" at the top of the script
@task
def authenticate_user(self):

    # basic auth; replace 'user' and 'pass' with a test account's credentials
    self.logger.debug("Authenticating user")
    self.client.post("/login", auth=HTTPBasicAuth('user', 'pass'))
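When comparing the test's requests against what the browser sends in the network monitor, it can help to know what basic auth actually puts on the wire. HTTPBasicAuth adds an Authorization header containing the base64-encoded "user:pass" pair. The sketch below builds the same header with only the standard library. The function name is made up, and whether your OnDemand front end accepts basic auth at a given path depends on your Apache configuration.

```python
import base64


def basic_auth_header(username, password):
    """Build the Authorization header that HTTP basic auth sends."""
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return {"Authorization": f"Basic {token}"}


if __name__ == "__main__":
    # 'user' and 'pass' are placeholders for a test account.
    print(basic_auth_header("user", "pass"))
```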

Here are some potential resources:

Other steps you might have to take:

  • Set up some test user accounts whose credentials can be used in the load tests

Happy to help, let me know if you have any questions.