Load testing OOD?

keenandr · June 24, 2020, 7:42pm

Hi,

Is anyone doing load or performance testing with something like JMeter or Selenium? I saw a few mentions earlier, but I didn’t see anyone actually using them. I’d like to test with the number of simultaneous users we expect in the fall, to see whether there are any problems with our environment or our OOD setup.

matthu017 · June 25, 2020, 4:59pm

For load and performance testing, the website below recommends JMeter. There are also many resources online (e.g. YouTube) that teach JMeter basics.
https://www.browserstack.com/guide/jmeter-vs-selenium#:~:text=JMeter%20and%20Selenium%20are%20both,browser%20testing%20of%20a%20website.

efranz · June 25, 2020, 5:56pm

@keenandr what authentication mechanism are you using for your OnDemand instance?

keenandr · June 25, 2020, 6:19pm

@efranz We’re using OIDC with Google.

Ruborcalor · August 21, 2020, 6:44pm

@keenandr @efranz Did you figure out load testing with OIDC?

We’ve found success using Locust. This gist outlines a script for testing users visiting the dashboard page and refreshing every 50 to 200 seconds. This gist outlines a script for testing users logging into Jupyter notebooks and refreshing every 30 to 90 seconds. Let me know if you have any questions.

Load Testing

On a related note, here are some of our findings from load testing. We were surprised by the high load we found on our hosts. Here is an example chart; comparable results are reproducible across tests:

For reference, this is an old host with 64 cores and 251G of memory. We use this script for monitoring the host during tests:

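#!/usr/bin/env bash
# Samples load, memory, I/O, and CPU once per second and appends timestamped lines to log files.
# Requires ts (moreutils) and iostat (sysstat). Output is space-separated despite the .csv names.

while true
do
  # load averages
  cat /proc/loadavg | ts | tee -a loadavg.csv
  # memory columns from the "Mem" row of free, in MiB
  free -m | grep Mem | tr -s " " | cut -d " " -f2,3,4,5,6,7 | ts | tee -a memory.csv
  # the values line that follows the avg-cpu header
  iostat | awk '/avg-cpu/ {getline; $1=$1; print}' | ts | tee -a iostat.csv
  # CPU summary line (third line of top's batch output)
  top -b -n 1 | head | awk -F' ' 'FNR == 3 {print $2, $4, $6, $8, $10, $12, $14}' | ts | tee -a cpu_monitor.csv
  sleep 1
done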
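
For anyone who wants to summarize the log afterwards, here is a minimal sketch of pulling the peak 1-minute load average out of lines shaped like the loadavg.csv output. It assumes each line ends with the five standard /proc/loadavg fields, whatever timestamp ts put in front. The function name and the sample lines are made up for illustration and are not from our tests.

```python
def peak_one_minute_load(lines):
    """Return the highest 1-minute load average in ts-stamped loadavg lines."""
    peak = None
    for line in lines:
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or truncated lines
        # /proc/loadavg ends with: load1 load5 load15 running/total last_pid,
        # so counting from the end is independent of the timestamp format.
        load1 = float(fields[-5])
        if peak is None or load1 > peak:
            peak = load1
    return peak


# Invented sample lines, shaped like `cat /proc/loadavg | ts` output.
sample = [
    "Aug 21 11:55:13 97.32 80.10 60.05 70/2100 64735",
    "Aug 21 11:55:14 101.47 81.02 60.51 68/2098 64767",
    "Aug 21 11:55:15 99.80 81.33 60.77 71/2101 64790",
]

print(peak_one_minute_load(sample))
```

Passing an open file handle instead of the sample list should work the same way, since the function only iterates over lines.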

Has anyone else had problems with high load average like this? This is what our thought process has been thus far:

Analysis

Noticeable metrics

The user space %cpu usage stays below 10% on the login node across tests
The kernel space %cpu usage goes above 95% on the login node across tests
iostat %idle stays above 95% on the login node across tests
Storage node maintains a low load average across tests
iostat %idle stays above 95% on the storage node across tests

Potential bottlenecks:

Compute resources. This is probably not the case because %us stays low throughout the tests.
Login node filesystem. This is probably not the case because I/O on the login node stays low throughout the tests.
Storage node. This is probably not the case because load and I/O on the storage node stay low throughout the tests.
Kernel intensive operations on the login node. This could be the bottleneck.
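
To make the user-space versus kernel-space comparison concrete, here is a rough sketch of how the split can be computed from two samples of the aggregate "cpu" line in /proc/stat. This is not the script we used (we read the numbers from top and iostat); the function names and the sample counters are invented for illustration.

```python
def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into a list of tick counters."""
    fields = line.split()
    if fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    # Field order: user nice system idle iowait irq softirq steal (guest fields ignored,
    # since guest time is already counted inside user time).
    return [int(value) for value in fields[1:9]]


def cpu_split(before, after):
    """Return user/system/idle/iowait percentages between two /proc/stat samples."""
    deltas = [b - a for a, b in zip(parse_cpu_line(before), parse_cpu_line(after))]
    total = sum(deltas)
    user, _nice, system, idle, iowait = deltas[:5]
    return {
        "user": 100 * user / total,
        "system": 100 * system / total,
        "idle": 100 * idle / total,
        "iowait": 100 * iowait / total,
    }


# Invented counters: most of the elapsed ticks land in the "system" column.
before = "cpu 100 0 100 800 0 0 0 0 0 0"
after = "cpu 110 0 1050 840 0 0 0 0 0 0"
print(cpu_split(before, after))
```

A pattern like this one (low user, high system) is what points away from the application code itself and toward kernel-heavy work such as process creation or /proc scans.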

Finding a Potential Culprit

Looking at pidstat -l 5 for more information gives the following snippet (with many more ps processes omitted, note that the following logs are from a different day than the above graph):

Timestamp	UID	PID	%us	%sy	%CPU	CPU	Command	Args1	Args2
11:55:14	6000346	64735	0.4	43.8	44.2	34	ps	-opid,ppid,%cpu,rss,vsize,pgid,uid,command	-p49618
11:55:14	6000208	64736	0.8	47	47.8	33	ps	-opid,ppid,%cpu,rss,vsize,pgid,uid,command	-p22353
11:55:14	6000242	64737	0.4	39.6	40	37	ps	-opid,ppid,%cpu,rss,vsize,pgid,uid,command	-p30591
11:55:14	6000218	64738	0.2	42.6	42.8	27	ps	-opid,ppid,%cpu,rss,vsize,pgid,uid,command	-p24753
11:55:14	6000256	64739	0.8	55.8	56.6	22	ps	-opid,ppid,%cpu,rss,vsize,pgid,uid,command	-p34137
11:55:14	6000255	64743	1	54.4	55.4	36	ps	-opid,ppid,%cpu,rss,vsize,pgid,uid,command	-p33905
11:55:14	6000239	64745	0.8	47.2	48	38	ps	-opid,ppid,%cpu,rss,vsize,pgid,uid,command	-p29924
11:55:14	6000278	64746	0.6	47.8	48.4	41	ps	-opid,ppid,%cpu,rss,vsize,pgid,uid,command	-p47448
11:55:14	6000342	64748	0.6	36.2	36.8	3	ps	-opid,ppid,%cpu,rss,vsize,pgid,uid,command	-p46552
11:55:14	6000275	64749	0.8	56.4	57.2	44	ps	-opid,ppid,%cpu,rss,vsize,pgid,uid,command	-p47040
11:55:14	6000314	64750	0.4	43.4	43.8	40	ps	-opid,ppid,%cpu,rss,vsize,pgid,uid,command	-p13080
11:55:14	6000230	64751	0.8	59	59.8	51	ps	-opid,ppid,%cpu,rss,vsize,pgid,uid,command	-p27705
11:55:14	6000220	64752	0.6	50.4	51	50	ps	-opid,ppid,%cpu,rss,vsize,pgid,uid,command	-p25261
11:55:14	6000290	64753	1	56.4	57.4	10	ps	-opid,ppid,%cpu,rss,vsize,pgid,uid,command	-p5993
11:55:14	6000237	64754	0.4	46	46.4	60	ps	-opid,ppid,%cpu,rss,vsize,pgid,uid,command	-p29368
11:55:14	6000219	64755	1	50.2	51.2	23	ps	-opid,ppid,%cpu,rss,vsize,pgid,uid,command	-p25023
11:55:14	6000313	64756	0.6	58.2	58.8	46	ps	-opid,ppid,%cpu,rss,vsize,pgid,uid,command	-p13349
11:55:14	6000243	64759	0.8	46	46.8	25	ps	-opid,ppid,%cpu,rss,vsize,pgid,uid,command	-p30841
11:55:14	6000269	64760	0.4	38.4	38.8	15	ps	-opid,ppid,%cpu,rss,vsize,pgid,uid,command	-p45167
11:55:14	6000335	64761	0.4	42.2	42.6	7	ps	-opid,ppid,%cpu,rss,vsize,pgid,uid,command	-p43093
11:55:14	6000327	64762	0.4	39	39.4	14	ps	-opid,ppid,%cpu,rss,vsize,pgid,uid,command	-p40609
11:55:14	6000261	64763	0.4	37	37.4	34	ps	-opid,ppid,%cpu,rss,vsize,pgid,uid,command	-p35589
11:55:14	6000283	64764	0.8	55.4	56.2	40	ps	-opid,ppid,%cpu,rss,vsize,pgid,uid,command	-p48767
11:55:14	6000297	64766	0.6	39.2	39.8	0	ps	-opid,ppid,%cpu,rss,vsize,pgid,uid,command	-p8446
11:55:14	6000312	64767	0.4	44	44.4	52	ps	-opid,ppid,%cpu,rss,vsize,pgid,uid,command	-p12868

The sum of the %sy values for the full pidstat output (including the omitted rows) is 6727.6, which would saturate around 68 cores. There is one ps call for each user. Each is a child process of its user’s Passenger core. This could explain the high load average. Poking around, it seems as though Passenger’s ProcessMetricsCollector.h is making these ps calls every few seconds.
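
The arithmetic behind the core estimate can be sketched as follows, assuming pidstat’s default behaviour where 100% corresponds to one fully busy CPU. The helper name is hypothetical; the three rows are copied from the snippet above, and the column position of %sy is taken from that header.

```python
SY_COLUMN = 4  # Timestamp, UID, PID, %us, %sy, ... (tab-separated, as in the snippet)


def kernel_cores(rows):
    """Sum the %sy column and convert it to the number of fully busy cores."""
    total_sy = sum(float(row.split("\t")[SY_COLUMN]) for row in rows)
    return total_sy / 100


rows = [
    "11:55:14\t6000346\t64735\t0.4\t43.8\t44.2\t34\tps\t-opid,ppid,%cpu,rss,vsize,pgid,uid,command\t-p49618",
    "11:55:14\t6000208\t64736\t0.8\t47\t47.8\t33\tps\t-opid,ppid,%cpu,rss,vsize,pgid,uid,command\t-p22353",
    "11:55:14\t6000242\t64737\t0.4\t39.6\t40\t37\tps\t-opid,ppid,%cpu,rss,vsize,pgid,uid,command\t-p30591",
]

# Three ps processes alone account for a bit more than one core of kernel time.
print(round(kernel_cores(rows), 3))

# The same conversion applied to the reported total of 6727.6.
print(round(6727.6 / 100, 1))
```

Applied to the reported total, the conversion gives a little over 67 cores, in line with the rough figure quoted above.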

To check the idea that ps was behind the high load average, we ran the following script:

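#!/usr/bin/env bash
# Repeatedly kill ps processes; -c prints how many processes matched.
# NOTE without -x the pattern matches any process whose name contains "ps".
while true; do
    pkill -c ps
    sleep 0.5
done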

Doing so brought the load average down from ~100 to ~30, which suggests that the ps commands are related to the high load average. Unfortunately, we haven’t been able to find a way to tune the frequency of the ps commands.

Next Steps

Has anyone else had a similar issue and have any suggestions?
Try turning off Passenger process metrics collector in order to confirm or reject the idea that it is behind the high load average.
Find a way to reduce the frequency of the ps calls.

Notes

This test represents atypical behavior. As mentioned in the OOD docs, most users will visit the dashboard, start a job, and leave the dashboard. This test could reflect the load that we would expect to see with X users visiting the dashboard within 5 minutes.

Ruborcalor · August 21, 2020, 6:54pm

I was only able to put two links in my previous post. Here are some more:

efranz · August 29, 2020, 6:43pm

I fixed the link issue - everyone should now default to trust level 1, so this restriction is no longer there. The Discourse defaults were too restrictive.

I don’t have any concrete feedback yet on the load test except that:

It is awesome
I wonder how easy it would be to switch to basic auth for the load test
Will definitely look into Passenger process metrics collector and see how it impacts performance

I will think more about all of this and follow up.

Ruborcalor · August 31, 2020, 8:24pm

Great, sounds good! I should clarify that the load tests we developed are tailored to LTI authentication, not OIDC. That being said, it shouldn’t be especially difficult to switch to basic auth or any other auth. The Locust framework uses the python-requests library to carry out load tests, so switching the test to another authentication system is only as difficult as authenticating with python-requests. When I was figuring out how to authenticate LTI with python-requests, I used the Firefox Network Monitor to watch the requests being sent from my browser when logging in and tried to replicate them.

Here is the part of the load testing script relevant to authentication:

# NOTE this is where you make the necessary requests to authenticate your user
# NOTE this step will be specific to your authentication system
@task
def authenticate_user(self):

    # lti verification
    self.logger.debug("Posting to lti parse")
    self.client.post("/lti/launch", data=self.lti_request_data)

    # waiting page
    self.logger.debug("Visiting the waiting page")
    self.client.get("/verify_account_request/verify_account.html")

    # poll for the account's success status before continuing, doing what the javascript on the waiting page would do
    while True:
        status = self.client.get("/verify_account_request/status").json()
        self.logger.info(status)

        if status["status"] == "true":
            break

        time.sleep(10)

The waiting page is specific to our LTI authentication implementation, so you should be able to reduce it down to something like the following for basic auth:

from requests.auth import HTTPBasicAuth

# NOTE this is where you make the necessary requests to authenticate your user
# NOTE this step will be specific to your authentication system
@task
def authenticate_user(self):

    # basic auth
    self.logger.debug("Authenticating user")
    self.client.post("/login", auth=HTTPBasicAuth('user', 'pass'))
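
For reference, basic auth just adds an Authorization header carrying the base64-encoded "user:password" pair, which is what HTTPBasicAuth does for you. Here is a small standard-library sketch of building that header by hand; the helper name is made up, and 'user' / 'pass' are the same placeholder credentials as in the snippet above.

```python
import base64


def basic_auth_header(username, password):
    """Build the value of the Authorization header for HTTP basic auth."""
    credentials = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(credentials).decode("ascii")


# Placeholder credentials; real tests would use dedicated test accounts.
print(basic_auth_header("user", "pass"))
```

This can be handy when comparing what the load test sends against what the browser sends in the Firefox Network Monitor.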

Here are some potential resources:

Other steps you might have to take:

Set up some test user accounts whose credentials can be used in the load tests

Happy to help; let me know if you have any questions.

jeff.ohrstrom · January 27, 2021, 7:45pm

Thanks @Ruborcalor for your contribution! We’re moving this to feature request/roadmap so that we can work on enabling it ourselves and maybe making the changes you’ve proposed.
