Latency issues and timeout in web application

All,

We have been having some latency issues with most accounts when signing in to the Open OnDemand portal. We are using LDAP/AD accounts for auth. Usually it times out and gives me an error:

Details:
Web application could not be started by the Phusion Passenger application server.
Please read the Passenger log file (search for the Error ID) to find the details of the error.

I seem to have more problems with my AD account than any but it’s really intermittent and I can’t seem to peg it down. I was able to sign in with a service account I created in order to do LDAP bind, just for testing purposes, I was signed into the portal within 10 seconds but as soon as I attempted to launch Active Jobs it loaded until it timed out. I rarely have any luck with my own AD account, on initial sign in it timed out several times and let me in the 5th time. Oddly enough, I almost never have issues with another test account - it works flawlessly. Even when opening Job Composer, shell, etc.

I don’t know what parts of the error log are relevant. I don’t know that this is a resource/network issue, the site is hosted on a VM running on top of local storage on an ESXI server with plenty of resource allocation. Pasting the diagnostics part of the log from one of my failed attempts to log in earlier, if any other part of this would be helpful please let me know and I can provide the entire log or anything else that would be helpful. Much appreciated.

Bl “system_wide” : {
“system_metrics” : “------------- General -------------\nKernel version : 3.10.0-1062.el7.x86_64\nUptime : 3d 19h 7m 45s\nLoad averages : 0.21%, 0.23%, 0.13%\nFork rate : unknown\n\n------------- CPU -------------\nNumber of CPUs : 8\nAverage CPU usage : 0% – 0% user, 0% nice, 0% system, 100% idle\n CPU 1 : 0% – 0% user, 0% nice, 0% system, 100% idle\n CPU 2 : 0% – 0% user, 0% nice, 0% system, 100% idle\n CPU 3 : 0% – 0% user, 0% nice, 0% system, 100% idle\n CPU 4 : 0% – 0% user, 0% nice, 0% system, 100% idle\n CPU 5 : 0% – 0% user, 0% nice, 0% system, 100% idle\n CPU 6 : 0% – 0% user, 0% nice, 0% system, 100% idle\n CPU 7 : 0% – 0% user, 0% nice, 0% system, 100% idle\n CPU 8 : 0% – 0% user, 0% nice, 0% system, 100% idle\nI/O pressure : 0%\n CPU 1 : 0%\n CPU 2 : 0%\n CPU 3 : 0%\n CPU 4 : 0%\n CPU 5 : 0%\n CPU 6 : 0%\n CPU 7 : 0%\n CPU 8 : 0%\nInterference from other VMs: 0%\n CPU 1 : 0%\n CPU 2 : 0%\n CPU 3 : 0%\n CPU 4 : 0%\n CPU 5 : 0%\n CPU 6 : 0%\n CPU 7 : 0%\n CPU 8 : 0%\n\n------------- Memory -------------\nRAM total : 15884 MB\nRAM used : 644 MB (4%)\nRAM free : 15240 MB\nSwap total : 8063 MB\nSwap used : 0 MB (0%)\nSwap free : 8063 MB\nSwap in : unknown\nSwap out : unknown\n\n”
}
},
“error” : {
“category” : “TIMEOUT_ERROR”,
“id” : “27b3fa1c”,
“problem_description_html” : “

The Phusion Passenger application server tried to start the web application, but this took too much time, so Passenger put a stop to that.

”,
“solution_description_html” : “<div class=“multiple-solutions”>

Check whether the server is low on resources

Maybe the server is currently so low on resources that all the work that needed to be done, could not finish within the given time limit. Please inspect the server resource utilization statistics in the diagnostics section to verify whether server is indeed low on resources.

If so, then either increase the spawn timeout (currently configured at 90 sec), or find a way to lower the server’s resource utilization.

Still no luck?

Please try troubleshooting the problem by studying the diagnostics reports.

”,
“summary” : “A timeout occurred while spawning an application process.”
},
“journey” : {
“steps” : {
“SPAWNING_KIT_FINISH” : {
“state” : “STEP_NOT_STARTED”
},
“SPAWNING_KIT_FORK_SUBPROCESS” : {
“begin_time” : {
“local” : “Mon Feb 3 11:01:23 2020”,
“relative” : “3m 0s ago”,
“relative_timestamp” : -180.36493100000001,
“timestamp” : 1580749283.0021019
},
“duration” : 0.001,
“end_time” : {
“local” : “Mon Feb 3 11:01:23 2020”,
“relative” : “3m 0s ago”,
“relative_timestamp” : -180.36393100000001,
“timestamp” : 1580749283.0031021
},
“state” : “STEP_PERFORMED”
},
“SPAWNING_KIT_HANDSHAKE_PERFORM” : {
“begin_time” : {
“local” : “Mon Feb 3 11:01:23 2020”,
“relative” : “3m 0s ago”,
“relative_timestamp” : -180.36393100000001,
“timestamp” : 1580749283.0031021
},
“duration” : 90.049999,
“end_time” : {
“local” : “Mon Feb 3 11:02:53 2020”,
“relative” : “1m 30s ago”,
“relative_timestamp” : -90.313931999999994,
“timestamp” : 1580749373.0531011
},
“state” : “STEP_ERRORED”
},
“SPAWNING_KIT_PREPARATION” : {
“begin_time” : {
“local” : “Mon Feb 3 11:01:23 2020”,
“relative” : “3m 0s ago”,
“relative_timestamp” : -180.36593099999999,
“timestamp” : 1580749283.001102
},
“duration” : 0.001,
“end_time” : {
“local” : “Mon Feb 3 11:01:23 2020”,
“relative” : “3m 0s ago”,
“relative_timestamp” : -180.36493100000001,
“timestamp” : 1580749283.0021019
},
“state” : “STEP_PERFORMED”
},
“SUBPROCESS_APP_LOAD_OR_EXEC” : {
“state” : “STEP_NOT_STARTED”
},
“SUBPROCESS_BEFORE_FIRST_EXEC” : {
“begin_time” : {
“local” : “Mon Feb 3 11:01:23 2020”,
“relative” : “3m 0s ago”,
“relative_timestamp” : -180.35193100000001,
“timestamp” : 1580749283.0151019
},
“duration” : 0.0,
“end_time” : {
“local” : “Mon Feb 3 11:01:23 2020”,
“relative” : “3m 0s ago”,
“relative_timestamp” : -180.35193100000001,
“timestamp” : 1580749283.0151019
},
“state” : “STEP_PERFORMED”
},
“SUBPROCESS_EXEC_WRAPPER” : {
“state” : “STEP_NOT_STARTED”
},
“SUBPROCESS_FINISH” : {
“state” : “STEP_NOT_STARTED”
},
“SUBPROCESS_LISTEN” : {
“state” : “STEP_NOT_STARTED”
},
“SUBPROCESS_OS_SHELL” : {
“state” : “STEP_NOT_STARTED”
},
“SUBPROCESS_SPAWN_ENV_SETUPPER_AFTER_SHELL” : {
“begin_time” : {
“local” : “Mon Feb 3 11:02:23 2020”,
“relative” : “2m 0s ago”,
“relative_timestamp” : -119.995931,
“timestamp” : 1580749343.3711021
},
“state” : “STEP_IN_PROGRESS”
},
“SUBPROCESS_SPAWN_ENV_SETUPPER_BEFORE_SHELL” : {
“begin_time” : {
“local” : “Mon Feb 3 11:01:23 2020”,
“relative” : “3m 0s ago”,
“relative_timestamp” : -180.35193100000001,
“timestamp” : 1580749283.0151019
},
“duration” : 60.343000000000004,
“end_time” : {
“local” : “Mon Feb 3 11:02:23 2020”,
“relative” : “2m 0s ago”,
“relative_timestamp” : -120.008931,
“timestamp” : 1580749343.3581021
},
“state” : “STEP_PERFORMED”
},
“SUBPROCESS_WRAPPER_PREPARATION” : {
“state” : “STEP_NOT_STARTED”
}
},
“type” : “SPAWN_DIRECTLY”
},
“program_name” : “Phusion Passenger”,
“short_program_name” : “Passenger”
}
;

Doesn’t look like anything to me. Indeed you’re at 100% idle and ~0.25 load average says it’s not a resource issue.

I saw you say it’s local storage, but do you have any NFS mounts enabled?

At this point (where you’re failing at) you’ve already authenticated, so how different AD accounts behave differently is strange. Maybe it’s a timing thing? Like do you ever use the test account (the one that works flawlessly) first? Or… do you use you’re own a few times, then use the test account? Maybe it’s not the different in accounts, but in time - meaning something warmed up and got cached and that’s why the test user is less flaky.

More information is required I’m afraid. I think a pstack may while it idles may help a lot. Also, there could be something in dmesg or /var/log/messages or journalctl. Some system level log may have something?

Try to start a session, login to the machine and get a pstack of that process.

Here are the processes on my machine for reference. You may not see all of them because you’re still in the boot process, but I think we’re looking for 311 (Passenger watchdog) in this example, but if you have 314 (Passenger Core) get it too. Hopefully that will shed a some light on what it’s waiting on.

4 S jeff       311     1  0  80   0 - 91556 -      17:13 ?        00:00:00 Passenger watchdog
0 S jeff       314   311  0  80   0 - 389004 -     17:13 ?        00:00:01 Passenger core
5 S root       333     1  0  80   0 - 25556 -      17:13 ?        00:00:00 nginx: master process (jeff) -c /var/lib/ondemand-nginx/config/puns/jeff.conf
5 S jeff       334   333  0  80   0 - 25557 -      17:13 ?        00:00:00 nginx: worker process

Actually, thinking about this a second time maybe it does have something to do with LDAP?

You could try switching your LDAP auth to use the users’ credentials instead of the test ones. It may or may not affect anything or have any difference, but it’s at least an avenue to test and rule AD caching out. If there is something there you could turn mod_authnz_ldap debug logs on to see timings better.

use

AuthLDAPSearchAsUser on

instead of

AuthLDAPBindDN someTestUser
AuthLDAPBindPassword someTestPassword

Hi Jeff,

My apologies for the huge delay with getting back to you, I have not looked at this problem in a while and am just getting back into it. I really appreciate your suggestions. I was playing around a bit with this today.

I started with a fresh reboot of the server and starting up the Apache service.
Then, I signed in with a known “good” user - I just created a new user under our Research Computing OU today. I am logged in immediately. When I flip over to htop and take a look at the processes, I see multiple Passenger Core and Watchdog processes for “good.user” appear immediately.

Then, as a test, I rebooted the server again to gain a fresh slate, and signed in with a user I know I have trouble with (My AD account, for one). I signed in then watched htop. It took upwards of 30 seconds-1 minute for the Passenger Core and Watchdog processes to appear under my username. Once they do appear in htop, I’m still waiting for OnDemand to load in my browser. I can wait for it in another window for a few minutes and it will eventually return with the above error.

When I tested it with the bad account, the only related processes I see are:

root		id -nG bad.user
root 		ruby -I/opt/ood/nginx_stage/lib -rnginx_stage -e NginxStage::Application.start -- pun -u bad.user -a https%3a%2f%2fondemand.domain.edu%3a443%2fnginx%2finit%3fredir%3d%24http_x_forwarded_escaped_uri
root		sudo /opt/ood/nginx_stage/sbin/nginx_stage pun -u bad.user -a https%3a%2f%2fondemand.domain.edu%3a443%2fnginx%2finit%3fredir%3d%24http_x_forwarded_escaped_uri
root		sh -c sudo /opt/ood/nginx_stage/sbin/nginx_stage pun -u 'bad.user' -a 'https%3a%2f%2fondemand.domain.edu%3a443%2fnginx%2finit%3fredir%3d%24http_x_forwarded_escaped_uri' 2>&1

I tried looking in the logs and I don’t see anything out of the ordinary but I will go back and comb through them again. I ran a tcpdump and I don’t know if it’s much relevance but on the “bad.user” accounts I see a lot of jumping between our ldap servers whereas with the “good” account it contacts our ldap server, returns immediately and works. However, like you said the issues seem to be happening after I’ve authenticated already, so it’s odd. I thought it might have been an LDAP referral issue but now I’m not so sure. When I have colleagues try it, the same thing happens but when I try it with fresh test accounts it seems to always work just fine. I didn’t return anything with pstack when I tried so that’s why I flipped over to htop.

When I do an ldapwhoami or ldapsearch I am able to return the info on the “bad” accounts immediately so I don’t think it’s an issue with my LDAP config. I had trouble with the SearchAsUser option but I tried the ldap search binding with my own credentials and was successsful.

There are no NFS mounts on the server. I did have /home mounted via NFS at one point, but since I’ve run into this issue I’ve put that on the back burner for now and I’ve been using /home directly on OnDemand.

I’m looking at getting this tied in with another auth method like CAS and hopefully that resolves the problem. And we have been going in under the “good” account to do app development. Everything else is set up and looks great. Once we get past this issue I think we’ll be in business and can move it to production.

Hi, I’m sorry for my delay as well.

The fact that id -nG bad.user is in your notes gives me pause. That hits ldap too. I’d be interested in seeing how long it takes to complete when you run that command from a shell (try as a regular user a couple times first then as root if it’s not slow) . That command should come back in almost no time, if it doesn’t then that’s our ticket.

If it does take a long time, you can strace it like this and you’ll see where all your time is being taken.
strace -ttt id -nG johrstrom > strace.log 2>&1

@jeff.ohrstrom, thanks a lot for your reply, we have been looking around at this more and have definitely found that we are running into problems with these accounts that have several group memberships. As you mentioned, ‘id’ returns immediately for the newer accounts and is pretty slow for the established accounts we’ve been having issues with.

It looks like we have some more tuning to do and we’ll be on our way to getting this resolved.

1 Like