Latency issues and timeout in web application

All,

We have been having some latency issues with most accounts when signing in to the Open OnDemand portal. We are using LDAP/AD accounts for auth. Usually it times out and gives me an error:

Details:
Web application could not be started by the Phusion Passenger application server.
Please read the Passenger log file (search for the Error ID) to find the details of the error.

I seem to have more problems with my AD account than any but it’s really intermittent and I can’t seem to peg it down. I was able to sign in with a service account I created in order to do LDAP bind, just for testing purposes, I was signed into the portal within 10 seconds but as soon as I attempted to launch Active Jobs it loaded until it timed out. I rarely have any luck with my own AD account, on initial sign in it timed out several times and let me in the 5th time. Oddly enough, I almost never have issues with another test account - it works flawlessly. Even when opening Job Composer, shell, etc.

I don’t know what parts of the error log are relevant. I don’t know that this is a resource/network issue, the site is hosted on a VM running on top of local storage on an ESXI server with plenty of resource allocation. Pasting the diagnostics part of the log from one of my failed attempts to log in earlier, if any other part of this would be helpful please let me know and I can provide the entire log or anything else that would be helpful. Much appreciated.

Bl “system_wide” : {
“system_metrics” : “------------- General -------------\nKernel version : 3.10.0-1062.el7.x86_64\nUptime : 3d 19h 7m 45s\nLoad averages : 0.21%, 0.23%, 0.13%\nFork rate : unknown\n\n------------- CPU -------------\nNumber of CPUs : 8\nAverage CPU usage : 0% – 0% user, 0% nice, 0% system, 100% idle\n CPU 1 : 0% – 0% user, 0% nice, 0% system, 100% idle\n CPU 2 : 0% – 0% user, 0% nice, 0% system, 100% idle\n CPU 3 : 0% – 0% user, 0% nice, 0% system, 100% idle\n CPU 4 : 0% – 0% user, 0% nice, 0% system, 100% idle\n CPU 5 : 0% – 0% user, 0% nice, 0% system, 100% idle\n CPU 6 : 0% – 0% user, 0% nice, 0% system, 100% idle\n CPU 7 : 0% – 0% user, 0% nice, 0% system, 100% idle\n CPU 8 : 0% – 0% user, 0% nice, 0% system, 100% idle\nI/O pressure : 0%\n CPU 1 : 0%\n CPU 2 : 0%\n CPU 3 : 0%\n CPU 4 : 0%\n CPU 5 : 0%\n CPU 6 : 0%\n CPU 7 : 0%\n CPU 8 : 0%\nInterference from other VMs: 0%\n CPU 1 : 0%\n CPU 2 : 0%\n CPU 3 : 0%\n CPU 4 : 0%\n CPU 5 : 0%\n CPU 6 : 0%\n CPU 7 : 0%\n CPU 8 : 0%\n\n------------- Memory -------------\nRAM total : 15884 MB\nRAM used : 644 MB (4%)\nRAM free : 15240 MB\nSwap total : 8063 MB\nSwap used : 0 MB (0%)\nSwap free : 8063 MB\nSwap in : unknown\nSwap out : unknown\n\n”
}
},
“error” : {
“category” : “TIMEOUT_ERROR”,
“id” : “27b3fa1c”,
“problem_description_html” : “

The Phusion Passenger application server tried to start the web application, but this took too much time, so Passenger put a stop to that.

”,
“solution_description_html” : “<div class=“multiple-solutions”>

Check whether the server is low on resources

Maybe the server is currently so low on resources that all the work that needed to be done, could not finish within the given time limit. Please inspect the server resource utilization statistics in the diagnostics section to verify whether server is indeed low on resources.

If so, then either increase the spawn timeout (currently configured at 90 sec), or find a way to lower the server’s resource utilization.

Still no luck?

Please try troubleshooting the problem by studying the diagnostics reports.

”,
“summary” : “A timeout occurred while spawning an application process.”
},
“journey” : {
“steps” : {
“SPAWNING_KIT_FINISH” : {
“state” : “STEP_NOT_STARTED”
},
“SPAWNING_KIT_FORK_SUBPROCESS” : {
“begin_time” : {
“local” : “Mon Feb 3 11:01:23 2020”,
“relative” : “3m 0s ago”,
“relative_timestamp” : -180.36493100000001,
“timestamp” : 1580749283.0021019
},
“duration” : 0.001,
“end_time” : {
“local” : “Mon Feb 3 11:01:23 2020”,
“relative” : “3m 0s ago”,
“relative_timestamp” : -180.36393100000001,
“timestamp” : 1580749283.0031021
},
“state” : “STEP_PERFORMED”
},
“SPAWNING_KIT_HANDSHAKE_PERFORM” : {
“begin_time” : {
“local” : “Mon Feb 3 11:01:23 2020”,
“relative” : “3m 0s ago”,
“relative_timestamp” : -180.36393100000001,
“timestamp” : 1580749283.0031021
},
“duration” : 90.049999,
“end_time” : {
“local” : “Mon Feb 3 11:02:53 2020”,
“relative” : “1m 30s ago”,
“relative_timestamp” : -90.313931999999994,
“timestamp” : 1580749373.0531011
},
“state” : “STEP_ERRORED”
},
“SPAWNING_KIT_PREPARATION” : {
“begin_time” : {
“local” : “Mon Feb 3 11:01:23 2020”,
“relative” : “3m 0s ago”,
“relative_timestamp” : -180.36593099999999,
“timestamp” : 1580749283.001102
},
“duration” : 0.001,
“end_time” : {
“local” : “Mon Feb 3 11:01:23 2020”,
“relative” : “3m 0s ago”,
“relative_timestamp” : -180.36493100000001,
“timestamp” : 1580749283.0021019
},
“state” : “STEP_PERFORMED”
},
“SUBPROCESS_APP_LOAD_OR_EXEC” : {
“state” : “STEP_NOT_STARTED”
},
“SUBPROCESS_BEFORE_FIRST_EXEC” : {
“begin_time” : {
“local” : “Mon Feb 3 11:01:23 2020”,
“relative” : “3m 0s ago”,
“relative_timestamp” : -180.35193100000001,
“timestamp” : 1580749283.0151019
},
“duration” : 0.0,
“end_time” : {
“local” : “Mon Feb 3 11:01:23 2020”,
“relative” : “3m 0s ago”,
“relative_timestamp” : -180.35193100000001,
“timestamp” : 1580749283.0151019
},
“state” : “STEP_PERFORMED”
},
“SUBPROCESS_EXEC_WRAPPER” : {
“state” : “STEP_NOT_STARTED”
},
“SUBPROCESS_FINISH” : {
“state” : “STEP_NOT_STARTED”
},
“SUBPROCESS_LISTEN” : {
“state” : “STEP_NOT_STARTED”
},
“SUBPROCESS_OS_SHELL” : {
“state” : “STEP_NOT_STARTED”
},
“SUBPROCESS_SPAWN_ENV_SETUPPER_AFTER_SHELL” : {
“begin_time” : {
“local” : “Mon Feb 3 11:02:23 2020”,
“relative” : “2m 0s ago”,
“relative_timestamp” : -119.995931,
“timestamp” : 1580749343.3711021
},
“state” : “STEP_IN_PROGRESS”
},
“SUBPROCESS_SPAWN_ENV_SETUPPER_BEFORE_SHELL” : {
“begin_time” : {
“local” : “Mon Feb 3 11:01:23 2020”,
“relative” : “3m 0s ago”,
“relative_timestamp” : -180.35193100000001,
“timestamp” : 1580749283.0151019
},
“duration” : 60.343000000000004,
“end_time” : {
“local” : “Mon Feb 3 11:02:23 2020”,
“relative” : “2m 0s ago”,
“relative_timestamp” : -120.008931,
“timestamp” : 1580749343.3581021
},
“state” : “STEP_PERFORMED”
},
“SUBPROCESS_WRAPPER_PREPARATION” : {
“state” : “STEP_NOT_STARTED”
}
},
“type” : “SPAWN_DIRECTLY”
},
“program_name” : “Phusion Passenger”,
“short_program_name” : “Passenger”
}
;

Doesn’t look like anything to me. Indeed you’re at 100% idle and ~0.25 load average says it’s not a resource issue.

I saw you say it’s local storage, but do you have any NFS mounts enabled?

At this point (where you’re failing at) you’ve already authenticated, so how different AD accounts behave differently is strange. Maybe it’s a timing thing? Like do you ever use the test account (the one that works flawlessly) first? Or… do you use you’re own a few times, then use the test account? Maybe it’s not the different in accounts, but in time - meaning something warmed up and got cached and that’s why the test user is less flaky.

More information is required I’m afraid. I think a pstack may while it idles may help a lot. Also, there could be something in dmesg or /var/log/messages or journalctl. Some system level log may have something?

Try to start a session, login to the machine and get a pstack of that process.

Here are the processes on my machine for reference. You may not see all of them because you’re still in the boot process, but I think we’re looking for 311 (Passenger watchdog) in this example, but if you have 314 (Passenger Core) get it too. Hopefully that will shed a some light on what it’s waiting on.

4 S jeff       311     1  0  80   0 - 91556 -      17:13 ?        00:00:00 Passenger watchdog
0 S jeff       314   311  0  80   0 - 389004 -     17:13 ?        00:00:01 Passenger core
5 S root       333     1  0  80   0 - 25556 -      17:13 ?        00:00:00 nginx: master process (jeff) -c /var/lib/ondemand-nginx/config/puns/jeff.conf
5 S jeff       334   333  0  80   0 - 25557 -      17:13 ?        00:00:00 nginx: worker process

Actually, thinking about this a second time maybe it does have something to do with LDAP?

You could try switching your LDAP auth to use the users’ credentials instead of the test ones. It may or may not affect anything or have any difference, but it’s at least an avenue to test and rule AD caching out. If there is something there you could turn mod_authnz_ldap debug logs on to see timings better.

use

AuthLDAPSearchAsUser on

instead of

AuthLDAPBindDN someTestUser
AuthLDAPBindPassword someTestPassword