OOD Rstudio users encountering HTTP 502 and 503 errors during sessions

Hello,

One of my users reached out to me yesterday about having his Rstudio sessions through OOD interrupted with 502 and 503 HTTP errors.

Their description of the issue:

The last few days I am often getting a 502 or 503 error message pop up - I assume it has something to do with the connection to the cluster used by OOD? When it happens I can’t write to my open files, interact, etc. Sometimes it seems to resolve itself and sometimes I need to get out of rstudio server and then back in…My entire work area becomes grayed out and I can’t interact with it all. The function I called seemed to have hung (?).

502error

It does look like httpd24 on our OOD server is logging these errors:

10.192.16.15 - ll36w [06/Nov/2019:17:46:02 -0500] “POST /rnode/c40b04/38685/events/get_events HTTP/1.1” 200 46 “https://ood:444/rnode/c40b04/38685/” “Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36”
10.192.16.15 - ll36w [06/Nov/2019:17:46:27 -0500] “POST /rnode/c40b04/38685/rpc/check_for_external_edit HTTP/1.1” 502 472 “https://ood:444/rnode/c40b04/38685/” “Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36”
10.192.16.15 - ll36w [06/Nov/2019:17:46:40 -0500] “POST /rnode/c40b04/38685/rpc/set_client_state HTTP/1.1” 502 458 “https://ood:444/rnode/c40b04/38685/” “Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36”
10.192.16.15 - ll36w [06/Nov/2019:17:46:40 -0500] “POST /rnode/c40b04/38685/rpc/set_workbench_metrics HTTP/1.1” 502 468 “https://ood:444/rnode/c40b04/38685/” “Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36”
10.192.16.15 - ll36w [06/Nov/2019:17:46:52 -0500] “POST /rnode/c40b04/38685/events/get_events HTTP/1.1” 200 46 “https://ood:444/rnode/c40b04/38685/” “Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36”

And I see similar ones occuring for other users, apparently with Rstudio as well. I don’t see any issues in the system-level logs for the processing nodes running these sessions.

The output.log for the most recent session doesn’t show anything after “Generating connection YAML file…”, and the rsession.log has nothing after the “+ exec rsession --r-libs-user…”

I suspect that the issue is with the Rstudio process itself potentially being unresponsive, perhaps while running something computationally difficult, such that requests through the proxy server time out while awaiting a response?

I’ve been trying to determine if there’s an option to tell Rstudio to do more verbose logging in the rsession.log, but so far haven’t found anything.

My hope is that someone else may have seen similar issues and has some advice.

Thanks very much.

Managed to catch an Rstudio process while it’s having problems, strace shows the rsession process apparently looping and taking ~99% of a cpu:

[pid 240319] 11:55:36.410681 <… futex resumed> ) = -1 ETIMEDOUT (Connection timed out)
[pid 240319] 11:55:36.410730 futex(0x2b88800008f0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 240319] 11:55:36.410794 stat("/home/ll36w/.rstudio/monitored", {st_mode=S_IFDIR|0755, st_size=116, …}) = 0
[pid 240319] 11:55:36.410873 read(7, 0x2b887dd58d10, 180000) = -1 EAGAIN (Resource temporarily unavailable)
[pid 240319] 11:55:36.410932 futex(0x2b888000091c, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 27813, {1573145736, 660917000}, ffffffff <unfinished …>

The read failure is trying to access an inotify filehandle; based on some quick searching, it sounds like inotify doesn’t really work with NFS by default. I’m not sure if this is an inotify/nfs issue, or perhaps a futex issue. So far I haven’t found anything about how Rstudio server uses inotify.

Glad to see you figured out your problem @cjh or at least are getting close. Is it just RStudio that’s timing out and unable to read these files or do other apps have trouble with your NFS?

Let us know if you’ve figured it out!

Thanks Jeff, so far only one user has reported the issue, and only while using RStudio (I don’t know that they use any other applications in OOD so far). From their further emails, it sounds like the problem may crop up when they kick off intensive functions.

My working theory at the moment is that the RStudio process may be too busy to respond to HTTP requests before a timeout? I haven’t investigated whether there’s an independent thread handling the web requests, but running the job with more than one processor doesn’t appear to have made any difference for the user.

It seems like my earlier theory that RStudio was caught in a loop is probably not what was happening; it seems like the inotify system calls are repeated whether Rstudio is being responsive or not, so that avenue of investigation seems not to be helpful.

That sounds likely. The screenshot you’ve given is from Rstudio itself and not from OnDemand (nginx or httpd). Plus you indicate that the CPU is at near 99%. I’d take a look at the load average during this time too. In either case, when you hear hooves, think horses not zebras. Which is to say - RStudio just being too busy to respond seems more likely than an mishandling inodes.

I tried for a little bit to get the log directory to mount to a local directory but couldn’t get it to work in the short time I’d put into it. Maybe that’s an avenue for you? Maybe RStudio will actually log why it returns 500s.

I had a user report this problem to me as well. He is doing all types of aweful things like running out of memory etc. In the end though, I am guessing his problem is that he is using parallel in Rstudio, which by the docs, you aren’t supposed to do.

1 Like