Issues with nginx error (98: Address already in use)

All,

I’ve seen some similar discussion on the topic but so far we haven’t been able to resolve this in our environment. A handful of our OnDemand users get this error every single time they log in:

Error – /opt/ood/nginx_stage/lib/nginx_stage/views/pun_config_view.rb:110: warning: Using the last argument as keyword parameters is deprecated; maybe ** should be added to the call
/opt/ood/nginx_stage/lib/nginx_stage/configuration.rb:244: warning: The called method `app_config_path’ is defined here
nginx: [emerg] bind() to unix:/var/run/ondemand-nginx/first.last/passenger.sock failed (98: Address already in use)
nginx: [emerg] bind() to unix:/var/run/ondemand-nginx/first.last/passenger.sock failed (98: Address already in use)
nginx: [emerg] bind() to unix:/var/run/ondemand-nginx/first.last/passenger.sock failed (98: Address already in use)
nginx: [emerg] bind() to unix:/var/run/ondemand-nginx/first.last/passenger.sock failed (98: Address already in use)
nginx: [emerg] bind() to unix:/var/run/ondemand-nginx/first.last/passenger.sock failed (98: Address already in use)
nginx: [emerg] still could not bind()

This is affecting most of the users we’ve opened OnDemand up to so far (probably 3/4 of people have this issue). The unaffected folks are able to log in each time without any issue however.

I’ve tried restarting the OOD server, removing the directories for the affected users under /var/run/ondemand-nginx but that does not work either.

I see when observing htop that the passenger process starts for these affected users but then they are unable to log in with the above error.

Any ideas are greatly appreciated.

Yea this one is a bit tough to track down. We’re not quite sure why this happens. But it seems to imply that a PUN started but we’re somehow not able to recognize that the thing that’s listening to /var/run/ondemand-nginx/first.last/passenger.sock isn’t running. I suspect it may be ulimits for those users. I’d spot check that, if there’s anything special about that 1/4 of your users. Though that may just be a guess.

Other than that, you may Here are some helpers to clear folk’s PUNs. These commands should clear the files in /var/run and stop any PUN a little bit more safely than say just killing processes.

# clear everyone's PUNs
/opt/ood/nginx_stage/sbin/nginx_stage nginx_clean

# Force clean everyone's PUNs
/opt/ood/nginx_stage/sbin/nginx_stage nginx_clean --force

# clean a PUN for the specific user jeff
/opt/ood/nginx_stage/sbin/nginx_stage nginx_clean --user jeff 

# Force clean jeff's PUN 
/opt/ood/nginx_stage/sbin/nginx_stage nginx_clean --user jeff --force

If this happens again, can you tell us the state of the PUN for the given user? Here’s mine now.

It seems like in this case, there’s already a PUN running but somehow we’re unable to recognize that. So info on the running PUN may be helpful in determining why we’re unable to recognize it.

bash-4.4$ sudo cat /var/run/ondemand-nginx/jeff/passenger.pid
667
bash-4.4$ ps -lf 667
F S UID          PID    PPID  C PRI  NI ADDR SZ WCHAN  STIME TTY        TIME CMD
5 S root         667       1  0  80   0 - 22931 -      15:00 ?          0:00 nginx: master process (jeff) -c /var/lib/ondemand-nginx/config/puns/jeff.conf
bash-4.4$ sudo lsof /var/run/ondemand-nginx/jeff/passenger.sock
COMMAND PID USER   FD   TYPE             DEVICE SIZE/OFF   NODE NAME
nginx   667 root    6u  unix 0x000000005d929fb6      0t0 368120 /var/run/ondemand-nginx/jeff/passenger.sock type=STREAM
nginx   668 jeff    6u  unix 0x000000005d929fb6      0t0 368120 /var/run/ondemand-nginx/jeff/passenger.sock type=STREAM
bash-4.4$ ps -lf 668
F S UID          PID    PPID  C PRI  NI ADDR SZ WCHAN  STIME TTY        TIME CMD
5 S jeff         668     667  0  80   0 - 26575 -      15:00 ?          0:00 nginx: worker process

Do you happen to any interesting logs entries in /var/log/ondemand-nginx/error.log or /var/log/ondemand-nginx/first.last/error.log?

Jeff,

I appreciate the quick reply. I ran those commands and get the following:

[root@ondemand ondemand-nginx]# sudo cat /var/run/ondemand-nginx/first.last/passenger.pid
16591
[root@ondemand ondemand-nginx]# ps -lf 16591
F S UID        PID  PPID  C PRI  NI ADDR SZ WCHAN  STIME TTY        TIME CMD
5 S root     16591     1  0  80   0 - 29710 sigsus 10:15 ?          0:00 nginx: master process (first.last) -c /var/lib/ondemand-nginx/config/puns/first.last.conf
[root@ondemand ondemand-nginx]# sudo lsof /var/run/ondemand-nginx/first.last/passenger.sock
COMMAND   PID         USER   FD   TYPE             DEVICE SIZE/OFF     NODE NAME
nginx   16591         root    8u  unix 0xffff9eb61ffb9100      0t0 12732705 /var/run/ondemand-nginx/first.last/passenger.sock
nginx   16594 first.last    8u  unix 0xffff9eb61ffb9100      0t0 12732705 /var/run/ondemand-nginx/first.last/passenger.sock
[root@ondemand ondemand-nginx]# ps -lf 16594
F S UID        PID  PPID  C PRI  NI ADDR SZ WCHAN  STIME TTY        TIME CMD
5 S first.la+ 16594 16591  0  80   0 - 32405 ep_pol 10:15 ?          0:00 nginx: worker process

I only see this in /var/log/ondemand-nginx/error.log:

[root@ondemand ondemand-nginx]# cat error.log
2021/08/17 10:00:07 [notice] 14177#0: signal process started
2021/08/17 10:12:42 [notice] 16174#0: signal process started

Then, in /var/log/ondemand-nginx/first.last/error.log:

[root@ondemand first.last]# cat error.log
[ N 2021-08-17 10:15:10.3272 16565/T1 age/Wat/WatchdogMain.cpp:1373 ]: Starting Passenger watchdog...
[ N 2021-08-17 10:15:10.3750 16568/T1 age/Cor/CoreMain.cpp:1340 ]: Starting Passenger core...
[ N 2021-08-17 10:15:10.3754 16568/T1 age/Cor/CoreMain.cpp:256 ]: Passenger core running in multi-application mode.
[ N 2021-08-17 10:15:10.3964 16568/T1 age/Cor/CoreMain.cpp:1015 ]: Passenger core online, PID 16568
2021/08/17 10:15:11 [emerg] 16617#0: bind() to unix:/var/run/ondemand-nginx/first.last/passenger.sock failed (98: Address already in use)
2021/08/17 10:15:11 [emerg] 16617#0: bind() to unix:/var/run/ondemand-nginx/first.last/passenger.sock failed (98: Address already in use)
2021/08/17 10:15:11 [emerg] 16617#0: bind() to unix:/var/run/ondemand-nginx/first.last/passenger.sock failed (98: Address already in use)
[ N 2021-08-17 10:15:13.0904 16568/T5 age/Cor/SecurityUpdateChecker.h:519 ]: Security update check: no update found (next check in 24 hours)
2021/08/17 10:15:11 [emerg] 16617#0: bind() to unix:/var/run/ondemand-nginx/first.last/passenger.sock failed (98: Address already in use)
2021/08/17 10:15:11 [emerg] 16617#0: bind() to unix:/var/run/ondemand-nginx/first.last/passenger.sock failed (98: Address already in use)
2021/08/17 10:15:11 [emerg] 16617#0: still could not bind()
2021/08/17 10:15:15 [emerg] 16652#0: bind() to unix:/var/run/ondemand-nginx/first.last/passenger.sock failed (98: Address already in use)
2021/08/17 10:15:15 [emerg] 16652#0: bind() to unix:/var/run/ondemand-nginx/first.last/passenger.sock failed (98: Address already in use)
2021/08/17 10:15:15 [emerg] 16652#0: bind() to unix:/var/run/ondemand-nginx/first.last/passenger.sock failed (98: Address already in use)
2021/08/17 10:15:15 [emerg] 16652#0: bind() to unix:/var/run/ondemand-nginx/first.last/passenger.sock failed (98: Address already in use)
2021/08/17 10:15:15 [emerg] 16652#0: bind() to unix:/var/run/ondemand-nginx/first.last/passenger.sock failed (98: Address already in use)
2021/08/17 10:15:15 [emerg] 16652#0: still could not bind()
2021/08/17 10:15:19 [emerg] 16686#0: bind() to unix:/var/run/ondemand-nginx/first.last/passenger.sock failed (98: Address already in use)
2021/08/17 10:15:19 [emerg] 16686#0: bind() to unix:/var/run/ondemand-nginx/first.last/passenger.sock failed (98: Address already in use)
2021/08/17 10:15:19 [emerg] 16686#0: bind() to unix:/var/run/ondemand-nginx/first.last/passenger.sock failed (98: Address already in use)
2021/08/17 10:15:19 [emerg] 16686#0: bind() to unix:/var/run/ondemand-nginx/first.last/passenger.sock failed (98: Address already in use)
2021/08/17 10:15:19 [emerg] 16686#0: bind() to unix:/var/run/ondemand-nginx/first.last/passenger.sock failed (98: Address already in use)
2021/08/17 10:15:19 [emerg] 16686#0: still could not bind()
2021/08/17 10:15:23 [emerg] 16721#0: bind() to unix:/var/run/ondemand-nginx/first.last/passenger.sock failed (98: Address already in use)
2021/08/17 10:15:23 [emerg] 16721#0: bind() to unix:/var/run/ondemand-nginx/first.last/passenger.sock failed (98: Address already in use)
2021/08/17 10:15:23 [emerg] 16721#0: bind() to unix:/var/run/ondemand-nginx/first.last/passenger.sock failed (98: Address already in use)
2021/08/17 10:15:23 [emerg] 16721#0: bind() to unix:/var/run/ondemand-nginx/first.last/passenger.sock failed (98: Address already in use)
2021/08/17 10:15:23 [emerg] 16721#0: bind() to unix:/var/run/ondemand-nginx/first.last/passenger.sock failed (98: Address already in use)
2021/08/17 10:15:23 [emerg] 16721#0: still could not bind()

Does this seem to be an issue with all versions of OnDemand? I don’t think we ever saw this in 1.8 so kind of stumped. We are on 2.0.10 now. Wondering if it’s worth trying to roll back, not exactly sure what to try next.

I’d be sad if you rolled back, but clearly that’s an option you have to keep open if it’s just not working for a large section of your users. From 1.8 to 2.0 we updated the nginx binary and modified this code slightly. If you’re using pun_root_pre_hook functionality you may be hitting new code blocks. Otherwise you should be hitting the same code as 1.8 (sans the newer version of nginx).

Do you have a development/test instance that you can safely modify the source code on?

Are you able to clean the PUN then start it up manually? What’s the output look like here? Can you prepend a time to see how long it takes the PUN to start for one of those 3/4 users?

[root@3c4ffd7a2513 ~]# /opt/ood/nginx_stage/sbin/nginx_stage nginx_clean --user jeff
jeff
[root@3c4ffd7a2513 ~]# /opt/ood/nginx_stage/sbin/nginx_stage pun --user jeff

Jeff,

Unfortunately no luck when using the nginx_clean commands individually either. No matter what I’ve tried I can’t seem to get any of those users working, even restarting the server makes no difference. It’s strange because it’s always the same affected users, and always the same users who are able to get in.

We have this running on a VM but I could snapshot and make some changes to see if we could get this to work.

Sorry - you can’t force clean and start the PUN manually like this? I guess I’m trying to limit the scope of investigation from apache and the actual PUN startup. If you can’t start it manually, then apache’s fine and we can hone in on why nginx is not starting.

[root@3c4ffd7a2513 ~]# /opt/ood/nginx_stage/sbin/nginx_stage nginx_clean --user jeff
jeff
[root@3c4ffd7a2513 ~]# /opt/ood/nginx_stage/sbin/nginx_stage pun --user jeff

Jeff, sorry I misunderstood. Yes we are able to start the PUN manually like that. I started an affected user’s PUN then had him try connecting to it and he still receives the same issue as before.

Sorry - I’m held up with classroom stuff at OSC. But I would like to solve this because it’s tricky to replicate and we (maintainers) don’t know why it happens in the first place - let alone how to resolve it.

Is there any way we can meet over Zoom (or something else) and triage this? Do you have a test user available to replicate (like a johrstromtest I have at OSC). DM me with the meeting info if that’s a possibility.

In the interim - did you check ulimits for the users who this fails for? are they any different than ulimits for the users this does work for? You said you upgraded from 1.8, is it the same VM. I mean did you in place upgrade the VM or roll an entirely new machine? Lastly - are there any selinux policy issues being reported in /var/log/audit/?

Jeff,

That would be greatly appreciated. It likely won’t be until next week sometime, would that work for you? I do have the ability to create test users but our test accounts we tried didn’t have issues. I’d like to see if we could get a person or two on the team having issues to hop on the call and we can run through it.

This was an in-place upgrade from 1.8 and the same VM. We disabled selinux on this particular VM and nothing really to note in /var/log/audit.

You don’t see anything in /var/log/audit - do you happen to have SELinux enabled?

Jeff,

An update that’s kind of interesting - spun up a fresh VM with latest version of OnDemand on it and still appears to be the same error for the same users. It must be something to do with our environment then I’d assume but what, I’m not sure. As time allows I’ll try it with 1.8 and see if we get the same issue.

That’s sad news to be sure. At this point rolling back to 1.8 and keeping a VM with 2.0 to troubleshoot is your likely your best option for your user base.

I had a thought about permissions. The issue is it’s unable to recognize that socket file as existing - even though it clearly does.

Naively - I’m wondering if there’s a permission issue somehow (not clear how as it’s the apache user who needs to recognize these files and it can see yours but not others).

I believe we updated NGINX from 1.8 to 2.0 - so I’ll lookup what versions applied to what OnDemand versions.

Jeff,

Another update, we are back on 1.8 and still the same issue affecting the same users. This is another fresh install from scratch. I’m not too sure what to take a look at next or as far as permissions. If you’d have time, or if it would be helpful to you all we could send you creds to log in and take a look at these test boxes we have set up and where it’s still failing. I’ll do my best on our end but admittedly I am short on ideas right now. If you have any updates or things to try, let me know and I can report back.

Let’s spot check this directory. This is what mine looks like in my container. What I’m looking for is if the $USER directory here is 700 apache:root like mine is for an erroneous user.

[jeff@84c0064d1840 ~]$ sudo ls /var/run/ondemand-nginx/ -lrt
total 8
drwx------. 2 apache root 4096 Aug 25 19:00 jeff
[jeff@84c0064d1840 ~]$ sudo ls /var/run/ondemand-nginx/jeff/ -lrt
total 4
srw-rw-rw-. 1 root root 0 Aug 25 19:00 passenger.sock
-rw-r--r--. 1 root root 4 Aug 25 19:00 passenger.pid

Of course I’m happy to help in any way I can, but I am hesitant to request sudo privileges on your machines as a lot of this debugging would likely require it.

I’ll continue to look into how these directories and files get initialized if there’s anything that’s userbased there.

This, I think, is the big clue we have: It always affects the same users. Some users it always works for and others it never works for. Let’s follow this lead. What’s the difference between these users? What do they have in common with each other that’s different from the users that do work?

I’m also wondering about the SSSD/LDAP change you had to make. Can these users login when that change is reverted (even if it takes ~2-3 minutes)?

Jeff, I was thinking that too. I rolled back the changes we made to SSSD and restarted the service and the issue is still there. It’s really strange.

Jeff, so after some tinkering we moved from ADFS to basic LDAP auth and now it’s working. I have no idea why right now, but it seems to be something going on with our ADFS setup then. Everyone who had issues before can now sign in that it’s basic auth. I’ll see if our ADFS admin can pull any logs, but thought I’d update you on what we found out.

1 Like