Socket error: Address already in use

When launching OOD apps we occasionally get socket errors along the lines of “Address already in use”. They seem random, and simply relaunching the app works. Are these occasional errors to be expected, and are they just a result of how OOD assigns port numbers to apps? Is there a suggested workaround? It’s not a big deal, since these errors are rare for us (we’re more curious than anything).

Hi and welcome!

It’s strange that you’ve encountered it often enough to open this topic. What version of OOD are you running, just for reference? I just checked, and we do test whether the port is already in use before handing it out. So we somehow determined that 42652 was free before suggesting it. Though, re-reading this, there could be a bug in the way we determine whether a port is currently in use. Thanks for reaching out!
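For readers wondering how a "check first, then use" approach can still end in this error, here is a minimal Python sketch. It is not OOD's actual implementation; `port_in_use` and `pick_free_port` are hypothetical names, and the port range is an arbitrary assumption. The point is the gap between the probe and the app's own bind.

```python
import random
import socket


def port_in_use(port, host="127.0.0.1", timeout=0.5):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.settimeout(timeout)
        # connect_ex returns 0 on success instead of raising.
        return probe.connect_ex((host, port)) == 0


def pick_free_port(low=2000, high=65535, attempts=100):
    """Hypothetical helper: return a port that looked free when probed."""
    for _ in range(attempts):
        candidate = random.randint(low, high)
        if not port_in_use(candidate):
            # Nothing reserves the port here; it can be taken before the
            # caller binds it, so the caller should still handle EADDRINUSE.
            return candidate
    raise RuntimeError("no free port found")
```

A connect-based probe like this only sees listening sockets. A port that is bound but not yet listening would look free, so a probe of this kind can report "free" for a port that a later bind still rejects.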

OK, I checked in on it again, and it seems our logic is sound, though the fact you’ve run into this more than once or twice seems to indicate it’s not. I’ll keep digging.

Thanks for your response! We are running OnDemand version v1.7.14 (which we think is great by the way). I don’t have exact numbers but we’ve seen this maybe a handful of times over the past few months so it’s not a major issue for us. We were also wondering if others had seen or experienced this issue and how common it is; but it sounds very uncommon. Is it possible for two OOD apps to get assigned the same port number/address (for instance if they start at the exact same time on the same compute node)? Is that what you think happened here?
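To illustrate the scenario being asked about: if two processes were handed the same port on the same host, the second bind would fail with exactly this error. A small Python sketch follows; `second_bind_error` is a made-up name for illustration, and it says nothing about how OOD actually assigns ports.

```python
import errno
import socket


def second_bind_error(host="127.0.0.1"):
    """Bind one port twice and return the errno from the second attempt."""
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        first.bind((host, 0))          # port 0: let the OS pick a free port
        first.listen(1)
        port = first.getsockname()[1]
        try:
            second.bind((host, port))  # same address, still held by `first`
        except OSError as exc:
            return exc.errno
        return None
    finally:
        first.close()
        second.close()


if __name__ == "__main__":
    code = second_bind_error()
    print(code == errno.EADDRINUSE, errno.errorcode.get(code))
```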

Just wanted to let you know that I spoke with our Slurm admin, and it does not appear that another OOD app job started at the same time on the same compute node as the job for the OOD app that had the socket error. So that wasn’t the issue here at least.

Is it possible that there was a socket file that got left behind by a job that Slurm deleted? I think that is possible, and if the file corresponding to the socket is still there, that might cause such an error?

Sorry if I am parading my ignorance in public if that is not possible.
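For Unix domain sockets at least, the leftover-file idea is sound: the socket file outlives the process unless it is unlinked, and a later bind to the same path fails with "Address already in use". Whether that is what happened in this case is not established. The Python sketch below only demonstrates the mechanism, using a made-up path in a temporary directory.

```python
import errno
import os
import socket
import tempfile


def bind_unix(path):
    """Bind a Unix domain socket at path and return the socket."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        sock.bind(path)
    except OSError:
        sock.close()
        raise
    return sock


def stale_socket_demo():
    """Return the errno seen when binding over a leftover socket file."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.sock")  # hypothetical socket path
        bind_unix(path).close()                # close() leaves the file behind
        try:
            bind_unix(path).close()            # second bind, same path
        except OSError as exc:
            return exc.errno
        return None
```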

Not at all! Cluster admin here from the querying site – the poster and I pondered this a while today.

Is it possible that there was a socket file that got left behind by a job that Slurm deleted? I think that is possible, and if the file corresponding to the socket is still there, that might cause such an error?

That sounds like an interesting idea; I am not the primary on the OOD setup but I’m still curious. (* they are out this week *)
Any suggestions how I might check for that, or leads on how that could be investigated?
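One way to look for leftovers of this kind on a compute node, sketched in Python under a few assumptions: X servers (Xvnc included) conventionally write their PID into /tmp/.X<N>-lock and create a socket at /tmp/.X11-unix/X<N>, so a lock whose PID no longer exists suggests a stale display. `find_stale_x_locks` is a hypothetical helper, not part of OOD or TurboVNC, and it only reports; it removes nothing.

```python
import glob
import os
import re


def pid_alive(pid):
    """Return True if a process with this PID exists."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0: existence check only, nothing is sent
    except ProcessLookupError:
        return False
    except PermissionError:
        return True      # exists, but belongs to another user
    return True


def find_stale_x_locks(tmp_dir="/tmp", is_alive=pid_alive):
    """Return (display, pid, lock_path) for locks whose PID is gone."""
    stale = []
    for lock in sorted(glob.glob(os.path.join(tmp_dir, ".X*-lock"))):
        match = re.fullmatch(r"\.X(\d+)-lock", os.path.basename(lock))
        if not match:
            continue
        try:
            with open(lock) as handle:
                pid = int(handle.read().strip())
        except (OSError, ValueError):
            continue  # unreadable or malformed lock; skip rather than guess
        if not is_alive(pid):
            stale.append((int(match.group(1)), pid, lock))
    return stale
```

Running `find_stale_x_locks()` on the node where a session failed, ideally right after the failure, would show whether any display lock points at a process that no longer exists.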

I have another theory: is it possible that OOD is terminating the wrong Xvnc process, or even terminating its own? In every session that later fails, the output contains this “Xvnc process ID … already killed” message.

Comparing a “good” and “bad” session for the same user, in the “Bad” case the vnc.log stops at this line:
05/10/2020 16:18:25 VNC extension running!
with no additional info.
Non-working case:

Setting VNC password…
Starting VNC server…
Killing Xvnc process ID 20697
Xvnc process ID 20697 already killed
Xvnc did not appear to shut down cleanly. Removing /tmp/.X11-unix/X10
Xvnc did not appear to shut down cleanly. Removing /tmp/.X10-lock

Desktop ‘TurboVNC: xxxx’ started on display xxxx

Log file is vnc.log
Successfully started VNC server on c130401-ood.ll.unc.edu:5913…
Script starting…
Starting websocket server…

etc.

Working case:

Setting VNC password…
Starting VNC server…

Desktop 'TurboVNC: xxxxx started on display xxxxx
Log file is vnc.log
Successfully started VNC server on xxxxx
Script starting…
Starting websocket server…
Launching desktop ‘mate’…
WebSocket server settings:

  • Listen on :62487
  • Flash security policy server
  • No SSL/TLS support (no cert file)
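Since the working and non-working sessions differ mainly in the cleanup lines, a small script can flag affected sessions across many job output logs. This is a Python sketch that assumes the messages appear exactly as quoted above; `scan_vnc_output` is a hypothetical name.

```python
import re

KILLED = re.compile(r"Xvnc process ID (\d+) already killed")
REMOVED = re.compile(
    r"Xvnc did not appear to shut down cleanly\. Removing (\S+)"
)


def scan_vnc_output(text):
    """Collect stale-Xvnc cleanup messages from a session's output log."""
    return {
        "already_killed_pids": [int(pid) for pid in KILLED.findall(text)],
        "removed_paths": REMOVED.findall(text),
    }
```

A session whose result has non-empty lists went through the stale-display cleanup path; correlating those sessions with the ones that later hit the socket error would support or rule out the theory.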

Ah - I see it now – /tmp/.X* stuff

Alas, I don’t really. You’ve discovered the obvious VNC X socket file(s),

Xvnc did not appear to shut down cleanly. Removing /tmp/.X11-unix/X10
Xvnc did not appear to shut down cleanly. Removing /tmp/.X10-lock

the numbers will change. I don’t know of any other sockets that might get created by the web server or other processes nor what their files might be called. It seemed like it might be worth asking.

The only other thing I can think of is that some conditional is being tripped by ‘something else’, causing the script to check on the socket it has just created and reject it because it is already there…

Good luck finding the ghost in the machine!