Interactive App completed, Slurm job remains active

A researcher had a desktop session open and was working in the terminal. They closed their laptop and moved to a new location. Upon reopening it, the dashboard resumed, but the desktop interactive app registered as ‘completed’ in the My Interactive Apps view.

Checking the job status in Slurm showed that the job supporting the interactive desktop was still running.

I could use some help understanding where to look to debug/understand the situation.
Thanks
~ Em

I would look in /var/log/ondemand-nginx/$USER/error.log for failed squeue command output (if you’re running Slurm).
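Roughly the kind of scan I have in mind, as a sketch: `someuser` is a placeholder account, and `slurm_load_jobs` is one error string I’d expect to see when the query to Slurm fails.

```python
# Sketch: scan a user's PUN error log for failed squeue / Slurm calls.
# "someuser" is a placeholder; the log may also be rotated
# (e.g. error.log-YYYYMMDD.gz), in which case adjust the path.
import re
from pathlib import Path

log = Path("/var/log/ondemand-nginx/someuser/error.log")
pattern = re.compile(r"squeue|slurm_load_jobs", re.IGNORECASE)

if log.exists():
    for line in log.read_text(errors="replace").splitlines():
        if pattern.search(line):
            print(line)
```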

My guess is that there was some error while OOD was trying to determine if the job was still running, and it defaults to saying the job is complete when there’s an error. I’m not sure offhand why moving would cause this.

When the researcher reconnected, OOD would have queried for the job. But the fact that the researcher has a new IP or whatever should have had no effect on this query. I’m guessing (wildly) that it was just coincidence. The job shouldn’t have been marked ‘completed’, but I don’t think moving caused this. I think it was some error on the OOD server itself that just happened to occur at the same time the researcher moved. Had they stayed, they might have seen the same thing.

Hi, Jeff –

Indeed, errors in /var/log/ondemand-nginx/error.log-20201113.gz
open() “/var/run/ondemand-nginx/ljr61/passenger.pid” failed (2: No such file or directory)

And this is the researcher account reporting the problem.
That researcher also still has two “nginx: master” processes active on the ondemand node.
I was thinking that, after receiving confirmation that they are finished using OnDemand, I would check again and remove any lingering “nginx: master” processes. OOD seems to do a nice job on clean-up; nonetheless, being hands-on in this case may be useful.

The only other information that may be worth sharing is that the researcher is using a Mac. I don’t know if the Safari cautions that I recall from early releases still apply. This info is taken from /var/log/httpd24/ and the current _error_ssl.log:
req_user_agent=“Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1.2 Safari/605.1.15”

Thanks

I’ve hesitated to act on the account having two nginx master processes. nginx_stage nginx_list only seems to report the accounts, not the number of nginx master processes per account.

What is the graceful way to stop the older (orphaned?) nginx master process?

/opt/ood/nginx_stage/sbin/nginx_stage nginx_clean --force --user <theusername> should clear out the valid PUN and maybe even the invalid one. I’m thinking you should get rid of the good and the bad just to reset everything.

I did a bit of testing and this is what I came up with. Here’s the process tree for this example.

4 S jeff         590       1  1  80   0 - 87045 -      20:09 ?        00:00:00 Passenger watchdog
0 S jeff         593     590  1  80   0 - 378594 -     20:09 ?        00:00:00 Passenger core
5 S root         603       1  0  80   0 - 23059 -      20:09 ?        00:00:00 nginx: master process (jeff) -c /var/lib/ondemand-nginx/config/puns/jeff.conf
5 S jeff         613     603  0  80   0 - 26701 -      20:09 ?        00:00:00 nginx: worker process
0 S jeff         617     593 56  80   0 - 78211 -      20:09 ?        00:00:02 ruby /opt/rh/ondemand/root/usr/share/passenger/helper-scripts/rack-loader.rb

I killed 603 (with a plain kill, which sends SIGTERM by default, not a forced kill -9 SIGKILL).

That stopped nginx, but also prompted my Passenger processes to clean themselves up. I think that’s the important bit: tell the process stacks to stop rather than wiping them all out by force, so they get a chance to clean up after themselves.

You can see here the watchdog moved from PID 590 to 659 while cleaning up.

0 S jeff         593       1  0  80   0 - 371853 -     20:09 ?        00:00:00 Passenger core
1 S jeff         659       1  0  80   0 - 87045 -      20:09 ?        00:00:00 PassengerWatchdog (cleaning up...)

At that point, everything was gone and I could safely reconnect.

I think in the very worst case, if you force killed everything that user had on the server you’d still be OK, but you may have to also remove the socket file (/var/run/ondemand-nginx/jeff/passenger.sock for me, you can see my username there) and let everything come back afterwards. There’s very little state held within OOD (it queries the scheduler for most things) so there’s not much, if anything, to lose in stopping all the processes.
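Put together, the graceful stop might look something like this. This is just a sketch, not OOD tooling: the pgrep pattern and PID-finding are illustrative, so double-check the PID against ps before killing anything real.

```python
# Sketch: find an orphaned PUN master for a user and send it SIGTERM
# (a plain kill), NOT SIGKILL, so Passenger can clean up after itself.
import os
import signal
import subprocess
from typing import Optional

def find_pun_master(user: str) -> Optional[int]:
    """Return the PID of 'nginx: master process (user)', if one exists."""
    out = subprocess.run(
        ["pgrep", "-f", f"nginx: master process \\({user}\\)"],
        capture_output=True, text=True,
    )
    pids = [int(p) for p in out.stdout.split()]
    return pids[0] if pids else None

def graceful_stop(pid: int) -> None:
    """Plain kill: TERM, not -9/KILL, so the stack can shut itself down."""
    os.kill(pid, signal.SIGTERM)
```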

Thanks, Jeff. This is helpful (even if I feel I’ve asked you this before). I’m coordinating with the researcher so as to avoid clearing out the active nginx master process. Your response 5 does allow me to proceed with removing this detached nginx:master.

Speaking of detached, can you think of any way to “recover” a slurm jobid to ondemand? As a reminder, the researcher had an interactive desktop running associated with the nginx:master that is now “adrift”. We’ve implemented long duration sessions through ood, and there was a correspondingly long-running process at work in the “untethered” slurm job.

I’ve tried to learn more about nginx and Passenger processes, but I’m not yet able to follow, on my OOD node, which processes are allocated to support the various Passenger and interactive apps. Ideally, I’d like to see (using ‘ps aux’ or a more appropriate set of flags) a tree of processes associated with the requests from each account.
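The closest I’ve gotten so far is shelling out to ps with procps’s --forest flag, along these lines (a sketch; the column list is just what I’ve found useful):

```python
# Sketch: show a per-user process tree using procps's --forest flag.
import getpass
import subprocess

def user_process_tree(user: str) -> str:
    """Return an indented tree of the given user's processes."""
    result = subprocess.run(
        ["ps", "-u", user, "-o", "pid,ppid,stat,etime,cmd", "--forest"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

print(user_process_tree(getpass.getuser()))
```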

Any guidance would be appreciated – thanks again for the information you’ve already shared!

Sure, you can recover it given some detective work. Of course, after typing all of this out, I now realize you may just have a backup of this file if you do routine backups. If you do, then that’s an option as well. Read through the instructions and you’ll see which file you’re looking for that you can restore.

Here’s how:

  1. First you need to find the UUID of the session the researcher lost.
    Let’s imagine we’re looking for a session I lost from our app bc_osc_jupyter. I’d look in /users/PZS0714/johrstrom/ondemand/data/sys/dashboard/batch_connect/sys/bc_osc_jupyter/output (my own home directory, but there’s the app name in the path). Here you’ll see a list of sub-directories that are all UUIDs. I’m guessing it’s the last one, but you may have to ask or determine which one it is by the timestamp of the files. The job_script_options.json or user_defined_context.json may give you clues as to which one it is. Whichever directory is the session you want to recover, that’s the UUID we’ll use moving forward.
  2. Launch the same app yourself
  3. You’ll see a file written to ~/ondemand/data/sys/dashboard/batch_connect/db. It’s some UUID, your session for your app launch.
  4. Copy this file to some other location while you edit it, and rename it to the UUID we recovered in step 1. Copying it to another location ensures OOD doesn’t do anything with it while you’re editing. You’ll notice that it’s JSON, so it’s safe to pretty print the new file to help you edit.
  5. Change the id field to be the same UUID we picked up in #1
  6. Change the created_at time to be your best guess (you can probably pick this time up from the scheduler’s submit time).
  7. Change the job_id to be the correct job_id you want to recover.
  8. Copy this new edited file to the researcher’s ~/ondemand/data/sys/dashboard/batch_connect/db, keeping the same filename (the UUID found in step 1). Also make sure it has 644 permissions and the same owner:group their other files have.

At that point, the researcher’s OOD should see this file and show it as an active app.
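If it helps, steps 4 through 8 can be sketched in a few lines. The field names (id, created_at, job_id) are the ones from the steps above; the function and its arguments are otherwise illustrative.

```python
# Sketch of steps 4-8: rewrite a copied template session file so it
# points at the recovered UUID and Slurm job, then place it in the db dir.
import json
from pathlib import Path

def retarget_session(template: Path, db_dir: Path,
                     uuid: str, job_id: str, created_at: int) -> Path:
    data = json.loads(template.read_text())
    data["id"] = uuid                 # step 5: the UUID recovered in step 1
    data["created_at"] = created_at   # step 6: best guess (scheduler submit time)
    data["job_id"] = job_id           # step 7: the job you want to recover
    dest = db_dir / uuid              # step 8: the filename is the UUID itself
    dest.write_text(json.dumps(data, indent=2))  # pretty-printed JSON is fine
    dest.chmod(0o644)                 # step 8: 644 permissions
    return dest
```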

OH! Sorry, I just realized you had initially said the job had been marked ‘completed’, so the file in the researcher’s directory may still exist. If that’s the case, there’s no need to go through all the trouble of digging up which UUID session it is and so on.

If the card says ‘completed’ and the file is still there, then it’s easy to recover. Simply edit the file so that "cache_completed": null (it’s likely set to true right now). Once OOD sees this it’ll start to query Slurm for the job and see that it’s still active.
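In code terms, that one-field edit might look like this sketch; only cache_completed is touched, everything else in the file is left alone.

```python
# Sketch of the simpler fix: set cache_completed back to JSON null in the
# existing session file so OOD re-queries Slurm for the job's real state.
import json
from pathlib import Path

def reset_completed(db_file: Path) -> None:
    data = json.loads(db_file.read_text())
    data["cache_completed"] = None    # Python None serializes as JSON null
    db_file.write_text(json.dumps(data))
```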

Hi, Jeff –

No, no – thanks on both accounts! I’ve just worked through the earlier email showing details of how to proceed.
I really appreciate the guided-tour approach, as it helps to bring together some of the disconnected bits of ‘knowledge’ gained from trawling through my own ‘user’ OOD file structure over time.

To paraphrase the idea for how to recover ood control of a running job id,
– When there are files remaining in ~/ondemand/data/sys/dashboard/batch_connect/db/, the connection can be revived directly by editing that existing db file.
– In the opposite situation, a new instance of the expired application must be created, so there is a unique and active UUID that can accept ‘custody’ of the still-running Slurm job ID through the appropriate edits.

And then, I’m still left confused as to how the status of the desktop job was changed from ‘null’ to ‘completed’. While I can imagine someone inadvertently deleting the desktop session explicitly, that would be expected to also terminate the associated Slurm job. I’m glad to know, however, that an underlying Slurm job might survive in cases where there are ‘glitches’ in the HTTP/WS operations. Perhaps you can tell that I’m trying to associate this situation with the architecture diagram provided in the OOD docs:
https://osc.github.io/ood-documentation/master/architecture.html#overview

Anyway, thanks very much. I’ll follow-up with my test system and see what I can do to exercise these ideas.
Cheers
~ Em

I believe we erroneously marked it completed because the squeue call failed for one reason or another. Regardless of the client’s location, the communication between the OOD server and Slurm failed and we interpreted that as Slurm saying there is no job, so it must be completed.

You can see this comment from a user in a similar situation.

So this is why we started keeping the cards when they ‘complete’, with a subsequent patch for the Slurm adapter to mark the job ‘undetermined’ if the query fails. But we only accounted for one scenario/squeue output, so if you happen to find other output similar to slurm_load_jobs error in the user’s nginx error.log (or perhaps the Slurm logs too?), we’re very interested in capturing that case as well. This way the cards can sit in an undetermined state for a bit but eventually come back to being marked as running, without external intervention by admins or the user.

Hi, Jeff –

I tried the standard ‘kill’ against the by-now quite old nginx: master. The Passenger watchdog regenerated a new nginx master, as well as the ruby process (which was missing in the earlier set).

Before ‘kill’:
12756 0.0 0.0 394212 8496 ? Ssl Nov12 0:06 Passenger watchdog
12759 0.0 0.0 7672448 35420 ? Sl Nov12 5:50 Passenger core
root 12798 0.0 0.0 119396 3284 ? Ss Nov12 0:00 nginx: master process () -c /var/lib/ondemand-nginx/config/puns/.conf
12808 0.0 0.0 130176 4216 ? S Nov12 0:00 nginx: worker process

after ‘kill’:
46936 0.0 0.0 394212 8440 ? Ssl 12:32 0:00 Passenger watchdog
46939 0.8 0.0 7672448 35996 ? Sl 12:32 0:00 Passenger core
root 46967 0.0 0.0 119396 3284 ? Ss 12:32 0:00 nginx: master process () -c /var/lib/ondemand-nginx/config/puns/.conf
46979 0.0 0.0 130176 4212 ? S 12:32 0:00 nginx: worker process
47051 2.8 0.0 612956 70824 ? Sl 12:32 0:01 ruby /opt/rh/ondemand/root/usr/share/passenger/helper-scripts/rack-loader.rb

I’ll have to communicate with the researcher and arrange to end all the processes, it would seem.
Cheers