Interactive jobs "disappearing"

We’re running Open OnDemand 1.7 with Slurm.

Occasionally users report that their interactive jobs are no longer listed in the portal. The job is still running in Slurm, and you can still see the job ID in the job tracker.

We’ve found that when this happens, the interactive job data is missing from:
~USERNAME/ondemand/data/sys/dashboard/batch_connect/db

We can see files for other jobs in that directory, but the “missing” jobs have no matching entry.

How does that directory get managed? My supposition is that at some point, the user web-server process is reaped, and when they return, maybe slurm is slow to respond and so something partially cleans the directory… Does that sound plausible? Or is there something else in play here?

Thanks

We would like to know about this too, as we’ve seen something
similar here, also with Slurm, and I believe we saw it with both Open
OnDemand 1.6 and 1.7.

Is this happening with specific interactive apps? That is, do you see a pattern in which interactive jobs are affected?

This sounds like a bug with the Slurm adapter. Perhaps it is https://github.com/OSC/ood_core/issues/149. The code that is used to check the status of the job is https://github.com/OSC/ood_core/blob/5e24c28d1a2d5801335fc2a00e51f31fe2e122e2/lib/ood_core/job/adapters/slurm.rb#L456-L474. Essentially it boils down to:

  1. Call squeue with the job id as the argument
  2. If no job data is returned and the exit status is 0, assume job is completed
  3. If exit status is not 0 and output message contains the text “Invalid job id specified”, assume job is completed
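
The three steps above can be sketched roughly as follows. This is a simplified illustration, not the ood_core code itself: `SqueueResult` and `job_state` are hypothetical names, and the real adapter parses the squeue output into job info objects instead of returning bare symbols.

```ruby
# Hypothetical stand-in for the outcome of one squeue call (not an ood_core class).
SqueueResult = Struct.new(:stdout, :stderr, :exit_status)

# Classify a job from a single squeue call, following the three steps above.
def job_state(result)
  if result.exit_status.zero?
    # Step 2: exit status 0 with no job rows is read as "completed".
    result.stdout.strip.empty? ? :completed : :active
  elsif result.stderr.include?("Invalid job id specified")
    # Step 3: Slurm no longer knows the id, which is also read as "completed".
    :completed
  else
    # Any other failure is surfaced as an error rather than a job state.
    raise "squeue failed: #{result.stderr}"
  end
end

# Exit status 0, empty stdout, and something on stderr (placeholder text,
# not a real Slurm message) still lands in the "completed" branch.
job_state(SqueueResult.new("", "placeholder warning", 0)) # => :completed
```

The last line shows the kind of edge case in question: nothing in this logic looks at stderr when the exit status is 0.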

It is likely there are edge cases where the above algorithm produces a “completed” state for a job that has not actually completed. In 1.6 and 1.7, the card would then disappear while the job kept running, as you report. In 1.8, the card would stick around for debugging purposes, but display as “completed” and not change back to the “running” state.

Do you have recommendations for a better algorithm to determine the completed state of a job? In https://github.com/OSC/ood_core/issues/149 @jeff.ohrstrom suggests that we treat any output on standard error as an error, even if the exit status is 0. Right now, if the exit status is 0, we ignore anything in stderr:

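  # Tail of the method that runs the Slurm command (its opening lines are not shown here)
  o, e, s = Open3.capture3(env, cmd, *(args.map(&:to_s)), stdin_data: stdin.to_s)
  # stderr (e) is only used when the exit status is non-zero
  s.success? ? o : raise(Error, e)
end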
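
For comparison, here is a minimal sketch of the stricter behaviour suggested in the issue, where any stderr output counts as a failure even on exit status 0. `strict_call` and `CommandError` are hypothetical names for illustration and are not part of ood_core.

```ruby
require "open3"

# Hypothetical error class standing in for the adapter's own Error.
class CommandError < StandardError; end

# Run a command and treat any stderr output as a failure, even on exit status 0.
def strict_call(cmd, *args, env: {}, stdin: "")
  o, e, s = Open3.capture3(env, cmd, *args.map(&:to_s), stdin_data: stdin.to_s)
  # Fail on a non-zero exit status OR on anything written to stderr.
  raise CommandError, e unless s.success? && e.strip.empty?
  o
end
```

The trade-off is that harmless warnings on stderr would also become errors, so whether this is acceptable depends on how noisy the Slurm commands are at a given site.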

Hmm… in our case I think it is mostly Matlab, though I haven’t dug into all the tickets we’ve had internally.

Yes, I could believe it might be related to this issue. I know from parsing some of the other Slurm info in some internal tooling that it’s not exactly easy.

Maybe the REST API for Slurm would make this easier, but I suspect that would require substantial reworking!

Thanks for the pointers, will see if we can obtain some more debug data from squeue. This might be tricky if it’s transient timeouts though.

We haven’t noticed a pattern as far as applications go. I think we have
seen this with Matlab, RStudio, Jupyter, and with Remote Desktop,
which is pretty much everything we have.

We have seen timeouts with Slurm commands, and I am not sure where we
are with resolving those. The period during which the timeouts occur
is usually short, only a few minutes, then Slurm starts responding
again. If OOD is polling during one of those short outages, then that
would be quite plausible as an explanation.
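
If polling during a short outage is indeed the cause, one possible mitigation is to require several consecutive “completed” observations before acting on them. The sketch below is only an illustration of that idea: `CompletionDebouncer`, its `needed` parameter, and the `:undetermined` placeholder are assumptions, not existing OOD behaviour.

```ruby
# Hypothetical helper: report :completed only after `needed` consecutive polls agree.
class CompletionDebouncer
  def initialize(needed: 3)
    @needed = needed
    @streak = Hash.new(0) # job id => consecutive :completed observations
  end

  # observed is the state seen in a single poll (:completed or anything else).
  def confirm(job_id, observed)
    if observed == :completed
      @streak[job_id] += 1
      @streak[job_id] >= @needed ? :completed : :undetermined
    else
      # Any other state resets the streak for this job.
      @streak.delete(job_id)
      observed
    end
  end
end
```

With outages lasting a few minutes, `needed` multiplied by the polling interval would have to exceed the outage length for this to help.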