Interactive jobs "disappearing"

sjt · September 7, 2020, 2:15pm

We’re running Open OnDemand 1.7 with slurm.

Occasionally we get users reporting their interactive jobs are no longer available/listed in the portal. The job is still running in slurm and you can see the jobid in the job tracker.

We’ve found that when this happens, the interactive job data is missing from:
~USERNAME/ondemand/data/sys/dashboard/batch_connect/db

We can see files for other jobs in there, but the “missing” jobs have no matching entry in there.

How does that directory get managed? My supposition is that at some point, the user web-server process is reaped, and when they return, maybe slurm is slow to respond and so something partially cleans the directory… Does that sound plausible? Or is there something else in play here?

Thanks

bennet · September 7, 2020, 2:48pm

We would like to know about this as well, since we’ve seen something
similar here, also with Slurm, and I believe we saw it with both Open
OnDemand 1.6 and 1.7.

kalattar · September 8, 2020, 1:52pm

Is this happening to any specific interactive app jobs? What I’m trying to say is, is there a pattern for certain interactive jobs?

efranz · September 8, 2020, 6:11pm

This sounds like a bug with the Slurm adapter. Perhaps it is the one described in “squeue timeout reports as job being completed” (Issue #149 in OSC/ood_core on GitHub). The code that is used to check the status of the job is https://github.com/OSC/ood_core/blob/5e24c28d1a2d5801335fc2a00e51f31fe2e122e2/lib/ood_core/job/adapters/slurm.rb#L456-L474. Essentially it boils down to:

1. Call squeue with the job id as the argument
2. If no job data is returned and the exit status is 0, assume job is completed
3. If exit status is not 0 and output message contains the text “Invalid job id specified”, assume job is completed

It is likely there are edge cases where the above algorithm produces a “completed state” for a job that has not actually completed. In 1.6 and 1.7, the card would then disappear, but the job would still be running, as you report. In 1.8, the card would stick around for debugging purposes, but display as “completed” and not change back to the “running state”.
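To make the edge case concrete, here is a minimal sketch of that decision logic as a pure function. It is not the ood_core code: the helper name `classify_squeue_result`, the return symbols, and the sample stderr text are all assumptions made for illustration.

```ruby
# Hypothetical helper, not the ood_core implementation: classifies the result
# of one squeue call from its stdout, stderr and exit status, following the
# steps described above.
def classify_squeue_result(stdout, stderr, exit_ok)
  if exit_ok
    # Exit status 0: no job data on stdout is taken to mean "completed".
    stdout.strip.empty? ? :completed : :active
  elsif stderr.include?("Invalid job id specified")
    # Slurm no longer knows the job id, so the job is assumed completed.
    :completed
  else
    # Any other non-zero exit is surfaced as an error.
    :error
  end
end

# Possible edge case (stderr text is illustrative): exit status 0 and empty
# stdout, with a timeout message on stderr that is never inspected.
classify_squeue_result("", "slurm_load_jobs error: Socket timed out", true)
# => :completed
```

If squeue ever behaves like the last call, a job that is still running would be classified as completed, which would match the disappearing cards.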

Do you have recommendations for a better algorithm to determine the completed state of a job? In “squeue timeout reports as job being completed” (Issue #149 in OSC/ood_core on GitHub), @jeff.ohrstrom suggests that we may need to treat anything written to standard error as an error, even if the exit status is 0. Right now if exit status is 0 we ignore anything in stderr:

  o, e, s = Open3.capture3(env, cmd, *(args.map(&:to_s)), stdin_data: stdin.to_s)
  s.success? ? o : raise(Error, e)
end

https://github.com/OSC/ood_core/blob/5e24c28d1a2d5801335fc2a00e51f31fe2e122e2/lib/ood_core/job/adapters/slurm.rb#L305-L306
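A rough sketch of what the stricter behaviour could look like, assuming we simply refuse any result that has text on stderr. The names `call_strict` and `CommandError` are hypothetical and are not part of ood_core.

```ruby
require "open3"

# Hypothetical error class standing in for the adapter's own Error.
class CommandError < StandardError; end

# Hypothetical stricter variant of the wrapper quoted above: the call only
# succeeds when the exit status is 0 AND nothing was written to stderr.
def call_strict(cmd, *args, env: {}, stdin: "")
  o, e, s = Open3.capture3(env, cmd, *args.map(&:to_s), stdin_data: stdin.to_s)
  raise CommandError, e unless s.success? && e.strip.empty?
  o
end
```

The trade-off is that harmless warnings on stderr would also be raised as errors, so the caller would need to decide whether to retry or to keep the previous job state in that case.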

sjt · September 8, 2020, 8:25pm

Hmm… I think in our case mostly Matlab, though I haven’t dug into all the tickets we’d had internally.

sjt · September 8, 2020, 8:31pm

Yes, I could believe it might be related to this issue. I know from parsing some of the other Slurm info in some internal tooling that it’s not exactly easy.

Maybe the REST API for slurm would make this easier, but that would require substantial reworking I suspect!

Thanks for the pointers, will see if we can obtain some more debug data from squeue. This might be tricky if it’s transient timeouts though.

bennet · September 9, 2020, 12:12pm

We haven’t noticed a pattern as far as applications. I think we have
seen this with Matlab, Rstudio, Jupyter, and with Remote Desktop,
which is pretty much everything we have.

We have seen timeouts with Slurm commands, and I am not sure where we
are with resolving those. The period during which the timeouts occur
is usually short, only a few minutes, then Slurm starts responding
again. If OOD is polling during one of those short outages, then that
would be quite plausible as an explanation.

system · May 19, 2022, 5:59pm

This topic was automatically closed 180 days after the last reply. New replies are no longer allowed.
