Occasionally we get users reporting that their interactive jobs are no longer available/listed in the portal. The job is still running in Slurm, and you can see the job ID in the job tracker.
We’ve found that when this happens, the interactive job data is missing from:
~USERNAME/ondemand/data/sys/dashboard/batch_connect/db
We can see files for other jobs in there, but the “missing” jobs have no matching entry in there.
How does that directory get managed? My supposition is that at some point the user's web-server process is reaped, and when they return, maybe Slurm is slow to respond and so something partially cleans the directory… Does that sound plausible? Or is there something else in play here?
We would like to know about this as well, as we’ve seen something similar here, also with Slurm, and I believe we saw it with both Open OnDemand 1.6 and 1.7.
If no job data is returned and the exit status is 0, assume the job is completed
If the exit status is not 0 and the output message contains the text “Invalid job id specified”, assume the job is completed
It is likely there are edge cases where the above algorithm produces a “completed” state for a job that has not actually completed. In 1.6 and 1.7, the card would then disappear, but the job would still be running, as you report. In 1.8, the card would stick around for debugging purposes, but display as “completed” and not change back to the “running” state.
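That two-rule heuristic can be sketched as a pure function over the command's exit status and output. This is a hypothetical illustration, not the actual Open OnDemand adapter code; the function name and signature are my own:

```python
def job_appears_completed(returncode: int, output: str) -> bool:
    """Apply the two rules described above to one squeue-style query result."""
    # Rule 1: exit status 0 but no job data returned -> assume completed
    if returncode == 0 and output.strip() == "":
        return True
    # Rule 2: non-zero exit status and the Slurm "Invalid job id specified"
    # message in the output -> assume completed (job already purged)
    if returncode != 0 and "Invalid job id specified" in output:
        return True
    return False
```

Note that neither rule matches a transient failure such as a connection timeout (non-zero exit status, different message), which is exactly where the edge cases would come from: the outcome then depends on how the caller treats an unclassified result.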
Yes, I could believe it might be related to this issue. I know from parsing some of the other Slurm info in some internal tooling that it’s not exactly easy.
Maybe the REST API for Slurm would make this easier, but that would require substantial reworking, I suspect!
Thanks for the pointers; we will see if we can obtain some more debug data from squeue. This might be tricky if it’s transient timeouts, though.
We haven’t noticed a pattern as far as applications go. I think we have seen this with MATLAB, RStudio, Jupyter, and with Remote Desktop, which is pretty much everything we have.
We have seen timeouts with Slurm commands, and I am not sure where we are with resolving those. The period during which the timeouts occur is usually short, only a few minutes, and then Slurm starts responding again. If OOD is polling during one of those short outages, then that would be quite plausible as an explanation.
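For gathering that debug data, one option is a small wrapper that logs the timestamp, exit status, and raw output of each poll, so we can see exactly what squeue returned during one of those short outages. A hypothetical sketch; the command, job ID, and timeout value are placeholders, not anything Open OnDemand itself uses:

```python
import datetime
import subprocess

def probe(cmd=("squeue", "--noheader", "-j", "12345"), timeout=30):
    """Run one scheduler poll and return a single log line describing it."""
    ts = datetime.datetime.now().isoformat(timespec="seconds")
    try:
        r = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
        # Record exit status plus raw stdout/stderr so transient errors
        # (e.g. connection timeouts) are captured verbatim.
        return f"{ts} rc={r.returncode} out={r.stdout.strip()!r} err={r.stderr.strip()!r}"
    except subprocess.TimeoutExpired:
        # The command itself hung past our deadline.
        return f"{ts} timed out after {timeout}s"
```

Running this from cron (or just a loop) every minute and appending the result to a file should show whether the "missing" jobs line up with polls where squeue returned a non-zero status, empty output, or hung entirely.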