Jupyter Launched via Docker and Cleaning Up Containers

I’m using Docker to launch NVIDIA containers to run Jupyter. Unfortunately, the containers are not stopped when the job completes, and resources are withheld as a result. I’m thinking I could handle this by generating a random string in before.sh.erb, saving it as an env var, using it as part of the docker run command in script.sh.erb, and then cleaning up in after.sh.erb. Does this sound about right?
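Roughly, the flow I have in mind looks like this (the name format and the `docker stop` cleanup are just a sketch, not something I've wired into OOD yet):

```shell
# before.sh.erb (sketch): generate a unique container name and export it
export CONTAINER_NAME="jupyter-$(head -c 512 /dev/urandom | tr -dc 'a-z0-9' | head -c 8)"

# script.sh.erb (sketch): launch under that name so it can be found later, e.g.
#   docker run --rm --name "$CONTAINER_NAME" nvcr.io/nvidia/tensorflow:20.12-tf2-py3 ...

# cleanup (sketch): stop the container by name
#   docker stop "$CONTAINER_NAME"
echo "$CONTAINER_NAME"
```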

A related question is, am I reinventing the wheel here? I would think this would be a common need. Especially since DeepOps uses OOD. Makes me wonder if I’m overlooking a prior solution to this…
Thanks, Clark

The after.sh.erb doesn’t run after the job; it runs after the script.sh.erb.

As to the actual issue you’re facing, I’m not sure what’s going on. What scheduler are you using? It seems like the scheduler should know that the container was launched as a part of the job (or does indeed know because it continues to withhold resources) and do what’s required to stop it when the job stops.

Which is to say, OOD doesn’t play much of a role at this point. Once we schedule the job it’s just a script running in the scheduler’s hands. I’m wondering if there’s some misconfiguration on your scheduler’s side that allows this. Can you replicate similar behavior from the command line? Like running a container that sleeps for 7200 seconds when the job only has a walltime of 3600?
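Something like this minimal batch script would reproduce it (the partition/account flags are omitted and the alpine image is just a stand-in):

```shell
#!/bin/bash
#SBATCH --job-name=docker-cgroup-test
#SBATCH --time=00:05:00          # 5-minute walltime on purpose
# Reproduction sketch: the container sleeps far longer than the walltime.
docker run --rm alpine sleep 7200
# After the scheduler kills the job, check from the node whether the
# container outlived it:
#   docker ps --filter "ancestor=alpine"
```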

This is slurm. My script runs this:

docker run --rm \
  --mount type=bind,source=/home/myuser/ondemand/data/sys/dashboard/batch_connect/dev/jupyter-docker/output/63a4aa68-7d37-4a5d-8100-b2b146a497c6/config.py,target=/config.py \
  --mount type=bind,source=/home/myuser,target=/home/myuser \
  -p 31636:31636 \
  nvcr.io/nvidia/tensorflow:20.12-tf2-py3 \
  jupyter lab --allow-root --config=/config.py

That should be running the container in the foreground (versus -d which would make it detached). Any thoughts as to why the container keeps running after the job completes?

My guess is it’s not a part of the job’s cgroup somehow. We don’t have docker at our site, but on my machine the default cgroup seems to be systemd’s. Maybe try docker run with --cgroup-parent set to the job’s cgroup (which you can read out of /proc/self/cgroup). Seems like you’d also need slurm.conf configured with ProctrackType=proctrack/cgroup.
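To be concrete, each line of /proc/self/cgroup has the form `hierarchy-ID:controllers:path`, so extracting the job step’s cgroup path would look something like this (the sample line imitates what Slurm’s proctrack/cgroup produces; on a real node you’d read the file itself):

```shell
# Sketch: pull the cgroup path (third colon-delimited field) out of a
# /proc/self/cgroup-style line, to hand to docker's --cgroup-parent.
# A sample line stands in for the real file here.
sample='1:cpuset:/slurm/uid_1000/job_42/step_0'
cgroup_path=$(printf '%s\n' "$sample" | cut -d: -f3)
echo "$cgroup_path"
# On a real node, roughly:
#   cgroup_path=$(cut -d: -f3 /proc/self/cgroup | head -1)
#   docker run --rm --cgroup-parent="$cgroup_path" ...
```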

Also, this could be a reason to avoid Docker and maybe try Podman? Glancing at the Slurm docs for containers, apparently even NVIDIA has a container runtime, enroot. Though I’ve only just heard about it and can’t really say much about it. But I can speak to Podman’s usefulness a lot: it’s a completely unprivileged, drop-in replacement for Docker, so you don’t have to worry about stuff like this (let alone the security!).

We looked at Podman, but it seemed like it would require us to manage ranges of subuids/subgids for onboarded users, which isn’t desirable.
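For context, that bookkeeping means per-user entries in /etc/subuid and /etc/subgid (format `user:start:count`); the usernames and ranges below are purely illustrative:

```
# /etc/subuid (and similarly /etc/subgid)
alice:100000:65536
bob:165536:65536
```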

Trying to run straight Docker looks too problematic to consider.

Will take a look at enroot…

Also worth mentioning: it works fine with Singularity-converted containers. I was just hoping to avoid the conversion step. But aside from that, it seems like a pretty clean solution.

Thanks for the pointers…