Using Multi-Instance GPU (MIG)

Hi OSC community,
We are looking at setting up a GPU environment for undergraduates here that will let them use MATLAB, Jupyter, etc. over Open OnDemand. I wanted to check that OOD is happy using MIG. That way I can get some NVIDIA A100s and split each one into 7, which gives more GPU for the buck and also saves some precious rack space.
But is it possible?
Cheers,
Martyn

Martyn:

The general response I like to give when someone asks whether Open OnDemand can do “X” is that if a knowledgeable/sophisticated client on your systems can do “X” via some existing combination of command line tools / scripts / X Windows, then yes, OOD can do “X”. OOD relies heavily on the existing underlying system software such as the operating system, resource manager, module system, etc.

You didn’t state which scheduler you are using, but assuming you are using Slurm, I’m pretty sure Slurm is able to correctly handle MIG scheduling at this point. You’d just need to ensure that when you configure your apps in OOD they are passing the correct flags to Slurm for that.
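As a rough illustration of "passing the correct flags", here is a minimal sketch that builds a Slurm GRES request for a MIG slice. The `--gres=gpu:<type>:<count>` syntax is standard Slurm; the helper name `slurm_gpu_args` and the type name `1g.5gb` are assumptions for illustration only. The real type string depends on how your gres.conf and slurm.conf name the MIG devices.

```python
from typing import List


def slurm_gpu_args(gres_type: str, count: int = 1) -> List[str]:
    """Build the sbatch/srun argument requesting `count` GPUs of one GRES type."""
    if count < 1:
        raise ValueError("count must be at least 1")
    # --gres=gpu:<type>:<count> is Slurm's generic-resource request syntax.
    return [f"--gres=gpu:{gres_type}:{count}"]


# "1g.5gb" is a placeholder type name (assumption); use whatever type
# your cluster actually reports, e.g. via `scontrol show node`.
print(slurm_gpu_args("1g.5gb"))
```

In an OOD interactive app, arguments like these would typically end up in the scheduler-native arguments of the app's submit configuration, but check how your own apps are wired before copying this.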

We use Nvidia A100 nodes with MIG for a dedicated interactive partition at New Mexico State University. OOD itself doesn’t need special integration, so all you need to do is make sure your interactive apps are set up to leverage GPUs. If you want an example of this, take a look at our Persistent Shell Application.

Slurm support for MIG is where things can get a little trickier. If you want to use the NVML autodetect, you need to be running at least Slurm 21.08.x, and you will need to build Slurm yourself, as OpenHPC does not build Slurm with Nvidia NVML support. If that is not an option, you can hardcode the gres values, but you will want to look up the Nvidia docs on that, as it is a bit complex.

Assuming you have the Slurm support handled, you will also need to create a custom service that starts before slurmd and creates the MIG devices after every boot/reboot. MIG devices are unconfigured, or lost, after every reboot, so you need to handle this alongside the Nvidia persistence daemon.
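To make the boot-time step a bit more concrete, here is a minimal sketch of what such a service could run. It is not our actual service, just an illustration under stated assumptions: the function names are made up, GPU index 0 is a placeholder, and profile ID 19 is assumed to be the smallest slice on an A100 (confirm on your hardware with `nvidia-smi mig -lgip`). It defaults to a dry run that only prints the commands.

```python
import subprocess
from typing import List


def mig_setup_commands(gpu_index: int, profile_id: int, instances: int) -> List[List[str]]:
    """Return the nvidia-smi commands that enable MIG and create instances on one GPU."""
    profiles = ",".join([str(profile_id)] * instances)
    return [
        # Switch the GPU into MIG mode.
        ["nvidia-smi", "-i", str(gpu_index), "-mig", "1"],
        # Create the GPU instances; -C also creates their default compute instances.
        ["nvidia-smi", "mig", "-i", str(gpu_index), "-cgi", profiles, "-C"],
    ]


def apply(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print each command, or execute it when dry_run is False."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True makes the service fail loudly if nvidia-smi fails.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Assumed values: GPU 0, profile ID 19, seven slices per card.
    apply(mig_setup_commands(gpu_index=0, profile_id=19, instances=7))
```

A script like this would be wrapped in a oneshot systemd unit ordered before slurmd, so that the MIG devices exist by the time Slurm inspects the node.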

Hope this helps.

edit:
Also, I forgot to mention that Nvidia MIG does not support 3D acceleration, so keep that in mind for your intended use case. CUDA works just fine, but your applications/libraries need to support CUDA 11+. A good example: if you want to use TensorFlow, you need to be using at least version 2.5 or it won’t work with MIG GPUs.
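If you want to catch an outdated install before students hit it, a small version check along these lines could go into a job prolog or a notebook's first cell. This is only a sketch: the helper name `meets_minimum` is hypothetical, and the 2.5 threshold is simply the TensorFlow figure mentioned above.

```python
from typing import Tuple


def meets_minimum(version: str, minimum: Tuple[int, int] = (2, 5)) -> bool:
    """Return True if a 'major.minor[.patch]' version string is at least `minimum`."""
    major, minor = (int(part) for part in version.split(".")[:2])
    # Tuple comparison handles cases like 2.10 > 2.5 correctly.
    return (major, minor) >= minimum


print(meets_minimum("2.4.1"))   # False
print(meets_minimum("2.10.0"))  # True
```

In a real environment you would pass in `tensorflow.__version__` rather than a literal string.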