Multi-GPU, Xorg and VirtualGL

Hi,

I found VirtualGL in the OOD apps, which almost answers what I am after, but I have found handling multiple GPUs with VirtualGL and Slurm tricky to get right. The problem is that when I run two jobs on the same node, it is unclear how Xorg decides which device each one should use. I found https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/tesla-product-literature/remote-viz-tesla-gpus.pdf, which is a bit dated now (2014) and doesn’t cover the multiple-GPU case. How have other sites done this?

In Prolog I have:

if [[ "$SLURM_JOB_CONSTRAINTS" == *"startx"* ]]
then
  # launch the X server on all the nodes.
  # clush not required; the prolog runs on every node.
  #clush -b -w $SLURM_JOB_NODELIST "/usr/bin/Xorg :0 &">/tmp/startx_log.txt
  /usr/bin/Xorg :0 > /tmp/startx0_log.txt 2>&1 &
  #/usr/bin/Xorg :1 > /tmp/startx1_log.txt 2>&1 &
  # wait a bit to ensure X is up before moving forward
  sleep 2
fi
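
One way to extend this to multiple jobs per node is to start one X display per GPU, so each job can then be pointed at the display whose BusID matches its allocated device. A sketch only, assuming per-GPU config files (`/etc/X11/xorg-gpu0.conf`, `/etc/X11/xorg-gpu1.conf`, each carrying one GPU's BusID); those file names and the `start_x_servers` helper are hypothetical:

```shell
# Sketch: start one X server per visible GPU so that each job on the
# node can use its own display (:0 for GPU 0, :1 for GPU 1, ...).
# The per-GPU config file names (/etc/X11/xorg-gpu0.conf, ...) are
# assumptions; each file would hold a Device section with that GPU's BusID.
start_x_servers() {
    local ngpus=$1 i
    for ((i = 0; i < ngpus; i++)); do
        /usr/bin/Xorg ":$i" -config "/etc/X11/xorg-gpu${i}.conf" \
            > "/tmp/startx${i}_log.txt" 2>&1 &
    done
}

if [[ "${SLURM_JOB_CONSTRAINTS:-}" == *startx* ]]; then
    start_x_servers "$(nvidia-smi --list-gpus | wc -l)"
    sleep 2   # give the X servers a moment to come up
fi
```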

And in xorg.conf I have only one of the GPU devices defined for now:

Section "Device"
    Identifier     "Device0"
    Driver         "nvidia"
    VendorName     "NVIDIA Corporation"
    BoardName      "Tesla P100-PCIE-16GB"
    BusID          "PCI:59:0:0"
EndSection
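
For reference, the decimal `PCI:bus:device:function` form that Xorg wants can be derived from the hex PCI ID that `nvidia-smi --query-gpu=pci.bus_id --format=csv,noheader` prints, which makes it easy to generate a Device section per GPU. A small sketch (the `busid_to_xorg` helper name is our own):

```shell
# Convert an nvidia-smi PCI bus ID (hex, e.g. 00000000:3B:00.0) into
# the decimal "PCI:bus:device:function" form Xorg's BusID expects.
busid_to_xorg() {
    local id=$1
    local bus=${id#*:};  bus=${bus%%:*}   # hex bus, e.g. 3B
    local dev=${id##*:}; dev=${dev%%.*}   # hex device, e.g. 00
    local fn=${id##*.}                    # function, e.g. 0
    printf 'PCI:%d:%d:%d\n' "0x$bus" "0x$dev" "0x$fn"
}

busid_to_xorg 00000000:3B:00.0   # -> PCI:59:0:0
```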

For the moment I assume we would have to use exclusive mode on the GPUs to make sure jobs don’t clash. Once we have Xorg and a display we can tell VirtualGL to use it, but for now we are stuck being unable to share a GPU node whenever a user requests VirtualGL.

Any ideas appreciated (and great Matlab talk this evening).

Tom (Cardiff University)

If you go with the method I wrote at the end of that other thread you don’t need any background X server, and it all just works correctly, rendering on the job’s allocated GPU.

Thanks. It does seem the most convenient and removes the need for X server complications. I will give it a try and report back any worthwhile feedback.

@Micket Would you happen to have the identifier for the ‘other thread’? I am not finding it easily. We haven’t gone here just yet, but we plan to, and it would be great to have a link to the other information for future reference.

@bennet - the link is in my original post if it helps…

I have managed to test it, but I am seeing:

$ vglrun +v glxgears
[VGL] Shared memory segment ID for vglconfig: 163869
[VGL] VirtualGL v2.6.95 64-bit (Build 20211022)
[VGL] Opening EGL device /dev/dri/card1
[VGL] ERROR: in init3D--
[VGL]    194: EGL_EXT_platform_device extension not available

I am running this in a Singularity container, so that adds an extra layer of complexity. Any clues what I might be missing? Maybe the EGL NVIDIA library is not being included in the container when using --nv?

It’s really quite simple: get the latest VirtualGL, open permissions on the render devices (cgroup limits on the GPU itself still apply), and export the right VGL_DISPLAY for the job’s GPU.
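
That last step can be scripted under Slurm. A sketch (`vgl_display_for_job` is a helper name of our own; the assumption that GPU index N maps to `/dev/dri/cardN` should be verified per node, e.g. via `/dev/dri/by-path`):

```shell
# Map the job's first allocated GPU (CUDA_VISIBLE_DEVICES is set by
# Slurm's gres plugin) to a DRM device node for VirtualGL's EGL backend.
# Assumption: GPU index N corresponds to /dev/dri/cardN on this node.
vgl_display_for_job() {
    local gpu=${CUDA_VISIBLE_DEVICES%%,*}   # first allocated GPU index
    printf '/dev/dri/card%d\n' "${gpu:-0}"  # fall back to card0
}

export VGL_DISPLAY=$(vgl_display_for_job)
```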

@tomgreen66 Yes, that seems likely. I mean, it doesn’t attempt to bind mount any CUDA libs into the container either, right? In fact, that it mounts nvidia-smi and that this even works is already kind of surprising, since the container OS might have a completely different glibc.

From: centos:8

%post
    curl https://virtualgl.org/pmwiki/uploads/Downloads/VirtualGL.repo -o /etc/yum.repos.d/VirtualGL.repo
    printf "[nvidia]\nname=nvidia\nbaseurl=http://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64\ngpgcheck=0" > /etc/yum.repos.d/nvidia.repo
    yum install -y epel-release
    yum repolist
    yum install -y VirtualGL glx-utils nvidia-driver-libs

Works for me with the EGL backend.

Hi,

Thanks for the help. The vglserver_config step (that now has options for just EGL) when upgrading VirtualGL on the host was not rerun so didn’t set the permissions correctly on the /dev/dri/renderD* device as you mention in post. Also added /usr/share/glnvd from the host in the Singularity container and everything now seems to work as expected so thanks for that. Will do some tests but looking good and as you say much easier than messing around with Xorg configuration.