Multi-GPU, Xorg and VirtualGL

Hi,

I found VirtualGL in the OOD apps, which almost answers what I am after, but I have found handling multiple GPUs with VirtualGL and Slurm tricky to get right. The problem is that when I run two jobs on the same node, it is unclear how Xorg decides which device each one should use. I found https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/tesla-product-literature/remote-viz-tesla-gpus.pdf, which is a bit dated now (2014) and doesn’t cover the multiple-GPU case. How have other sites done this?

In Prolog I have:

if [[ "$SLURM_JOB_CONSTRAINTS" == *"startx"* ]]
then
  # launch the X server on all the nodes.
  # clush not required; the prolog runs on every node.
  #clush -b -w $SLURM_JOB_NODELIST "/usr/bin/Xorg :0 &">/tmp/startx_log.txt
  /usr/bin/Xorg :0 > /tmp/startx0_log.txt 2>&1 &
  #/usr/bin/Xorg :1 > /tmp/startx1_log.txt 2>&1 &
  # wait a bit to ensure X is up before moving forward
  sleep 2
fi
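
One way to extend this to multiple jobs per node is to start one X display per GPU, so each job can then be pointed at the display whose BusID matches its allocated device. A sketch only, assuming per-GPU config files (`/etc/X11/xorg-gpu0.conf`, `/etc/X11/xorg-gpu1.conf`, each carrying one GPU's BusID); those file names and the `start_x_servers` helper are hypothetical:

```shell
# Sketch: start one X server per visible GPU so that each job on the
# node can use its own display (:0 for GPU 0, :1 for GPU 1, ...).
# The per-GPU config file names (/etc/X11/xorg-gpu0.conf, ...) are
# assumptions; each file would hold a Device section with that GPU's BusID.
start_x_servers() {
    local ngpus=$1 i
    for ((i = 0; i < ngpus; i++)); do
        /usr/bin/Xorg ":$i" -config "/etc/X11/xorg-gpu${i}.conf" \
            > "/tmp/startx${i}_log.txt" 2>&1 &
    done
}

if [[ "${SLURM_JOB_CONSTRAINTS:-}" == *startx* ]]; then
    start_x_servers "$(nvidia-smi --list-gpus | wc -l)"
    sleep 2   # give the X servers a moment to come up
fi
```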

And in xorg.conf I have only one of the GPU devices defined for now:

Section "Device"
    Identifier     "Device0"
    Driver         "nvidia"
    VendorName     "NVIDIA Corporation"
    BoardName      "Tesla P100-PCIE-16GB"
    BusID          "PCI:59:0:0"
EndSection
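
For reference, the decimal `PCI:bus:device:function` form that Xorg wants can be derived from the hex PCI ID that `nvidia-smi --query-gpu=pci.bus_id --format=csv,noheader` prints, which makes it easy to generate a Device section per GPU. A small sketch (the `busid_to_xorg` helper name is our own):

```shell
# Convert an nvidia-smi PCI bus ID (hex, e.g. 00000000:3B:00.0) into
# the decimal "PCI:bus:device:function" form Xorg's BusID expects.
busid_to_xorg() {
    local id=$1
    local bus=${id#*:};  bus=${bus%%:*}   # hex bus, e.g. 3B
    local dev=${id##*:}; dev=${dev%%.*}   # hex device, e.g. 00
    local fn=${id##*.}                    # function, e.g. 0
    printf 'PCI:%d:%d:%d\n' "0x$bus" "0x$dev" "0x$fn"
}

busid_to_xorg 00000000:3B:00.0   # -> PCI:59:0:0
```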

For the moment I assume we would have to use exclusive mode on the GPUs to make sure jobs don’t clash. Once we have Xorg and a display we can tell VirtualGL to use it, but for now we are stuck being unable to share a GPU node whenever a user requests VirtualGL.

Any ideas appreciated (and great Matlab talk this evening).

Tom (Cardiff University)

If you go with the method I wrote at the end of that other thread you don’t need any background X server, and it all just works correctly, rendering on the job’s allocated GPU.

Thanks. It does seem the most convenient and removes the need for X server complications. I will give it a try and report back any worthwhile feedback.

@Micket Would you happen to have the identifier for the ‘other thread’? I am not finding it easily. We haven’t gone here just yet, but we plan to, and it would be great to have a link to the other information for future reference.

@bennet - the link is in my original post if it helps…

I have managed to test it, but I am seeing:

$ vglrun +v glxgears
[VGL] Shared memory segment ID for vglconfig: 163869
[VGL] VirtualGL v2.6.95 64-bit (Build 20211022)
[VGL] Opening EGL device /dev/dri/card1
[VGL] ERROR: in init3D--
[VGL]    194: EGL_EXT_platform_device extension not available

I am running this in a Singularity container, so that adds an extra layer of complexity. Any clues what I might be missing? Maybe the EGL NVIDIA library is not being included in the container when using --nv?

It’s really quite simple: get the latest VirtualGL, open permissions on the render devices (cgroup limits on the GPU itself still apply), and export the right VGL_DISPLAY for the job’s GPU.
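
That last step can be scripted under Slurm. A sketch (`vgl_display_for_job` is a helper name of our own; the assumption that GPU index N maps to `/dev/dri/cardN` should be verified per node, e.g. via `/dev/dri/by-path`):

```shell
# Map the job's first allocated GPU (CUDA_VISIBLE_DEVICES is set by
# Slurm's gres plugin) to a DRM device node for VirtualGL's EGL backend.
# Assumption: GPU index N corresponds to /dev/dri/cardN on this node.
vgl_display_for_job() {
    local gpu=${CUDA_VISIBLE_DEVICES%%,*}   # first allocated GPU index
    printf '/dev/dri/card%d\n' "${gpu:-0}"  # fall back to card0
}

export VGL_DISPLAY=$(vgl_display_for_job)
```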

@tomgreen66 Yes, that seems likely. I mean, it doesn’t attempt to bind mount any CUDA libs into the container either, right? In fact, that it mounts nvidia-smi and that this even works is already kind of surprising, since the container OS might have a completely different glibc.

From: centos:8

%post
    curl https://virtualgl.org/pmwiki/uploads/Downloads/VirtualGL.repo -o /etc/yum.repos.d/VirtualGL.repo
    printf "[nvidia]\nname=nvidia\nbaseurl=http://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64\ngpgcheck=0" > /etc/yum.repos.d/nvidia.repo
    yum install -y epel-release
    yum repolist
    yum install -y VirtualGL glx-utils nvidia-driver-libs

Works for me with the EGL backend.

Hi,

Thanks for the help. The vglserver_config step (that now has options for just EGL) when upgrading VirtualGL on the host was not rerun so didn’t set the permissions correctly on the /dev/dri/renderD* device as you mention in post. Also added /usr/share/glnvd from the host in the Singularity container and everything now seems to work as expected so thanks for that. Will do some tests but looking good and as you say much easier than messing around with Xorg configuration.