I found VirtualGL in OOD apps, which almost answers what I am after, but I have found handling multiple GPUs with VirtualGL and Slurm tricky to get right. The problem is how Xorg decides which device to use when two jobs run on the same node. I found https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/tesla-product-literature/remote-viz-tesla-gpus.pdf which is a bit dated now (2014) and doesn't cover the multiple-GPU case. How have other sites done this?
In the Slurm prolog I have:
```shell
if [[ "$SLURM_JOB_CONSTRAINTS" == *"startx"* ]]; then
    # Launch the X server on this node.
    # clush not required, the prolog runs on every node:
    #clush -b -w $SLURM_JOB_NODELIST "/usr/bin/Xorg :0 &" > /tmp/startx_log.txt
    /usr/bin/Xorg :0 > /tmp/startx0_log.txt 2>&1 &
    #/usr/bin/Xorg :1 > /tmp/startx1_log.txt 2>&1 &
    # Wait a bit to ensure X is up before moving on.
    sleep 2
fi
```
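One way to avoid two jobs fighting over `:0` would be to give each job a display numbered after the first GPU Slurm allocated it. This is only a sketch of that idea, untested here: the per-GPU config files (`xorg.conf.gpuN`, one per card with the matching BusID) are an assumption and would have to be generated per node.

```shell
#!/bin/bash
# Hypothetical prolog fragment: one X display per allocated GPU,
# assuming pre-generated per-GPU config files xorg.conf.gpuN exist.

# Map the job's GPU allocation to a display number,
# e.g. CUDA_VISIBLE_DEVICES="1,2" -> ":1"
display_for_gpus() {
    local gpus="$1"          # comma-separated GPU indices from Slurm
    echo ":${gpus%%,*}"      # use the first index as the display number
}

if [[ "$SLURM_JOB_CONSTRAINTS" == *"startx"* ]]; then
    disp=$(display_for_gpus "${CUDA_VISIBLE_DEVICES:-0}")
    /usr/bin/Xorg "$disp" -config "xorg.conf.gpu${disp#:}" \
        > "/tmp/startx${disp#:}_log.txt" 2>&1 &
    # Wait a bit to ensure X is up before moving on.
    sleep 2
fi
```

The corresponding epilog would then need to kill the matching X server when the job ends.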
And in xorg.conf I have only one of the GPU devices defined for now:
```
Section "Device"
    Identifier "Device0"
    Driver     "nvidia"
    VendorName "NVIDIA Corporation"
    BoardName  "Tesla P100-PCIE-16GB"
    BusID      "PCI:59:0:0"
EndSection
```
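An alternative to one X server per GPU would be a single server with one Screen per Device, so jobs can target `:0.0` or `:0.1`. A sketch of what that might look like; the second BusID is a placeholder, the real value comes from `nvidia-smi --query-gpu=pci.bus_id --format=csv` (note nvidia-smi reports the bus in hex, while xorg.conf wants decimal, so 0x3B becomes 59):

```
Section "Device"
    Identifier "Device0"
    Driver     "nvidia"
    BusID      "PCI:59:0:0"
EndSection

Section "Device"
    Identifier "Device1"
    Driver     "nvidia"
    BusID      "PCI:175:0:0"    # placeholder, substitute the real bus ID
EndSection

Section "Screen"
    Identifier "Screen0"
    Device     "Device0"
EndSection

Section "Screen"
    Identifier "Screen1"
    Device     "Device1"
EndSection

Section "ServerLayout"
    Identifier "Layout0"
    Screen 0 "Screen0"
    Screen 1 "Screen1"
EndSection
```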
For the moment we assume we have to run the GPUs in exclusive mode to make sure jobs don't clash. Once we have Xorg and a display we can tell VirtualGL to use it, but for now we are stuck unable to share a GPU node whenever a user requests VirtualGL.
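If the X server exposes one screen per GPU, VirtualGL can be pointed at the right one via `vglrun -d` (or the `VGL_DISPLAY` environment variable). A sketch that builds the command from the job's allocation; the screen-N-maps-to-GPU-N assumption is site-specific, and `glxgears` is just a stand-in application:

```shell
#!/bin/bash
# Build a vglrun invocation for the job's first allocated GPU.
# Assumes screen N on display :0 is backed by GPU N (site-specific).
vgl_command() {
    local gpus="${1:-0}"         # comma-separated indices from Slurm
    local screen="${gpus%%,*}"   # first allocated GPU -> screen number
    echo "vglrun -d :0.${screen} glxgears"
}

# In the job script one would run the result, e.g.:
#   eval "$(vgl_command "$CUDA_VISIBLE_DEVICES")"
vgl_command "1"   # -> vglrun -d :0.1 glxgears
```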
Any ideas appreciated (and great MATLAB talk this evening).
Tom (Cardiff University)