VirtualGL in OOD apps

This is a question for the OOD team.

I have noticed that some of the OOD apps that OSC publishes which can use OpenGL graphics (Ansys, Matlab, VMD, Paraview, …) use VirtualGL. VGL needs an X server running on the compute node, which in turn sits on the GPU and can potentially eat away resources from CUDA computational jobs that may be running on that GPU.

So, my question is: how does OSC technically enable VirtualGL? Do you run an X server on all your compute nodes? Or start it at job start via a flag (which I don’t see in the OOD apps, so probably not)? Or do you have cheap GPUs in each node used for GL that are independent of the computational GPUs? Or something else?

I’d appreciate some details that would allow us to consider such deployment over here.

Naturally, if other HPC centers have their own solutions for compute-node GL rendering, it’d be great to hear about them.

What we do over here is have a set of standalone (interactive) nodes that run X and VGL on mid-range Nvidia GTX cards, but we don’t have any X or VGL on the cluster compute nodes. Most of our compute nodes only have onboard video cards and our GPU nodes are heavily utilized with computation, but we’d like to see if there is room for using the GPU nodes for GL with OOD apps like Ansys or Paraview.

Thanks,
MC

Hi, everybody

We are using the following configuration: a graphics node supporting the TurboVNC sessions, and all the compute nodes running X11 with the Nvidia driver. All the OpenGL programs called inside script.sh.erb are prefixed with vglrun.
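
For illustration, such a prefixed call in script.sh.erb looks roughly like the following sketch (the application module and the :0 display number are placeholders, not a copy of our actual template):

# excerpt of script.sh.erb (illustrative): run the OpenGL application through VirtualGL,
# pointing vglrun at the 3D X server that sits on the node's GPU
module load paraview
vglrun -d :0 paraview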

When I’m back in my office, I can add our config to this topic, if you’re interested.

Jean Marie

Thanks @jms27000! @mcuma sorry I didn’t see this earlier! Yes, I believe we have X11 libraries installed on all our compute nodes so that someone may run an interactive session on them, whether they have GPUs or not (and most do not; I can’t say the exact percentage, but let’s say only about 1/4 do as a rough guess).

So all compute nodes have VirtualGL libraries, and can run X11 sessions, but not all compute nodes have GPUs.

How we actually do segregation or limits, I’m not sure. If two users get scheduled on the same compute node, one requesting the GPU specifically and the other landing there without requesting it, what’s to stop the second user from using the GPU? cgroup configuration? It seems like something the scheduler should work out to limit the second user, but I can’t say for sure.

@tdockendorf may have more for you.

OK, thanks. Looks like you all run the X server whenever the nodes are up, then. Since our admins are not fans of running an X server on the compute nodes, I think we’ll stick with our current setup of dedicated interactive nodes with X server and VirtualGL, and revisit if users start demanding it on the computes (so far they don’t).

Our way of doing this with Torque/Moab is awful, but for SLURM we have set up a no_consume GRES named vis, and if a job requests the vis GRES the SLURM TaskProlog will launch X on the GPU the job has been allocated. Because the TaskProlog runs as the user and from within the job’s cgroup, the X session is limited to the requested GPU and to the CPU and memory resources available to the job. Once the job ends and the cgroup is deleted, the X session is killed along with it, since it was part of the cgroup. The SLURM cgroups limit access to GPUs as well as to memory and CPU.
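
For anyone wanting to try something similar, a rough sketch of what the relevant SLURM configuration might look like (node names and GPU counts are made up, and the exact GRES syntax can differ between SLURM versions):

# slurm.conf (sketch): cgroup task containment plus a no_consume "vis" GRES on the GPU nodes
TaskPlugin=task/cgroup
GresTypes=gpu,vis
NodeName=gpu[001-010] Gres=gpu:2,vis:no_consume:1

# cgroup.conf (sketch): restrict each job to the devices (GPUs) it was allocated
ConstrainDevices=yes

# requesting a GPU plus the vis GRES so the TaskProlog brings up X for the job
sbatch --gres=gpu:1,vis:1 job.sh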

This is the code we use:

if [[ "$SLURM_LOCALID" == "0" && "$SLURM_JOB_GRES" == *"vis"* ]]; then
  if [ -n "$CUDA_VISIBLE_DEVICES" ]; then
    FIRSTGPU=$(echo $CUDA_VISIBLE_DEVICES | tr ',' "\n" | head -1)
    setsid /usr/bin/X :${FIRSTGPU} -noreset >& /dev/null &
    sleep 2
    if [ -n "$DISPLAY" ]; then
      echo "export OLDDISPLAY=$DISPLAY"
    fi
    echo "export DISPLAY=:$FIRSTGPU"
  fi
fi

We have some extra logic in a SPANK plugin, CLI filters, and job submit filters that injects SLURM_JOB_GRES into the job’s environment so we can access it in the Prolog and TaskProlog. I can provide links to those if there is interest.
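
If you’d rather not maintain those plugins, one untested alternative (purely a sketch; scontrol’s output fields differ between SLURM versions) would be to have the TaskProlog query the job’s GRES itself instead of relying on an injected SLURM_JOB_GRES:

# sketch of an alternative check inside the TaskProlog: look the job's GRES up with scontrol
# (adjust the grep pattern to match your SLURM version's "scontrol show job" output)
if scontrol show job "$SLURM_JOB_ID" | grep -qiE '(gres|trespernode)[^=]*=.*vis'; then
  :  # launch X as in the snippet above
fi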