SGE integration


#1

I’m trying a test install of OnDemand 1.4 on a CentOS 7 VM. I have it running:

  1. Can log in
  2. Can view home directory, edit files, create directories and delete files
  3. Can start an ssh session into our HPC cluster head node
  4. I can submit a job to the cluster

But so far I cannot get OnDemand to list running jobs on the HPC cluster, nor show user-specific jobs on the cluster.
This is my cluster config file. I’m not finding anything in the logs to help. I can log in to the OnDemand server and run qstat from the CLI with no issues.

v2:
  metadata:
    title: "CoS HPC"
  login:
    host: "172.17.255.254"
  job:
    adapter: "sge"
    cluster: "CoS Cluster"
    bin: "/cm/shared/apps/sge/2011.11p1/bin/linux-x64"
    conf: "/cm/shared/apps/sge/2011.11p1/"
    sge_root: "/cm/shared/apps/sge/2011.11p1"
    libdrmaa_path: "/cm/shared/apps/sge/2011.11p1/lib/linux-x64/libdrmaa.so"
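
Since qstat works from the CLI, one thing worth ruling out is a mismatch between the paths in the config and what is actually on the OnDemand server. A minimal sketch of such a check follows; `check_sge_paths` is a hypothetical helper written for this thread, not part of OnDemand, and passing it only shows that the files exist, not that the adapter will work.

```shell
# Hypothetical helper (not part of OnDemand): report whether the paths
# named in the cluster config exist on this host.
# Arguments: <sge_root> <bin dir> <libdrmaa_path>
check_sge_paths() {
  sge_root=$1
  bin_dir=$2
  drmaa=$3
  status=0
  [ -d "$sge_root" ] || { echo "missing sge_root: $sge_root"; status=1; }
  [ -x "$bin_dir/qstat" ] || { echo "no executable qstat in: $bin_dir"; status=1; }
  [ -f "$drmaa" ] || { echo "missing libdrmaa: $drmaa"; status=1; }
  [ "$status" -eq 0 ] && echo "all configured SGE paths found"
  return "$status"
}
```

It would be called on the OnDemand server with the sge_root, bin and libdrmaa_path values from the config above as its three arguments.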


#2

@cj.keist after the release I realized that not all SGE installations treat qstat the same way. The system I originally developed for had a qstat that worked like qstat -u '*'; others work more like qstat -u $USER. There’s a fix for that behavior in the pipeline. Does your OOD user have any jobs running?


#3

Just to verify, do you know if the rpm you have installed is ondemand-1.4.10-1.el7.x86_64.rpm or ondemand-1.4.9-1.el7.x86_64.rpm? I do know that ondemand-1.4.10-1.el7.x86_64.rpm has a bugfix for SGE, though it isn’t the one that @rodgers.355 mentioned.


#4

The latest released version of ood_core fixes a crash that occurs when libdrmaa is used and the job has left the queue. That mostly impacts the Job Composer; users may notice that jobs never “complete” even though they won’t appear in the queue anymore.


#5

The qstat commands you listed do work on our SGE. Would you like output from them?


#6

The OnDemand package is: ondemand-1.4.10-1.el7.x86_64


#7

Just adding that the job list is showing jobs for the user logged in, just not showing all jobs running on the cluster right now.


#8

FWIW, our setup does the same thing. You can view your own jobs, but not all jobs on the Grid. We are running the latest Son of Grid Engine.


#9

@cj.keist and @deej, that’s what I was talking about in terms of the bug / behavior that I changed for the unreleased version of the adapter. The new behavior always calls qstat -u; the only difference for Active Jobs will be whether it passes $USER or '*'.
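
To make the difference concrete, here is a small shell sketch of the two calls. The `list_jobs` function and the `QSTAT` variable are made up for illustration; this is not how ood_core itself is written, only the two qstat invocations are taken from the thread.

```shell
# Illustration only: QSTAT and list_jobs are hypothetical names.
# QSTAT can be overridden (e.g. QSTAT=echo) to preview the arguments.
QSTAT="${QSTAT:-qstat}"

list_jobs() {
  case "$1" in
    # Quoting '*' stops the shell from expanding it to file names,
    # so qstat receives a literal * meaning "every user".
    all)  "$QSTAT" -u '*' ;;
    # Fall back to `id -un` if USER is not set in the environment.
    mine) "$QSTAT" -u "${USER:-$(id -un)}" ;;
    *)    echo "usage: list_jobs all|mine" >&2; return 2 ;;
  esac
}
```

Running `list_jobs mine` and `list_jobs all` from a shell on the OnDemand server is a quick way to see what each view should contain on a given SGE installation.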

If either of you is interested in testing the unreleased version, we can talk through a few options for doing that.


#10

I’d be happy to help with testing.


#11

@deej thanks for agreeing to help test.

The library that defines the SGE adapter is ood_core. In order to run the version of the library with the latest SGE fixes, you should add the following line to each Gemfile rooted in /var/www/ood/apps/sys/(myjobs|dashboard|file-editor|activejobs):

gem "ood_core", :git => "https://github.com/OSC/ood_core.git", :ref => "878153a"

After adding the line to the file, a sudo-er will need to run the following command in each application’s directory, which will update the application:

RAILS_ENV=production scl enable rh-git29 rh-ruby24 rh-nodejs6 -- bin/setup

This will pin the version of ood_core in each application to that reference in the library’s history, which is currently the HEAD of the job_array branch. When you are done testing or want to upgrade, you should remove that line from the Gemfile.

Then restart your PUN and you will be using the newer version of the SGE adapter.
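
The per-application steps above can be sketched as a small dry-run script. It only prints the commands for review rather than executing them; `print_pin_steps` and `APPS_ROOT` are names invented for this sketch, and the `grep` step is an added assumption to surface any ood_core entry a Gemfile might already contain before the new line is appended.

```shell
# Dry-run sketch: prints the steps for each app instead of running them,
# so a sudo-er can review them and execute them by hand.
# APPS_ROOT and print_pin_steps are hypothetical names for this example.
APPS_ROOT="${APPS_ROOT:-/var/www/ood/apps/sys}"
GEM_LINE='gem "ood_core", :git => "https://github.com/OSC/ood_core.git", :ref => "878153a"'

print_pin_steps() {
  for app in myjobs dashboard file-editor activejobs; do
    echo "cd $APPS_ROOT/$app"
    # Show any ood_core line already present in this Gemfile.
    echo "grep -n ood_core Gemfile"
    # Append the pinned gem line from the post above.
    echo "echo '$GEM_LINE' >> Gemfile"
    # Rebuild the app with the pinned library.
    echo "RAILS_ENV=production scl enable rh-git29 rh-ruby24 rh-nodejs6 -- bin/setup"
  done
}

print_pin_steps
```

Printing instead of executing is deliberate: the Gemfiles live under a system path and the edit is meant to be temporary, so it seems safer to look at each step before running it.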


#12

Hi,
Thank you for the patch. I added the gem line to the Gemfile in each of myjobs, dashboard, file-editor and activejobs. I then ran the scl command in each folder (I had to comment out the existing ood_core line in the dashboard Gemfile) and then restarted my server. So far it looks to be working!!
Will do some more testing.


#13

Thank you! I can also confirm that the patch works, and we can now switch the view between just the person’s jobs and all jobs, and all jobs are shown.

I do notice one slight oddity. On this Grid we only have one queue defined, “all.q”. Some of the jobs correctly show the “all.q” queue, while most simply show “null” as a value for the queue. It doesn’t seem to affect anything but I thought you might want to know about it.


#14

I think I’ve seen that in testing as well. At a guess, what is happening is that only jobs that explicitly set a target queue report their queue, while the others show the default value, which is the not-very-useful null.


#15

That is possible based on what I’m seeing. I’ll do some additional testing to confirm.


#16

That is exactly the case. Two jobs submitted as:
qsub testme
and
qsub -q all.q testme

show up as “null” and “all.q” respectively.