BUG - Updated OnDemand - Interactive Apps "Spill" over to other interactive apps on submission

OS: CentOS 7.7
Job Scheduler: PBS Pro

We just updated OnDemand to 1.6.20 with yum and are having some issues. We have two sets of interactive apps: “bc_desktop” and “jupyter notebooks”. When we go to a bc_desktop app, say “cluster_1”, and submit a VDI session, it works. When we then go back to the OOD interface and submit an interactive bc_desktop job on “cluster_2”, it tries to submit it on “cluster_1”. When I click the “session ID” to look at the generated files, things look fine, except for “connection.yml”, which has the host variable set to a host in “cluster_1”. The “job_script_content.sh” has the proper content for “cluster_2”, as do the remaining files.

If I then go to “cluster_3” and try submitting a job, I hit the same issue, but this time it submits to “cluster_2”. It seems like it’s always one cluster behind. If the PBS settings on cluster_3 don’t have the same queues, we get a “queue doesn’t exist” error, even though the submission script itself is correct for the intended cluster.

We’re at a loss here and can’t reliably submit jobs. We did replace the upgraded form.yml with our older form.yml in /var/www/ood/apps/sys/bc_desktop, but besides that and our custom submission scripts, we didn’t make any other changes. Any feedback is appreciated. Thanks.

This sounds strange to be sure. And to be clear, this worked fine for 1.5 (or lower) and all you did was yum update and replace the /var/www/ood/apps/sys/bc_desktop/form.yml?

My suggestion would be to quantify the behavior and the configurations. Also let us know what your older form.yml looks like for the bc_desktop app. I don’t think this is the issue; I just don’t want to rule it out.

It seems like either your clusters.d files are wrong or your apps are wrong. Though why this would work for you before, I cannot say.

Sounds like everything that is supposed to submit to cluster_1 does so correctly?

  • if so, you can rule out clusters.d/cluster_1.yml and the VDI/desktop/Jupyter apps that use it

Cluster 2 submits to cluster 1.

  • verify the clusters.d/cluster_2.yml file, specifically that the host field is correct
  • verify the apps that are supposed to use cluster_2 actually do

Cluster 3 submits to cluster 2.

  • same steps as for cluster_2 above

I guess I would advise shortening the list of unknowns and possible failure points; rule out misconfiguration entirely (a quick check is sketched below). I would also ask: does this happen for both VDI jobs and Jupyter notebooks, or just VDIs? That could help us narrow the issue as well.
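For example, something like this will show the host fields every cluster config defines (assuming the default /etc/ood/config/clusters.d location and .yml extensions):

grep -H "host:" /etc/ood/config/clusters.d/*.yml

Each cluster_N.yml should only reference its own PBS server host(s).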

The version of OOD we had before was 1.6.7 and it didn’t have the problem, but one of our other admins said that he has seen this problem before. I’ve tried to respond to each of your questions below:

this worked fine for 1.5 (or lower) and all you did was yum update and replace the /var/www/ood/apps/sys/bc_desktop/form.yml?

We were on 1.6.7 and all we did was run “sudo yum update -y” and then replaced the new form.yml with our old form.yml in /var/www/ood/apps/sys/bc_desktop as you described.

Here is our form.yml

---
attributes:
  desktop: "xfce"
  bc_vnc_idle: 0
  bc_vnc_resolution:
    required: true
  node_type: null
  num_of_nodes: null
  bc_account: null
  num_of_gpus: null

  pbs_project:
    label: PBS Project
    help: |
      <help text removed>

  desktop:
    widget: select
    label: "Desktop environment"
    options:
      - ["MATE", "mate"]
      - ["Xfce", "xfce"]

form:
  - bc_vnc_idle
  - desktop
  - bc_account
  - pbs_project
  - custom_queue
  - bc_num_slots
  - num_of_gpus
  - num_of_nodes
  - memory_requested
  - bc_num_hours
  - node_type
  - bc_queue
  - bc_vnc_resolution
  - bc_email_on_started

It seems like either your clusters.d files are wrong or your apps are wrong. Though why this would work for you before, I cannot say.

We rolled the updated VM back to a snapshot taken before we ran any of the yum updates, so it was back to CentOS 7.6 and OOD 1.6.7, and the problem persisted, which we really don’t understand. Maybe it mysteriously worked before and shouldn’t have? I don’t understand why it would all of a sudden just stop working.

Sounds like everything that is supposed to submit to cluster_1 does so correctly?

I should have specified better: submissions don’t always go to cluster_1 correctly. It always seems to be one cluster behind. If I go to cluster_3 first it will work, but if I then go to cluster_1, it tries to submit on cluster_3. One thing we found that may be helpful: if you wait a few minutes before submitting the next job, it will NOT submit to cluster_3 but will submit to cluster_1 properly. This is where we’re currently thinking the problem might be.

verify the clusters.d/cluster_2.yml file, specifically that the host field is correct
verify the apps that are supposed to use cluster_2 actually do

We’ve verified that all of our clusters.d/* files are configured correctly. They haven’t changed since they were updated.

I would also ask: does this happen for both VDI jobs and Jupyter notebooks, or just VDIs? That could help us narrow the issue as well.

It seems to happen to any and all interactive apps. VDI and Jupyter notebooks are currently the only interactive apps we have.

Just ran another test. Restarted Apache on the OOD server and signed in, then:

  • Launched an interactive app on cluster_X. It submitted successfully.
  • Attempted to launch an interactive app on cluster_Y; it submits to cluster_X and launches on cluster_X.
  • Attempted to launch an interactive app on cluster_Z; it submits to cluster_X and launches on cluster_X.
  • Attempted to launch an interactive app on a 4th cluster, cluster_A; it submits to cluster_X and stays queued there, because my other 3 jobs had already taken up all the job slots on cluster_X.

Now if I wait a few minutes and launch a job on a 5th system, it launches properly on that system. So like I mentioned, it seems like some kind of caching issue? Not so much that it’s always one cluster behind, but that it lags behind in time?

Edit: I also noticed that in the connection.yml of one of the interactive apps that gets submitted to the wrong cluster, the host line is a host in the wrong cluster. Where does connection.yml pull its information from?

OK! Yeah, it must be a caching issue. I guess the question is where? Do you have a caching layer somewhere?

The connection.yml gets generated during job execution. The host field comes from the hostname command run on the compute node. So in that sense it is always “correct”: the job really did run on that machine, in that cluster.
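Roughly speaking, and this is a simplified sketch rather than the exact template we generate, the job script does something like this on whatever node the scheduler gave it:

# simplified sketch: record where the job actually landed
host=$(hostname)
cat > connection.yml <<EOF
host: $host
EOF

So a wrong-cluster host in connection.yml just means the job really ran on that wrong cluster; the file reports wherever the job ended up.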

Can you provide a screenshot from your browser’s dev tools like the one below? Specifically, I’m interested in the first request (the 302 POST) and the next few. I’m also interested in the request payload (the form data).

Can you upload an image like this?
[screenshot of the browser dev tools network requests]

Edit: Sorry, to answer your question: we don’t have any caching layers anywhere, not intentionally at least. It seems like maybe it’s not logging out of the old server fast enough, or not getting rid of that variable? I’m really not sure. I haven’t found anywhere else that the variables are set incorrectly, though; only in connection.yml.

We thought this might be a browser issue as well, so we opened two different browsers, Firefox and Chrome, and signed into our OnDemand VM in both. We launched an interactive app on cluster_X in one browser, then moved to the other browser and launched an interactive app on a different cluster, and it still launched on cluster_X. So I don’t think it’s a browser issue.

I also went into the server’s nginx logs, and the POSTs for my user seem to be going to the correct cluster paths.

[02/Mar/2020:10:59:20 -0700] "POST /pun/sys/dashboard/batch_connect/sys/jupyter/CLUSTER_1/session_contexts HTTP/1.1" 302 150 "https://OOD_VM/pun/sys/dashboard/batch_connect/sys/jupyter/CLUSTER_1/session_contexts/new" "Mozilla/5.0 (X11; Linux x86_64)

[02/Mar/2020:10:59:43 -0700] "POST /pun/sys/dashboard/batch_connect/sys/bc_desktop/CLUSTER_2/session_contexts HTTP/1.1" 302 150 "https://OOD_VM/pun/sys/dashboard/batch_connect/sys/bc_desktop/CLUSTER_2/session_contexts/new" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.131 Safari/537.36"

I’m trying to get the browser dev tools view that looks like the picture you provided. I’ll get that when I can, but I don’t think it’s a browser issue.

Edit 2: I wasn’t able to get exactly what your picture had. What I have here is probably not at all helpful. Sorry.

OK, that seems absolutely wonky. Below is how we seem to do it, through the PBS_DEFAULT environment variable.

I’d be interested in what you find. If you find cluster_x the first time, cluster_x the second (failing to submit to cluster_y), and cluster_y the third (failing to submit to cluster_z), then we’re setting the environment wrong, i.e. caching the previous value?

If not, then we’re unable to reset the environment?

grep execve /var/log/ondemand-nginx/johrstrom/error.log
App 68132 output: [2020-03-02 14:47:49 -0500 ]  INFO "execve = [{\"PBS_DEFAULT\"=>\"pitzer-batch.ten.osc.edu\", \"LD_LIBRARY_PATH\"=>\"/opt/torque/lib64:/opt/rh/rh-nodejs10/root/usr/lib64:/opt/rh/rh-ruby25/root/usr/local/lib64:/opt/rh/rh-ruby25/root/usr/lib64:/opt/rh/httpd24/root/usr/lib64:/opt/ood/ondemand/root/usr/lib64\"}, \"/opt/torque/bin/qsub\", \"-d\", \"/users/PZS0714/johrstrom/ondemand/data/sys/dashboard/batch_connect/sys/bc_osc_jupyter_pitzer/output/6404ccce-a27d-423a-8367-0d88e6867177\", \"-N\", \"ondemand/sys/dashboard/sys/bc_osc_jupyter_pitzer\", \"-S\", \"/bin/bash\", \"-o\", \"/users/PZS0714/johrstrom/ondemand/data/sys/dashboard/batch_connect/sys/bc_osc_jupyter_pitzer/output/6404ccce-a27d-423a-8367-0d88e6867177/output.log\", \"-j\", \"oe\", \"-l\", \"walltime=01:00:00\", \"-l\", \"nodes=1:ppn=1\", \"/tmp/qsub.20200302-68132-sn45ak\"]"

So this is just the end of the file, with some info redacted and replaced with “CLUSTER_#”. There’s more to the file after running your grep command, but let me know if this is helpful. The end of each line looks like the troubling part: all of the job IDs resolve to the same cluster, which might be the problem.

App 5782 output: [2020-03-02 13:03:40 -0700 ]  INFO "execve = [{\"PBS_DEFAULT\"=>\"CLUSTER_1.\", \"PBS_EXEC\"=>\"/opt/pbs\"}, \"/opt/pbs/bin/qstat\", \"-f\", \"-t\", \"3219.CLUSTER_1\"]"
App 5782 output: [2020-03-02 13:03:46 -0700 ]  INFO "execve = [{\"PBS_DEFAULT\"=>\"CLUSTER_2.\", \"PBS_EXEC\"=>\"/opt/pbs\"}, \"/opt/pbs/bin/qstat\", \"-f\", \"-t\", \"3220.CLUSTER_1\"]"
App 5782 output: [2020-03-02 13:03:46 -0700 ]  INFO "execve = [{\"PBS_DEFAULT\"=>\"CLUSTER_3.\", \"PBS_EXEC\"=>\"/opt/pbs\"}, \"/opt/pbs/bin/qstat\", \"-f\", \"-t\", \"3221.CLUSTER_1\"]"
App 5782 output: [2020-03-02 13:03:46 -0700 ]  INFO "execve = [{\"PBS_DEFAULT\"=>\"CLUSTER_1.\", \"PBS_EXEC\"=>\"/opt/pbs\"}, \"/opt/pbs/bin/qstat\", \"-f\", \"-t\", \"3219.CLUSTER_1\"]"
App 5782 output: [2020-03-02 13:03:51 -0700 ]  INFO "execve = [{\"PBS_DEFAULT\"=>\"CLUSTER_2.\", \"PBS_EXEC\"=>\"/opt/pbs\"}, \"/opt/pbs/bin/qstat\", \"-f\", \"-t\", \"3220.CLUSTER_1\"]"
App 5782 output: [2020-03-02 13:03:51 -0700 ]  INFO "execve = [{\"PBS_DEFAULT\"=>\"CLUSTER_3.\", \"PBS_EXEC\"=>\"/opt/pbs\"}, \"/opt/pbs/bin/qstat\", \"-f\", \"-t\", \"3221.CLUSTER_1\"]"
App 5782 output: [2020-03-02 13:03:51 -0700 ]  INFO "execve = [{\"PBS_DEFAULT\"=>\"CLUSTER_1.\", \"PBS_EXEC\"=>\"/opt/pbs\"}, \"/opt/pbs/bin/qstat\", \"-f\", \"-t\", \"3219.CLUSTER_1\"]"
App 5782 output: [2020-03-02 13:03:56 -0700 ]  INFO "execve = [{\"PBS_DEFAULT\"=>\"CLUSTER_2.\", \"PBS_EXEC\"=>\"/opt/pbs\"}, \"/opt/pbs/bin/qstat\", \"-f\", \"-t\", \"3220.CLUSTER_1\"]"
App 5782 output: [2020-03-02 13:03:56 -0700 ]  INFO "execve = [{\"PBS_DEFAULT\"=>\"CLUSTER_3.\", \"PBS_EXEC\"=>\"/opt/pbs\"}, \"/opt/pbs/bin/qstat\", \"-f\", \"-t\", \"3221.CLUSTER_1\"]"
App 5782 output: [2020-03-02 13:03:56 -0700 ]  INFO "execve = [{\"PBS_DEFAULT\"=>\"CLUSTER_1.\", \"PBS_EXEC\"=>\"/opt/pbs\"}, \"/opt/pbs/bin/qstat\", \"-f\", \"-t\", \"3219.CLUSTER_1\"]"

I think we’re on the right track! That is helpful, though those are qstat calls. Can you show me the sequence of qsub commands, maybe from the series of tests in the 4th comment above (about clusters X, Y and Z)?
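Something like this should filter those execve lines down to just the submissions (same per-user error.log as the earlier grep, with your username in the path):

grep execve /var/log/ondemand-nginx/$USER/error.log | grep qsub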

Oh yeah, that makes sense. Here are the qsub commands, along with a few qstat commands from the same log file.

Those qsub commands don’t show which host they ended up executing on. I shortened the file a bit because it’s huge, but it’s just full of those qsub and qstat commands. There are a few qdel commands from when I deleted the interactive jobs, and those, like the qstat commands, show the incorrect cluster being used.

Sorry, what tests are you referring to?

App 5782 output: [2020-03-02 13:02:23 -0700 ]  INFO "execve = [\"git\", \"describe\", \"--always\", \"--tags\"]"
App 5782 output: [2020-03-02 13:02:34 -0700 ]  INFO "execve = [{\"PBS_DEFAULT\"=>\"CLUSTER_1.\", \"PBS_EXEC\"=>\"/opt/pbs\"}, \"/opt/pbs/bin/qsub\", \"-N\", \"sys/dashboard/sys/jupyter/CLUSTER_1\", \"-S\", \"/bin/bash\", \"-o\", \"/home/USERNAME/ondemand/data/sys/dashboard/batch_connect/sys/jupyter/CLUSTER_1/output/dd9fc5ce-f79e-453d-bc28-f7804ba19678/output.log\", \"-l\", \"walltime=10:00:00\", \"-j\", \"oe\", \"-l\", \"select=1:ncpus=24\", \"-P\", \"hpc\", \"-q\", \"general\", \"-N\", \"OOD-VIZ-Jupyter\"]"
App 5782 output: [2020-03-02 13:02:34 -0700 ]  INFO "execve = [{\"PBS_DEFAULT\"=>\"CLUSTER_1.\", \"PBS_EXEC\"=>\"/opt/pbs\"}, \"/opt/pbs/bin/qstat\", \"-f\", \"-t\", \"3219.CLUSTER_1\"]"
App 5782 output: [2020-03-02 13:02:34 -0700 ]  INFO "execve = [{\"PBS_DEFAULT\"=>\"CLUSTER_1.\", \"PBS_EXEC\"=>\"/opt/pbs\"}, \"/opt/pbs/bin/qstat\", \"-f\", \"-t\", \"3219.CLUSTER_1\"]"
App 5782 output: [2020-03-02 13:02:40 -0700 ]  INFO "execve = [{\"PBS_DEFAULT\"=>\"CLUSTER_2.\", \"PBS_EXEC\"=>\"/opt/pbs\"}, \"/opt/pbs/bin/qsub\", \"-N\", \"sys/dashboard/sys/bc_desktop/CLUSTER_2\", \"-S\", \"/bin/bash\", \"-o\", \"/home/USERNAME/ondemand/data/sys/dashboard/batch_connect/sys/bc_desktop/CLUSTER_2/output/24dfd2c9-ebc5-454a-9a65-2cc29470f34e/output.log\", \"-l\", \"walltime=01:00:00\", \"-j\", \"oe\", \"-l\", \"select=1:ncpus=1:ngpus=1:mem=62gb\", \"-P\", \"hpc\", \"-q\", \"general\", \"-N\", \"OOD-CLUSTER_2InteractivewithGPU\", \"-l\", \"walltime=1:0:0\", \"-v\", \"DOCKER_IMAGE='DOCKER_IMAGE'\"]"
App 5782 output: [2020-03-02 13:02:40 -0700 ]  INFO "execve = [{\"PBS_DEFAULT\"=>\"CLUSTER_2.\", \"PBS_EXEC\"=>\"/opt/pbs\"}, \"/opt/pbs/bin/qstat\", \"-f\", \"-t\", \"3220.CLUSTER_1\"]"
App 5782 output: [2020-03-02 13:02:40 -0700 ]  INFO "execve = [{\"PBS_DEFAULT\"=>\"CLUSTER_1.\", \"PBS_EXEC\"=>\"/opt/pbs\"}, \"/opt/pbs/bin/qstat\", \"-f\", \"-t\", \"3219.CLUSTER_1\"]"
App 5782 output: [2020-03-02 13:02:44 -0700 ]  INFO "execve = [{\"PBS_DEFAULT\"=>\"CLUSTER_3.\", \"PBS_EXEC\"=>\"/opt/pbs\"}, \"/opt/pbs/bin/qsub\", \"-N\", \"sys/dashboard/sys/bc_desktop/CLUSTER_3\", \"-S\", \"/bin/bash\", \"-o\", \"/home/USERNAME/ondemand/data/sys/dashboard/batch_connect/sys/bc_desktop/CLUSTER_3/output/f563659a-0f9f-4d26-8137-83b39eacf5f2/output.log\", \"-l\", \"walltime=01:00:00\", \"-j\", \"oe\", \"-l\", \"select=1:ncpus=1:mem=2gb\", \"-P\", \"hpc\"]"
App 5782 output: [2020-03-02 13:02:44 -0700 ]  INFO "execve = [{\"PBS_DEFAULT\"=>\"CLUSTER_2.\", \"PBS_EXEC\"=>\"/opt/pbs\"}, \"/opt/pbs/bin/qstat\", \"-f\", \"-t\", \"3220.CLUSTER_1\"]"
App 5782 output: [2020-03-02 13:02:44 -0700 ]  INFO "execve = [{\"PBS_DEFAULT\"=>\"CLUSTER_3.\", \"PBS_EXEC\"=>\"/opt/pbs\"}, \"/opt/pbs/bin/qstat\", \"-f\", \"-t\", \"3221.CLUSTER_1\"]"
App 5782 output: [2020-03-02 13:02:44 -0700 ]  INFO "execve = [{\"PBS_DEFAULT\"=>\"CLUSTER_1.\", \"PBS_EXEC\"=>\"/opt/pbs\"}, \"/opt/pbs/bin/qstat\", \"-f\", \"-t\", \"3219.CLUSTER_1\"]"
App 5782 output: [2020-03-02 13:02:45 -0700 ]  INFO "execve = [{\"PBS_DEFAULT\"=>\"CLUSTER_2.\", \"PBS_EXEC\"=>\"/opt/pbs\"}, \"/opt/pbs/bin/qstat\", \"-f\", \"-t\", \"3220.CLUSTER_1\"]"
App 5782 output: [2020-03-02 13:02:45 -0700 ]  INFO "execve = [{\"PBS_DEFAULT\"=>\"CLUSTER_3.\", \"PBS_EXEC\"=>\"/opt/pbs\"}, \"/opt/pbs/bin/qstat\", \"-f\", \"-t\", \"3221.CLUSTER_1\"]"
App 5782 output: [2020-03-02 13:02:45 -0700 ]  INFO "execve = [{\"PBS_DEFAULT\"=>\"CLUSTER_1.\", \"PBS_EXEC\"=>\"/opt/pbs\"}, \"/opt/pbs/bin/qstat\", \"-f\", \"-t\", \"3219.CLUSTER_1\"]"
App 5782 output: [2020-03-02 13:02:55 -0700 ]  INFO "execve = [{\"PBS_DEFAULT\"=>\"CLUSTER_2.\", \"PBS_EXEC\"=>\"/opt/pbs\"}, \"/opt/pbs/bin/qstat\", \"-f\", \"-t\", \"3220.CLUSTER_1\"]"
App 5782 output: [2020-03-02 13:02:55 -0700 ]  INFO "execve = [{\"PBS_DEFAULT\"=>\"CLUSTER_3.\", \"PBS_EXEC\"=>\"/opt/pbs\"}, \"/opt/pbs/bin/qstat\", \"-f\", \"-t\", \"3221.CLUSTER_1\"]"

So I was looking in the documentation and saw the “test configuration” page. I tried it out and noticed I was getting the same results as when I submit a job from the web page. Here’s the link I was referring to: https://osc.github.io/ood-documentation/master/installation/resource-manager/test.html

$ su <MY USER> -c 'scl enable ondemand -- bin/rake test:jobs:CLUSTER_1 RAILS_ENV=production'                                                  
Testing cluster 'CLUSTER_1'...
Submitting job...
[2020-03-02 14:22:33 -0700 ]  INFO "execve = [{\"PBS_DEFAULT\"=>\"CLUSTER_1\", \"PBS_EXEC\"=>\"/opt/pbs\"}, \"/opt/pbs/bin/qsub\", \"-N\", \"test_jobs_vizpbs\", \"-S\", \"/bin/bash\", \"-o\", \"/home/USERNAME/test_jobs/output_CLUSTER_1pbs_2020-03-02T14:22:33-07:00.log\", \"-l\", \"walltime=00:01:00\", \"-j\", \"oe\"]"
Got job id '3239.CLUSTER_1'
[2020-03-02 14:22:33 -0700 ]  INFO "execve = [{\"PBS_DEFAULT\"=>\"CLUSTER_1\", \"PBS_EXEC\"=>\"/opt/pbs\"}, \"/opt/pbs/bin/qstat\", \"-f\", \"-t\", \"3239.CLUSTER_1\"]"
Job has status of running
[2020-03-02 14:22:38 -0700 ]  INFO "execve = [{\"PBS_DEFAULT\"=>\"CLUSTER_1\", \"PBS_EXEC\"=>\"/opt/pbs\"}, \"/opt/pbs/bin/qstat\", \"-f\", \"-t\", \"3239.CLUSTER_1\"]"
Job has status of completed
Output file from job does not exist: /home/USERNAME/test_jobs/output_CLUSTER_1_2020-03-02T14:22:33-07:00.log
Test for 'CLUSTER_1' FAILED!
Finished testing cluster 'CLUSTER_1'

Now if I immediately do this on a second cluster,

$ su USERNAME -c 'scl enable ondemand -- bin/rake test:jobs:CLUSTER_2 RAILS_ENV=production'                                                 
Testing cluster 'CLUSTER_2'...
Submitting job...
[2020-03-02 14:25:46 -0700 ]  INFO "execve = [{\"PBS_DEFAULT\"=>\"CLUSTER_2\", \"PBS_EXEC\"=>\"/opt/pbs\"}, \"/opt/pbs/bin/qsub\", \"-N\", \"test_jobs_CLUSTER_2\", \"-S\", \"/bin/bash\", \"-o\", \"/home/USERNAME/test_jobs/output_CLUSTER_2_2020-03-02T14:25:46-07:00.log\", \"-l\", \"walltime=00:01:00\", \"-j\", \"oe\"]"
Got job id '3241.CLUSTER_1'
[2020-03-02 14:25:46 -0700 ]  INFO "execve = [{\"PBS_DEFAULT\"=>\"CLUSTER_2\", \"PBS_EXEC\"=>\"/opt/pbs\"}, \"/opt/pbs/bin/qstat\", \"-f\", \"-t\", \"3241.CLUSTER_1\"]"
Job has status of running
[2020-03-02 14:25:51 -0700 ]  INFO "execve = [{\"PBS_DEFAULT\"=>\"CLUSTER_2\", \"PBS_EXEC\"=>\"/opt/pbs\"}, \"/opt/pbs/bin/qstat\", \"-f\", \"-t\", \"3241.CLUSTER_1\"]"
Job has status of completed
Test for 'CLUSTER_2' PASSED!
Finished testing cluster 'CLUSTER_2'

However, if I wait about 60 seconds, which seems to be the magic amount of time, and submit again, it will then go to cluster_2. So I think there is a roughly 60-second timeout somewhere; I just don’t know where it would be. Hopefully this helps.
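To summarize the pattern with the same rake commands as above (the 60-second sleep is just what I observed, not a documented value):

su USERNAME -c 'scl enable ondemand -- bin/rake test:jobs:CLUSTER_1 RAILS_ENV=production'
su USERNAME -c 'scl enable ondemand -- bin/rake test:jobs:CLUSTER_2 RAILS_ENV=production'   # job still lands on CLUSTER_1
sleep 60
su USERNAME -c 'scl enable ondemand -- bin/rake test:jobs:CLUSTER_2 RAILS_ENV=production'   # job now lands on CLUSTER_2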

@brandon-biggs all this info is very helpful. Sorry, I didn’t mean tests like the formal rake tests, just the trial-and-error launches you’d been doing yesterday.

Looking at the logs you’ve provided, it looks like you did this

  • submit jupyter to cluster_1
  • submit a bc_desktop to cluster_2
  • submit a bc_desktop to cluster_3

I imagine the 2nd and 3rd submissions didn’t actually work, but if you look at the logs, we are setting PBS_DEFAULT to the correct cluster. Correct?

@jeff.ohrstrom You’re correct. Those were the 3 submissions I did, and the 2nd and 3rd submissions did not work.

PBS_DEFAULT was set to the correct cluster, but qsub submitted the job to the wrong one; unless I wait 60 seconds between submissions, it always submits to the wrong cluster.

Did you happen to also update PBS in this upgrade? Is there anything else that changed during the upgrade?

We updated a lot of things, but then we rolled back the entire VM, so all yum updates on the VM would have been rolled back. We did update some of our other machines, but we don’t update/install PBS with yum on our clusters, and we didn’t update any of the clusters themselves. So none of that should have changed, from my understanding.

So what would have been updated are general packages from the yum update plus the RPMs from OnDemand.

OK, I’ll be looking into why that environment variable isn’t being respected or why the previous value is being cached. There was no update in the PBSPro adapter between 1.6.7 and 1.6.20 so I’m kind of at a loss.

Can you try submitting through a terminal session and see what happens there?
Run PBS_DEFAULT=cluster_x qsub hello_world.sh, then try the same for cluster_y. Be sure to inline the environment variable on each command.
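A minimal version of that test might look like this (cluster_x and cluster_y stand in for your actual PBS server hostnames):

cat > hello_world.sh <<'EOF'
#!/bin/bash
# trivial test job: report which host actually ran it
hostname
EOF

# inline PBS_DEFAULT on each submission so nothing lingers in the shell environment
PBS_DEFAULT=cluster_x qsub hello_world.sh
PBS_DEFAULT=cluster_y qsub hello_world.sh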

I’m looking at the PBS documentation about this. I can only assume it’s PBS caching; we’re creating a new subprocess with a new environment each time. Is this a test cluster or production? If it’s a test, we could create a wrapper script like the one below and point the cluster’s bin_overrides attribute for qsub at it, to dump the environment each submission actually sees.

#!/bin/bash

# I can't say this is secure
# dump the environment this wrapper was invoked with, then hand off to the real qsub
env > "$HOME/qsub.env"

exec qsub "$@"
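Rough usage sketch (the wrapper filename and install path are just examples, not a required layout):

chmod 0755 qsub_wrapper.sh
# point the cluster's bin_overrides entry for qsub at this wrapper in its clusters.d
# file, submit an interactive job, then look at $HOME/qsub.env to see exactly which
# PBS_DEFAULT value the qsub subprocess received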

Now I’m beginning to suspect funny business with the -q option. I see you pass it for your first 2 jobs, but not the third. I wonder what happens if you submit your bc_desktop to cluster_3 multiple times in a row?

@jeff.ohrstrom Okay, I tried submitting PBS_DEFAULT=cluster_x qsub hello_world.sh, where hello_world.sh is just a simple echo $(hostname). This works correctly on each cluster.

These are production and test clusters. We have two different instances of OOD, a dev instance and a production instance, and both are experiencing these issues.

The missing -q on the third job is probably because our test cluster doesn’t require it. Our production clusters do require a queue, but our dev ones do not.

So in your dev instance I would suggest forcing the queue to include the server name too, like -q general@cluster_2, for jobs 1 and 2.

See if that fixes it by just submitting jobs 1 and 2 back and forth.

If it does fix it, then add job 3 (with no -q option) into the mix: schedule job 1, then job 3, and so on.
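From a terminal, the same idea would look something like this (queue and server names mirror the earlier examples, not your exact config):

# jobs 1 and 2: pin both PBS_DEFAULT and the queue's server
PBS_DEFAULT=cluster_1 qsub -q general@cluster_1 hello_world.sh
PBS_DEFAULT=cluster_2 qsub -q general@cluster_2 hello_world.sh
# job 3: no -q at all, relying only on PBS_DEFAULT
PBS_DEFAULT=cluster_3 qsub hello_world.sh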

@brandon-biggs did you make any progress with specifying the cluster with the queue?