Starting session panel disappears

Hi All,
Occasionally users find that the starting session panel disappears after ~1 minute and the “connect to application” panel never shows up.

Starting session panel: [screenshot]

After checking the logs, it looks like the job was submitted successfully and ran until the queue run time limit was reached.

output.log

[...]
Started at Thu Jun  4 12:41:43 2020
Terminated at Thu Jun  4 13:41:43 2020
Results reported at Thu Jun  4 13:41:43 2020
[...]
Script starting...
Waiting for RStudio Server to open port 35714...on host server
INFO:    Converting OCI blobs to SIF format
INFO:    Starting build...
Getting image source signatures
Copying blob sha256:d51af753c3d3a984351448ec0f85ddafc580680fd6dfce9f4b09fdb367ee1e3e
Copying blob sha256:fc878cd0a91c7bece56f668b2c79a19d94dd5471dae41fe5a7e14b4ae65251f6
Copying blob sha256:6154df8ff9882934dc5bf265b8b85a3aeadba06387447ffa440f7af7f32b0e1d
Copying blob sha256:fee5db0ff82f7aa5ace63497df4802bbadf8f2779ed3e1858605b791dc449425
Copying blob sha256:c86b5b5c0c52d2949c29006e38632744bb7cc3f88525959f70375692f5a90983
Copying blob sha256:2c0783ae876319800777d0ba051919b455bd5fafe8e3ba10bfcf3d8e8875bd9b
Copying blob sha256:0e5ecbaa74330c55752b884a5946537d2ae4a4da892ce817392948d9606c2daa
Copying blob sha256:e424bb47af136bcc59cde73b78f3afe16b75b9a33b951e09bcfc639f6c5020e9
Copying config sha256:70ee5049912bba1a24eeabc6a5fa59b76646a367eb3551e855587bbfa627ab89
Writing manifest to image destination
Storing signatures
2020/06/04 12:42:05  info unpack layer: sha256:d51af753c3d3a984351448ec0f85ddafc580680fd6dfce9f4b09fdb367ee1e3e
2020/06/04 12:42:06  info unpack layer: sha256:fc878cd0a91c7bece56f668b2c79a19d94dd5471dae41fe5a7e14b4ae65251f6
2020/06/04 12:42:06  info unpack layer: sha256:6154df8ff9882934dc5bf265b8b85a3aeadba06387447ffa440f7af7f32b0e1d
2020/06/04 12:42:06  info unpack layer: sha256:fee5db0ff82f7aa5ace63497df4802bbadf8f2779ed3e1858605b791dc449425
2020/06/04 12:42:06  info unpack layer: sha256:c86b5b5c0c52d2949c29006e38632744bb7cc3f88525959f70375692f5a90983
2020/06/04 12:42:06  info unpack layer: sha256:2c0783ae876319800777d0ba051919b455bd5fafe8e3ba10bfcf3d8e8875bd9b
2020/06/04 12:42:12  info unpack layer: sha256:0e5ecbaa74330c55752b884a5946537d2ae4a4da892ce817392948d9606c2daa
2020/06/04 12:42:18  info unpack layer: sha256:e424bb47af136bcc59cde73b78f3afe16b75b9a33b951e09bcfc639f6c5020e9
INFO:    Creating SIF file...
Starting up rserver...
+ SCRATCH_MOUNT=/path/user
+ export SINGULARITY_CACHEDIR=/path/user/.singularity
+ SINGULARITY_CACHEDIR=/path/user/.singularity
+ SINGULARITYENV_RSTUDIO_PASSWORD=password
+ singularity exec -c -H /home/user/home_rstudio:/home/user -B /tmp/tmp.brV9LOtAZA:/tmp -B /path/user -B /home/user/ood/data/sys/dashboard/batch_connect/sys/rstudio-srv/output/ce5b7031-3b8d-4f4d-9bfd-acd38226a724/Rprofile.site:/etc/R/Rprofile.site -B /home/user/ood/data/sys/dashboard/batch_connect/sys/rstudio-srv/output/ce5b7031-3b8d-4f4d-9bfd-acd38226a724/bin/auth:/bin/auth -B /home/user/ood/data/sys/dashboard/batch_connect/sys/rstudio-srv/output/ce5b7031-3b8d-4f4d-9bfd-acd38226a724/rsession.sh:/rsession.sh -B /home/user/ood/data/sys/dashboard/batch_connect/sys/rstudio-srv/output/ce5b7031-3b8d-4f4d-9bfd-acd38226a724/rsession.log:/rsession.log docker://rocker/rstudio:latest /usr/lib/rstudio-server/bin/rserver --www-port 35714 --auth-none 0 --auth-pam-helper-path /bin/auth --auth-encrypt-password 0 --rsession-path /rsession.sh
Discovered RStudio Server listening on port 35714!
Generating connection YAML file...
User defined signal 2

Since the job started successfully in the back end, could it be that there was an issue with properly communicating the state of the job to the front end?
Do you have an idea how we can pin down the cause of this issue?

Hey @rotaugenlaubfrosch!

The reason the session may be stuck in a pending state could be that your login nodes aren’t able to see the connection.yml generated on the compute nodes when starting a session.

Start by checking whether file system writes from your compute nodes are syncing quickly enough to your login nodes.
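A quick way to test that (just a rough sketch; the hostname and test file are placeholders for your own setup) is to write a timestamped file from a compute node and see how soon it shows up on the login node:

# On a compute node, write the current time to a file on the shared filesystem
# that holds the OOD data directory:
ssh compute-node 'date +%s > ~/ood_sync_test'

# Right afterwards, on the login node, compare that timestamp with "now";
# a large gap (or the file not appearing at all) points to slow sync:
echo "now: $(date +%s)"
cat ~/ood_sync_test 2>/dev/null || echo "file not visible yet"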

Someone had a similar problem not too long ago:

Hi @mario
Thanks for your reply! The job isn’t stuck at the pending state; the panel completely disappears, as if the job had ended. We have checked the shared FS, but it probably isn’t related, since synchronization happens quickly.

@rotaugenlaubfrosch

This may be related to an open issue where jobs fail silently: https://github.com/OSC/ondemand/issues/232

I think Jupyter is failing to listen on a port. Could you share your script.sh.erb? I will try to debug this locally.

Hi @mario
Thanks for your help. Usually RStudio works fine - this problem just appears occasionally.
Next time it happens, we’ll investigate again and I will follow up with you.
Best regards

We were able to reproduce this issue.
We found out that the problem was caused by slow data sync between different filesystems.
Apart from that, the following question arose: how does the front end check whether the application has started? In other words, how does the front end know when it should replace the “starting” session panel with the “running” or “connect to” panel?

@rotaugenlaubfrosch When on the My Interactive Sessions page, there is JavaScript in place that polls on a timer every 5 seconds to check the status and re-render the session panel.

Here’s where that happens:

Hi all, I’m working with @rotaugenlaubfrosch.
We are probably getting closer to a solution (or at least to an understanding of the issue).
We have 2 OOD instances in different locations and they share a path, let’s call it “global”.
The “ood” folder used for the “data” folder is on global:
In the first instance, let’s call it ood1, we start an interactive app: APPNAME
/global/ood/data/sys/dashboard/batch_connect/sys/APPNAME/output/SessionID1
the app starts without problems.
Then I connect to the second instance, ood2, in another location. I check “My Interactive Sessions” and see “You have no active sessions.” I reload the page and the session on the first instance (ood1) is gone as well. The job is still running but I have no sessions anymore.
Where is the data for “Sessions” stored? How are the 2 instances, which use the same /global/ood/data/sys/dashboard/batch_connect/sys/APPNAME folder, interfering with each other?

Is there anyone who can help answer my questions?
To recap:

  1. Which files does the web app check to determine whether there are any open sessions?
  2. Is it possible to have 2 different OOD instances share the same path for interactive applications without conflicts?

Thanks for your help

Hi, sorry for the delay.

By default it will look in this directory ~/ondemand/data/sys/dashboard/batch_connect/db/ for sessions.
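If you want to poke around yourself (a sketch; the session id below is just the one from your log above, and the exact attributes may vary, but cluster_id and job_id are the ones relevant to this thread), as far as I know each session is stored as a small JSON file named after its session id:

# List the per-session files:
ls ~/ondemand/data/sys/dashboard/batch_connect/db/

# Pretty-print one of them to see which cluster and job it refers to:
python -m json.tool ~/ondemand/data/sys/dashboard/batch_connect/db/ce5b7031-3b8d-4f4d-9bfd-acd38226a724
# among other attributes you should see "cluster_id" and "job_id"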

Yes, this is possible; we do this at OSC for our test, dev and production instances. They all share our home NFS directories. (Note that I specifically say home NFS directories. Because we’re relying on file permissions, every file we write is assumed to be in a home directory or similar. Your use of “global” may imply it’s a shared location that anyone can write to (like /tmp), and I want to be clear that every user needs their own unique directory to write to.)
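As a quick sanity check on your “global” path (a sketch; substitute the actual per-user directory OOD writes to), you can verify that the directory is owned by the user and not world-writable the way /tmp is:

# Check ownership and permissions of the OOD data directory:
ls -ld /global/ood/data
stat -c '%U %a %n' /global/ood/data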

However, all of the instances have the same cluster configurations. That is, if I start a job from instance1 on cluster1, then log in to instance2, cluster1 needs to exist on that instance with the same configuration.

My guess is that, in your case, instance2 saw the job created by instance1, queried the cluster (as defined by its cluster configs and the .cluster_id attribute of the file), could not find it, and so assumed it was ‘complete’.
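To illustrate what that looks like from the scheduler’s side (a sketch with a placeholder job id; the exact LSF output may differ):

# On a login node of the cluster actually running the job, the job is visible:
bjobs 123456

# On the other cluster (the one instance2's configuration points at), the same
# query reports something like "Job <123456> is not found", which OOD then
# interprets as the session having completed:
bjobs 123456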

Thanks for your answer. It makes sense now. The “cluster_id” is the same, but the different clusters (managed with LSF multi-cluster) don’t share the job queue, and when you run bjobs from a cluster you only see the jobs running on that cluster. So, as you said, instance2 checks the job (I guess based on the job_id stored in the “db” folder) and, not seeing it, assumes it was completed. In principle you can pass the option “-m cluster_name” to specify the cluster to query, but I’m not sure this is used here. So a couple of questions:

  1. Would it be possible to use the “cluster_id” value with the “-m” option in LSF (bjobs -m cluster_id)?
  2. I assume cluster_id relates to the cluster where the OOD instance that starts the job request runs, not the cluster that actually executes the job, right? Would it be possible to customise the info stored as “cluster_id” (again, to be used with the “-m” option)?

cluster_id is OOD parlance and comes from the name of the cluster.yml file. It should be the same as the cluster, but doesn’t necessarily have to be.

You can specify the cluster attribute in the configuration, which will trigger the -m flag being populated, but apparently that’s not documented, so that’s a miss on our side.

Here’s an LSF configuration example that illustrates both of these.

# cluster_id will be "owens_cluster" not "owens"!  So all references to this
# cluster in any form.yml will have to be "owens_cluster".
#
# /etc/ood/config/clusters.d/owens_cluster.yml
---
v2:
  metadata:
    title: "Owens"
  login:
    host: "owens.osc.edu"
  job:
    adapter: "lsf"
    bindir: "/path/to/lsf/bin"
    libdir: "/path/to/lsf/lib"
    envdir: "/path/to/lsf/conf"
    serverdir: "/path/to/lsf/etc"
    # here is the missing cluster attribute
    cluster: "owens"

Sorry for the late reply.
I’m not sure I got it completely.
In our case we have 3 sites but we can make the example with just 2:

  1. Cluster1, with an OOD instance on a login node of cluster1
  2. Cluster2, with a second OOD instance on a login node of cluster2.

Now, from Cluster2, to check jobs on Cluster1 you need to run bjobs -m Cluster1.
That’s why, if the cluster name is part of the interactive app info (like in cluster_id), the session can check the correct cluster.
From the configuration you shared, this seems to be local to the OOD instance, so, from what I understand, cluster2 will run “bjobs -m Cluster2” and cluster1 will run “bjobs -m Cluster1” (because they look at the “local” name). Is my understanding correct?
The way it may work for us is:

  1. Submit the interactive app job specifying the cluster (bsub -cluster Cluster1)
  2. The information gets stored in “cluster_id” (cluster_id=Cluster1)
  3. The session can check the job using “bjobs -m cluster_id” (so this should work from any OOD instance).

Do you think that’s something feasible?

You’re understanding is close. In short: the cluster.d filenames should match the v2.job.cluster attribute. And you should probably have 3 of them, one for each cluster.

Here’s a very short example for cluster_1 (that’s missing all the other stuff above)

# /etc/ood/config/clusters.d/cluster_1.yml
v2:
  job:
    # cluster attribute here is the same as the filename
    cluster: "cluster_1"

It’s very important here that the cluster attribute here is the same as the filename. In your 3 cases, here’s where they come from.

  1. bsub -m cluster_1 comes from the v2.job.cluster attribute. Any bjobs or bsub command will use the -m option with this string if it’s there (it wasn’t before, that’s why things are broken for you).
  2. cluster_id is an OOD reference, and it comes from the filename. It’s how OOD keeps track of clusters.
  3. Combination of the two. OOD will read cluster_id from the file as cluster_1 and will configure itself from /etc/ood/config/clusters.d/cluster_1.yml. That is the relationship between cluster_id and the filename. In this yml file it’ll read v2.job.cluster as cluster_1, so it will run a bsub command with -m cluster_1 because that cluster configuration is available.
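Putting that together as a concrete sketch (the filenames and job id are placeholders; the scoped command is inferred from the -m discussion above):

# cluster_id comes from the filename under clusters.d:
ls /etc/ood/config/clusters.d/
# cluster_1.yml  cluster_2.yml  cluster_3.yml

# A session whose db file records cluster_id "cluster_1" is configured from
# cluster_1.yml; its v2.job.cluster is "cluster_1", so status checks for that
# job end up scoped to the right cluster, roughly:
bjobs -m cluster_1 123456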