OOD 1.8, apps aren't loading, looks like ssh challenge error?

Fresh OOD 1.8 installation with Dex on RHEL 8.2 with PBS. Users can log in to OOD, and shells can be accessed via OOD. From the command line (both via OOD and a regular terminal), qstat works for root and for regular users.

When I try to start a job via an interactive session app, I get the following error in /var/log/ondemand-nginx/user/error.log. The IP address is correct, and the error shown to the user in the browser is essentially the same.

App 2078 output: [2020-09-17 12:06:23 +1000 ] ERROR "ERROR: OodCore::JobAdapterError - Warning: Permanently added 'pbs.domain.com.au,129.x.x.x' (ECDSA) to 
the list of known hosts.\r\nuser@pbs.domain.com.au: Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password,hostbased)."

Any idea what I’ve done wrong?

I found a post which suggested adding the host keys to /etc/ssh/ssh_known_hosts. Having now done that for the host pbs.domain.com.au, I just get the remainder of the error:

user@pbs.domain.com.au,: Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password,hostbased).

Shouldn’t Dex be looking after this? In 1.7 we used basic auth, and all of this was handled for us.

I’m looking at my working system (OOD 1.7, CentOS 8.2, basic auth via PAM and SSSD) and wondering if my new setup worked because SSSD carried authentication through to the cluster, since it’s the same auth system we use there. Dex is configured against the same AD, but the cluster doesn’t know about Dex. I presumed it wouldn’t need to, but maybe it does?

I don’t think this has to do with auth or PAM or SSSD or Dex.

First I want to make sure this is expected. That is, that you do in fact want OOD to ssh into pbs.domain.com.au and submit the job from that node rather than from the OOD web server node. (Some folks do this so they can keep the environment and binaries on another host; they ssh into that host and run qsub and so on.)

If that is the case, and you do want to ssh into another node to execute the qsub command, then you probably need to set up host-based authentication between the OOD web server and this remote host. Otherwise every single user will have to generate their own keys, which is a big pain. There are lots of resources on the web on how to do this; just search for ‘host based authentication’. Your sshd already allows it (hostbased is listed in the Permission denied message), so now you just need to establish trust by creating and exchanging the host keys.
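As a rough sketch, host-based authentication between the two machines involves changes on both sides. These are standard OpenSSH settings; the OOD server name (ood.domain.com.au) is a placeholder I’ve made up for illustration:

```
# On the login node (pbs.domain.com.au), allow host-based auth in /etc/ssh/sshd_config:
HostbasedAuthentication yes

# Still on the login node, list the trusted client host in /etc/ssh/shosts.equiv:
ood.domain.com.au

# The login node also needs the OOD server's host key in /etc/ssh/ssh_known_hosts.
# ssh-keyscan can collect it (verify the fingerprints out of band):
#   ssh-keyscan ood.domain.com.au >> /etc/ssh/ssh_known_hosts

# On the OOD web server, enable the client side in /etc/ssh/ssh_config:
Host *.domain.com.au
    HostbasedAuthentication yes
    EnableSSHKeysign yes
```

After reloading sshd on the login node, a test like `ssh pbs.domain.com.au true` from the OOD server (as a regular user) should succeed without a password prompt.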

If this is not the case, and you want to execute qsub on the OOD web server itself, then you need to check what you’re doing in bin_overrides (in the cluster.d config file) with a wrapper script that seems to be sshing. Or you could be using submit_host.


@jeff.ohrstrom we are using submit_host because I thought it was necessary. We have two ssh_hosts. I can’t remember exactly where I read it, but I think I was having problems getting ssh to work because we have multiple login nodes behind a DNS round robin, hence the introduction of those values in the cluster.d config.

I have just successfully executed qsub -I from the command line of the new OnDemand server, and it was fine.

OK, I see. You’re trying to use the linux_host adapter? With that adapter, yes, you need submit_host and ssh_hosts because it works over SSH. It’s not using the PBS batch scheduler; it’s just sshing to a login node and starting processes.

So: if you’re trying to use this adapter, then you’ll need host-based authentication. You need it so users can ssh freely from the OnDemand server to your login hosts without being prompted for passwords. You could require that everyone generate passwordless ssh keys, but that seems over-burdensome on your users.

If you’re not trying to use this adapter, and you want to schedule through the PBS server, then you don’t need submit_host as long as all the PBS binaries are installed and set up on the OnDemand server.

To sum up and for clarity:

submit_host is required for the linux_host adapter because it works over SSH and needs to know where to ssh into. Because this host is likely a virtual name, the adapter also needs a list of ssh_hosts to poll for the status of the “job” it submitted (“job” in quotes because it just shelled somewhere and tmux backgrounded a process group).
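For reference, a cluster.d fragment using the linux_host adapter might look like the sketch below. The filename, title, and hostnames are placeholders based on this thread (a round-robin alias plus the real login nodes behind it):

```yaml
# /etc/ood/config/clusters.d/my_cluster.yml (hypothetical example)
v2:
  metadata:
    title: "My Cluster"
  job:
    adapter: "linux_host"
    submit_host: "pbs.domain.com.au"   # virtual/round-robin name to ssh into
    ssh_hosts:                         # real hosts polled for "job" status
      - login1.domain.com.au
      - login2.domain.com.au
```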

For all other adapters in 1.8 and beyond, submit_host is for when a site doesn’t want to install a batch scheduler’s libraries on the OnDemand server. Instead of running qsub, qstat, sbatch, squeue, or any other PBS/Slurm/Torque/etc. command on the OnDemand server, we ssh into a login node and run the command there, where all the libraries are installed and configured.
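As a sketch of that second case, a pbspro cluster config that delegates command execution to a login node might look like this. The hostnames and the PBS installation path are placeholders, not values from this thread’s actual config:

```yaml
# /etc/ood/config/clusters.d/my_cluster.yml (hypothetical example)
v2:
  metadata:
    title: "My Cluster"
  job:
    adapter: "pbspro"
    host: "pbs.domain.com.au"            # PBS server
    exec: "/opt/pbs"                     # PBS installation prefix on the login node
    submit_host: "login1.domain.com.au"  # qsub/qstat run here over SSH
```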


Ok, so like a previous problem I had, I don’t think this answer nails my problem exactly, but it did point out what I had done wrong.

I have adapter: "pbspro", so I commented out submit_host and ssh_hosts, restarted everything, and it just worked. I don’t remember why I had added them to my dev installation, but since they didn’t stop things from working I presumed they were necessary. Once prod is up and running, I’ll remove them from dev and see if it still works. (Because, obviously, until prod is in production, it’s dev. And while prod is in dev, dev is in prod. Everything here is normal, nothing to see here.)
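For anyone landing here later, the working setup described above amounts to a cluster.d file along these lines, with submit_host and ssh_hosts removed. The hostname is from this thread; the filename, title, and commented lines are illustrative:

```yaml
# /etc/ood/config/clusters.d/my_cluster.yml (hypothetical example)
v2:
  metadata:
    title: "My Cluster"
  job:
    adapter: "pbspro"
    host: "pbs.domain.com.au"
    # submit_host: ...   # not needed: PBS binaries live on the OnDemand server
    # ssh_hosts: [...]   # only used by the linux_host adapter
```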