Fresh OOD 1.8 with Dex installation on RHEL 8.2 with PBS. Users can log in to OOD. Shells can be accessed via OOD. From the command line (via OOD and a regular terminal), qstat works for root and for users.
When I try to start a job via an interactive session app, I get this error in /var/log/ondemand-nginx/user/error.log. The IP address is correct. The error shown to the user in the browser is essentially the same.
App 2078 output: [2020-09-17 12:06:23 +1000 ] ERROR "ERROR: OodCore::JobAdapterError - Warning: Permanently added 'pbs.domain.com.au,129.x.x.x' (ECDSA) to
the list of known hosts.\r\nuser@pbs.domain.com.au: Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password,hostbased)."
I found this post, which encouraged adding the keys to /etc/ssh/ssh_known_hosts. Having now done that for the host pbs.domain.com.au, I just get the remainder of the error:
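For anyone hitting the same host key warning: a typical way to pre-populate the system-wide known hosts file is with ssh-keyscan. This is a sketch using the hostname from this thread; substitute your own login node, and verify the keys out-of-band before trusting them.

```shell
# Append the PBS login node's host keys to the system-wide known_hosts
# so per-user "Permanently added ..." warnings stop appearing.
# pbs.domain.com.au is the example host from this thread.
ssh-keyscan -t ecdsa,ed25519,rsa pbs.domain.com.au >> /etc/ssh/ssh_known_hosts
```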
I'm looking at my working system (OOD 1.7, CentOS 8.2, Basic Auth via PAM and SSSD) and I'm wondering if my new setup is failing because SSSD carries authentication through to the cluster - it's the same auth system we use there. Dex is configured against the same AD, but the cluster doesn't know about Dex. I presumed it wouldn't need to - but maybe it does?
I don't think this has to do with auth or PAM or SSSD or Dex.
First I want to make sure this is expected. That is, that you expect OOD to ssh into pbs.domain.com.au and submit the job from that node, not from the OOD web server node (some folks do this so they can keep the environment and scheduler binaries on another host; they ssh into that host and run qsub and so on).
If that is the case - that you want to ssh into another node to execute the qsub command - then you probably need to set up host-based authentication between the OOD web server and this remote host. Otherwise every single user will have to generate their own keys, and that's a big pain. There are lots of resources on the web on how to do this; just google "host based authentication". You already allow it (because hostbased is listed in that error), now you just need to set up trust by creating and adding the keys.
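A rough sketch of the pieces host-based auth involves, using the hostnames from this thread as placeholders (the OOD server name is my assumption; check your distro's sshd documentation for the full procedure):

```
# On the login node (pbs.domain.com.au), /etc/ssh/sshd_config:
HostbasedAuthentication yes

# On the login node, /etc/ssh/shosts.equiv - trust the OOD host:
ondemand.domain.com.au

# On the OOD web server, /etc/ssh/ssh_config:
Host pbs.domain.com.au
    HostbasedAuthentication yes
    EnableSSHKeysign yes
```

You also need the OOD server's host public keys in the login node's /etc/ssh/ssh_known_hosts so the login node can verify the signed host key.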
If this is not the case - that you want to execute qsub on the OOD web server itself - then you need to check what you're doing in bin_overrides (in the cluster.d config file) with a wrapper script that seems to be sshing. Or you could be using submit_host.
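For reference, these options live in the cluster.d YAML. A sketch of where they sit (file name, exec path, and wrapper path are examples, not from this thread):

```yaml
# /etc/ood/config/clusters.d/my_cluster.yml (sketch)
v2:
  metadata:
    title: "My Cluster"
  job:
    adapter: "pbspro"
    host: "pbs.domain.com.au"
    exec: "/opt/pbs"            # example path to the PBS install
    # submit_host makes OOD ssh to this node to run qsub/qstat there:
    # submit_host: "pbs.domain.com.au"
    # bin_overrides can point scheduler commands at wrapper scripts:
    # bin_overrides:
    #   qsub: "/usr/local/bin/qsub_wrapper"
```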
@jeff.ohrstrom we are using submit_host because I thought it was necessary. We have two ssh_hosts. I can't remember exactly where I read it, but I think I was having problems getting ssh to work because we have multiple login nodes behind a DNS round robin, hence the introduction of those values in the cluster.d config.
I have just successfully executed qsub -I from the command line of the new OnDemand server, and it was fine.
OK, I see. You're trying to use the linux_host adapter? With that adapter, yes, you need submit_host and ssh_hosts because it works over SSH. It's not the PBS batch scheduler; it's just sshing to a login node and starting processes.
So: if you're trying to use this adapter, then you'll need host-based authentication. You need this to allow users to ssh freely from the OnDemand server to your login hosts without being prompted for passwords. You could require every user to generate ssh keys without passwords, but that seems over-burdensome on your users.
If you're not trying to use this adapter - if you want to schedule through the PBS server - then you don't need submit_host, as long as you have all the PBS binaries installed and set up on the OnDemand server.
To sum up and for clarity:
submit_host is required for the linux_host adapter because it works over SSH and needs to know where to ssh into. Because this host is likely a virtual name, it also needs a list of ssh_hosts to poll to get the status of the "job" it submitted ("job" here in quotes because it just shelled somewhere and tmux backgrounded a process group).
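A sketch of what that looks like for the linux_host adapter, with the round-robin alias as submit_host and the real nodes listed in ssh_hosts (hostnames are examples; the adapter has further required settings not shown here):

```yaml
# Sketch of a linux_host cluster config
v2:
  metadata:
    title: "Login Nodes"
  job:
    adapter: "linux_host"
    submit_host: "login.domain.com.au"  # round-robin DNS name to ssh into
    ssh_hosts:                          # real hosts behind the alias,
      - login1.domain.com.au            # polled for "job" status
      - login2.domain.com.au
```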
submit_host for all other adapters in 1.8 and beyond is for when a site doesn't want to install a batch scheduler's libraries on the OnDemand server. So instead of running qsub, qstat, sbatch, squeue, or any other PBS/Slurm/Torque/etc. command on the OnDemand server, we instead ssh into a login node and run the command there, where all the libraries are installed and configured.
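So for a scheduler adapter, the same key just redirects where the commands run. A minimal sketch for pbspro (hostname from this thread; omit submit_host entirely if the PBS binaries live on the OnDemand server):

```yaml
# Sketch: pbspro adapter where qsub/qstat run on a login node over ssh
# instead of requiring PBS libraries on the OnDemand server.
v2:
  job:
    adapter: "pbspro"
    host: "pbs.domain.com.au"
    submit_host: "pbs.domain.com.au"
```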
OK, so like a previous problem I had, I don't think this answer nails my problem, but it did indicate what I had done wrong.
I have adapter: "pbspro" so I commented out the submit_host and the ssh_hosts, restarted everything, and it just worked. I don't remember why I had added them to my dev installation, but it didn't stop things from working so I presumed they were necessary. Once prod is up and running, I'll remove them from dev and see if it still works. (Because, obviously, until prod is in production, it's dev. And while prod is in dev, dev is in prod. Everything here is normal, nothing to see here.)