Recommendations for multiple clusters

We are deploying a new cluster, and the filesystem will be completely separate from the old cluster. We currently have OOD deployed on the old cluster and plan to deploy it for the new cluster as well.

What do you recommend for deployment in this situation, where the filesystems are separate? One OOD instance where the host is connected to both filesystems? Or separate instances of OOD?

I understand that both of these are viable options, but am more curious to hear others’ experiences doing one or the other in production.

Are you talking about HOME directories or NFS shared folders like project and scratch?

Home, project, and scratch directories all on the same filesystem per cluster. In other words, between the two clusters, there will be no shared file paths or overlap in terms of storage location.

Does that answer your question? Or did I miss it?

Yea that’s exactly it. If they have no shared folders, I’m not sure if you can run 1 single instance (or I may lack the imagination for a solution).

You can set the OOD_DATAROOT to be something other than ~/ondemand/data, but it seems like in this case you’d need a different setting per scheduler. Meaning, if one OOD instance could access both schedulers (A and B) and both file systems, jobs run on scheduler A could read and write files in the NFS data root, but jobs run on scheduler B could not, because the shared filesystem isn’t really shared.
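To make that concrete, here is a minimal Python sketch of the visibility problem. It is not OOD code: the `Cluster` type, the mount points `/fs_a` and `/fs_b`, the user `alice`, and the helper names are all hypothetical, made up for illustration. It only models which paths each side can see.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    name: str
    mounts: frozenset  # filesystem roots visible on this cluster's compute nodes


def visible(path, mounts):
    """Return True if path lives under one of the given mount points."""
    return any(path == m or path.startswith(m.rstrip("/") + "/") for m in mounts)


def usable_data_root(data_root, web_host_mounts, cluster):
    """A data root only works if BOTH the web host and the cluster can see it."""
    return visible(data_root, web_host_mounts) and visible(data_root, cluster.mounts)


# Hypothetical layout: the web host mounts both filesystems,
# but each cluster only mounts its own.
cluster_a = Cluster("A", frozenset({"/fs_a"}))
cluster_b = Cluster("B", frozenset({"/fs_b"}))
web_host_mounts = frozenset({"/fs_a", "/fs_b"})

# One data root for the whole instance, living on cluster A's filesystem.
data_root = "/fs_a/home/alice/ondemand/data"

results = {
    c.name: usable_data_root(data_root, web_host_mounts, c)
    for c in (cluster_a, cluster_b)
}
print(results)  # {'A': True, 'B': False}
```

Under these assumptions a single data root only ever works for one of the two clusters, which is why you’d need a per-scheduler setting.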

Though, I believe SDSC (San Diego Supercomputer Center) may have a setup like you’re describing - so they may have some insight into this. In fact, this may be more common than I’m thinking, so someone else may be able to shed some light on their setup.

@jeff.ohrstrom Apologies for re-raising this thread, but this is an area we’re interested in at our site, too. Am I not correct in remembering that OSC has more than one cluster itself? (Three, I think?)

Do they all have a common filesystem for this purpose? How much other infrastructure do they share? I remember seeing in a demo, for example, that the jobs view can show jobs from all the clusters at once.

No issues @ikirker - it’s not solved yet!

But yes, OSC runs 2 Slurm clusters (we’ve been in flux for the last year or so migrating from Torque, so we had times where we had 1 Torque and 3 Slurm clusters), but they share all the file systems. I think that’s the issue here - different clusters with completely separate file systems, which we do not have at OSC. We have HOME, scratch and project directories that are accessible from any cluster (even when we were migrating the scheduler).

For me, this settles it: We have two separate filesystems, so we’ll have to use two separate OOD instances.

Thanks,
Nick

Revisiting this thread based on some new use cases…

Is there any plan to support multiple clusters with different filesystems in the future? I don’t know how heavy of a lift this would be architecturally.

I’m thinking it would be nice to link different apps to different clusters, based on the hardware. For example, maybe you have an old cluster that you want to just run interactive desktops for classroom training, and you don’t want to run these on your main production cluster. Of course, you can always run a separate OOD instance for that, but administratively, it would be convenient to have it all behind one pane of glass so that users don’t have to go to different URLs, and so admins don’t have to maintain multiple portals.

Have you had anyone wanting to do something similar? Should I submit a feature request on GitHub?

You should always submit a feature request to GitHub. I tend to remember GitHub tickets much more easily than Discourse topics. Which is to say - GitHub tickets are easier for us to manage and keep track of. Discourse topics seem to get lost in a lot of noise.

That said - there is a ticket for this already. Give it a +1 and it may get bumped in priority. Though XSEDE (or ACCESS as it will be) is trying to have an OnDemand instance that can talk to several service providers - so it’ll come as a part of that effort sometime, though I can’t say when. As you indicate, it is sort of a heavy lift because the assumption of 1 HOME directory is sort of baked into a lot of places.

Cool, I gave it a +1. Multiple service providers sounds like a great idea. Looking forward to seeing this feature someday.

@ndusek I may have a patch you can apply to get 1 instance working with multiple file systems. Are you interested in such a thing?

And/or @ikirker - same message - I may have a patch for multiple filesystems.

@jeff.ohrstrom Yes, I would be interested in having a look. I am actually going to be doing a new deployment of OOD over the next couple weeks, so now might be a good time to test out something like this.

I’ve updated the same GH ticket. Please follow it to see any updates.