Recommendations for multiple clusters

We are deploying a new cluster, and the filesystem will be completely separate from the old cluster. We currently have OOD deployed on the old cluster and plan to deploy it for the new cluster as well.

What do you recommend for deployment in this situation, where the filesystems are separate? One OOD instance where the host is connected to both filesystems? Or separate instances of OOD?

I understand that both of these are viable options, but am more curious to hear others’ experiences doing one or the other in production.

Are you talking about HOME directories or NFS shared folders like project and scratch?

Home, project, and scratch directories all on the same filesystem per cluster. In other words, between the two clusters, there will be no shared file paths or overlap in terms of storage location.

Does that answer your question? Or did I miss it?

Yeah, that’s exactly it. If they have no shared folders, I’m not sure you can run a single instance (or I may lack the imagination for a solution).

You can set OOD_DATAROOT to something other than ~/ondemand/data, but in this case it seems like you’d need a different setting per scheduler. Meaning, even if you could access both schedulers (A and B) and both filesystems from one OOD instance, jobs run on scheduler A could read and write files to the NFS data root, but jobs run on scheduler B could not, because the shared filesystem isn’t really shared.
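To sketch that limitation: OOD_DATAROOT is set once per app (for example in the dashboard’s env file), not once per cluster. The mount path and filesystem names below are hypothetical, just to illustrate the problem:

```shell
# /etc/ood/config/apps/dashboard/env
#
# OOD_DATAROOT overrides the default of ~/ondemand/data, but it is a
# single value for the whole dashboard app -- there is no per-cluster
# variant. The path below assumes a mount that exists only on
# filesystem A (hypothetical name):
OOD_DATAROOT="/fs_a/$USER/ondemand/data"
```

Jobs submitted to cluster B would never see that path, since filesystem A isn’t mounted there.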

Though, I believe SDSC (San Diego Supercomputer Center) may have a setup like the one you’re describing, so they may have some insight into this. In fact, this may be more common than I’m thinking, so someone else may be able to shed some light on their setup.


@jeff.ohrstrom Apologies for re-raising this thread, but this is an area we’re interested in at our site, too. Am I right in remembering that OSC has more than one cluster itself? (Three, I think?)

Do they all have a common filesystem for this purpose? How much other infrastructure do they share? I remember seeing in a demo, for example, that the jobs view can show jobs from all the clusters at once.

No issues @ikirker - it’s not solved yet!

But yes, OSC runs 2 Slurm clusters (we’ve been in flux for the last year or so migrating from Torque, so there were times when we had 1 Torque and 3 Slurm clusters), but they share all the filesystems. I think that’s the issue here: different clusters with completely separate filesystems, which we do not have at OSC. We have HOME, scratch, and project directories that are accessible from any cluster (even while we were migrating the scheduler).

For me, this settles it: We have two separate filesystems, so we’ll have to use two separate OOD instances.