In the workflow for the launcher scripts, what is the best point to trap an error and generate a message for the user? Up until the point where the script gets a scheduler allocation and launches, we have all seen the red box at the top.
For all our various launchers (Jupyter, MATLAB, etc.), I would like to codify and test for all the typical things that can go wrong on an HPC cluster. These are usually monitored periodically by something like Nagios or Icinga, but interactive users (ssh or OOD) seem to notice problems sooner. Testing for and trapping these errors would give users a better experience.
For example, a software repo that is offline, a license that is unavailable or has no free seats, a full home directory, a parallel file system that is over quota, an unavailable path, etc. could each be tested for and an appropriate message output. That covers anything that would prevent the app from launching or, if it does launch, from working afterwards.
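To make the idea concrete, here is a rough sketch in Python of what such pre-launch checks might look like. Everything in it is an assumption for illustration: the function names, the `/opt/apps` path, and the 100 MiB threshold are made up, and it only covers the "path unavailable" and "file system full" cases. Note that `shutil.disk_usage` reports free space on the file system, not a per-user quota, so a real quota check would need to call whatever quota tool the site uses.

```python
import os
import shutil
import sys


def check_path(path):
    """Return an error message if a required directory is missing, else None."""
    if not os.path.isdir(path):
        return f"Required path is unavailable: {path}"
    return None


def check_free_space(path, min_free_bytes):
    """Return an error message if the file system holding `path` is too full."""
    try:
        free = shutil.disk_usage(path).free
    except OSError as exc:
        return f"Cannot check free space on {path}: {exc}"
    if free < min_free_bytes:
        return (f"{path} has only {free // 2**20} MiB free; "
                f"at least {min_free_bytes // 2**20} MiB is needed")
    return None


def run_preflight(checks):
    """Run every (function, args) pair and collect all failure messages."""
    errors = []
    for check, args in checks:
        message = check(*args)
        if message:
            errors.append(message)
    return errors


def main():
    """Run a hypothetical set of checks; return 1 if any failed, else 0."""
    home = os.path.expanduser("~")
    checks = [
        (check_path, ("/opt/apps",)),            # hypothetical software repo mount
        (check_free_space, (home, 100 * 2**20)),  # hypothetical 100 MiB minimum
    ]
    errors = run_preflight(checks)
    for message in errors:
        print(f"ERROR: {message}", file=sys.stderr)
    return 1 if errors else 0
```

A launcher could call `sys.exit(main())` before starting the app, so that a non-zero exit status and the messages on stderr are available to whatever step ends up reporting errors to the user. Collecting all failures instead of stopping at the first one means the user sees every problem in one pass.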