Launch Scripts: Trap Errors

shawn.doughty · February 21, 2019, 2:03am

In the workflow for the launcher scripts, what is the best point to trap an error and generate a message for the user? Up until the point where the script gets a scheduler allocation and launches, we have all seen the red box at the top.

For all our various launchers (jupyter, matlab, etc.), I would like to codify and test for all the typical things that can go wrong on an HPC cluster. Usually these are periodically monitored by something like nagios or icinga, but interactive users (ssh or OOD) seem to notice things sooner. Testing for and trapping errors will give a better experience.

For example, a software repo being offline, a license being unavailable or fully checked out, a full home directory, a parallel file system over quota, an unavailable path, etc. could each be tested for, with an appropriate message output. The same goes for anything that would prevent the app from launching or, if it does launch, from working afterwards.

Ideas?

efranz · February 21, 2019, 8:33pm

I would need to fix https://github.com/OSC/ood-dashboard/issues/451 but then you could add a block of ruby code to an interactive app’s submit.yml.erb that did the tests (or executed an external script that did the tests) and raised an exception with an explanation.
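The "executed an external script" variant could be sketched roughly as below. This is only an illustration: the helper name `run_preflight!` and any script path passed to it are hypothetical, and whether the raised message reaches the user from submit.yml.erb depends on the linked issue being fixed.

```ruby
require "open3"

# Hypothetical helper: runs an external check command and raises with the
# command's combined stdout/stderr if it exits non-zero.
def run_preflight!(*command)
  output, status = Open3.capture2e(*command)
  return output if status.success?

  raise StandardError, "Pre-launch check failed: #{output.strip}"
rescue Errno::ENOENT => e
  # The check command itself could not be found or started.
  raise StandardError, "Pre-launch check could not run: #{e.message}"
end
```

Inside an ERB block this might be called as `run_preflight!("/path/to/site_checks.sh")`, where the script path is a placeholder for whatever site-specific check exists.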

Right now, if you raise an exception in submit.yml.erb it will unfortunately be unhandled. If you raise an exception in an erb file in the template directory of an interactive app, it will be displayed to the user, but files that are never used will be copied to the user’s directory.

So until this is fixed you could place an erb file in the template directory and raise an exception. It would not be ideal for a case where home directory is full. But for that case there is always https://osc.github.io/ood-documentation/master/customization.html#disk-quota-warnings-on-dashboard.

I can add this issue to a list of 1.5 bug fixes so when we fix these we can just do a 1.5 patch release so it is easier to get the fix.

efranz · February 21, 2019, 8:39pm

Another option is a monkey patch that overrides the default BatchConnect::Session#save method https://github.com/OSC/ood-dashboard/blob/baea5558a4e55e8f7ca49599670e6731ad2da94b/app/models/batch_connect/session.rb#L137. Then you could place that in a custom dashboard initializer in /etc/ood/config and that would just affect all of the interactive apps. Monkey patches are of course brittle solutions though.
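A rough sketch of what such a monkey patch might look like, using `Module#prepend`. The `PreflightChecks` module and its `required_paths` setting are hypothetical names, and the `BatchConnect::Session` class below is only a stand-in so the sketch runs on its own; it does not reproduce the real method's signature or behavior. How the dashboard presents an exception raised from `#save` is not confirmed in this thread.

```ruby
# Stand-in for the real class so this sketch is self-contained. In a dashboard
# initializer the real BatchConnect::Session would already be loaded.
module BatchConnect
  class Session
    def save(*args, **kwargs)
      true
    end
  end
end

# Hypothetical module that runs checks before delegating to the original #save.
module PreflightChecks
  class << self
    # Paths that must exist before any interactive session is saved.
    attr_accessor :required_paths
  end
  self.required_paths = []

  def save(*args, **kwargs, &block)
    missing = PreflightChecks.required_paths.reject { |path| File.exist?(path) }
    unless missing.empty?
      raise StandardError, "Required paths unavailable: #{missing.join(', ')}"
    end

    super # forwards the original arguments to the original #save
  end
end

BatchConnect::Session.prepend(PreflightChecks)
```

Because the override forwards its arguments unchanged with `super`, it does not need to know the original method's parameters, which makes it somewhat less brittle than redefining `#save` outright.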

It is worth stating that the solutions I’ve mentioned thus far just make use of what is available right now, or with minor fixes, to enable validations on submitting the form. There may be a more appropriate way that could also leverage Rails’ built-in model validations, but that would take more thought.

shawn.doughty · February 26, 2019, 8:45pm

Thanks for all the ideas on how to deal with this. I’ll try something out in my free time, or if it is fixed beforehand as a bug fix (or even a new feature), all the better.

jeff.ohrstrom · October 23, 2019, 8:02pm

This seems to work as expected now. I was able to make this simple test where I just raise an error unless a temp file exists, so you could likely make all sorts of tests here.

<%-
  raise StandardError, "This is just a test!" unless File.exist?("/tmp/error_file")
-%>
---
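Extending the single test above to several checks might look like the following sketch. The helper names (`can_write?`, `preflight_problems`, `preflight!`) are hypothetical, and the write probe is only an approximation: on some file systems a full quota may not surface until later than a small write.

```ruby
require "tempfile"

# Tries to create and write a small temporary file in dir. Returns false on
# any system-level failure (missing directory, no permission, no space).
def can_write?(dir)
  Tempfile.create("ood-preflight", dir) do |file|
    file.write("x")
    file.flush
  end
  true
rescue SystemCallError
  false
end

# Returns a list of human-readable problems; an empty list means all passed.
def preflight_problems(required_paths: [], writable_dirs: [])
  problems = []
  required_paths.each do |path|
    problems << "Required path is unavailable: #{path}" unless File.exist?(path)
  end
  writable_dirs.each do |dir|
    problems << "Cannot write to directory: #{dir}" unless can_write?(dir)
  end
  problems
end

# Raises one error listing every failed check, so the user sees all of them.
def preflight!(**checks)
  problems = preflight_problems(**checks)
  raise StandardError, problems.join("; ") unless problems.empty?
end
```

In an ERB block this could be called as `preflight!(required_paths: ["/path/to/software"], writable_dirs: [Dir.home])`, with the paths replaced by whatever a given app actually depends on.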
