TensorBoard for Open OnDemand

Has anyone gotten TensorBoard to work with Open OnDemand? It’s the visualization interface for TensorFlow. I have a staffer who’d like to use it for classes, and I noticed it’s not on the list of apps that have been ported so far.

TIA - Susan Litzinger
PSC

Hi Susan,
Yes, it’s still in development but mostly working: it allocates a node, starts TensorBoard with the log directory you specify, and lets you attach to the TensorBoard web server. It currently has a small problem where the link given after it allocates the node does not browse to the TensorBoard server correctly, but you can still connect if you enter servername:port in the browser manually. If you would like a copy, I can post the current draft with this caveat.

Louis

Hi,

I would love to see the draft.

Thanks,
Bob

Hi Bob,
Current version is here:

Louis

@lcapps thanks for sharing!

Hi Louis,

I very much appreciate your offering up your version of TensorBoard for Open OnDemand. However, I get hung up on step 1. The directions tell me to download this file:

wget https://gitlab-master.nvidia.com/lcapps/cluw/-/archive/master/cluw-master.tar.gz

but gitlab-master.nvidia.com no longer exists, and I have googled every way I can think of for something like NVIDIA and cluw, and nothing is coming up. Do you have an alternate location for the file? Thanks in advance.

Susan

Louis - we were able to get it working on our OnDemand instance. We’ll be interested in hearing if you’re able to get the website to load eventually, rather than having to connect to the node using an SSH tunnel. Thanks for sharing!

Dori
UB CCR

@dsajdak and @lcapps we just asked one of our interns to look into setting up TensorBoard as a web service. If we have success, we’ll post an example of how we did it, like the RStudio and Jupyter examples.

@dsajdak and @rodgers.355, this is great to hear. I haven’t had a chance to figure out why the web service link does not work correctly, so it will be good to get it working.

The easiest way to do this is to use the Jupyter notebook and the jupyter-tensorboard Python app. TensorBoard shows up as a kernel in the Jupyter notebook pulldown.

Hi there!

We just recently pushed our Tensorboard OnDemand app: https://github.com/stanford-rc/sh_ood-apps/tree/master/sh_tensorboard

It’s based on a native installation of Tensorflow, loadable through a module system (we use Lmod). We don’t use containers for that, but that part can easily be customized.

One particularly interesting feature of that TensorBoard OnDemand app (for me at least! :)) is that it implements an authenticating reverse proxy. Because TensorBoard doesn’t provide any kind of authentication mechanism for its web interface, in a shared environment anybody who knows the hostname and port number of a running TensorBoard instance can connect to it.

To mitigate this, we implemented an authentication mechanism that basically sets a browser cookie in the OnDemand interactive app page (the “Connect to Tensorboard” button does this) which is then checked by the authenticating reverse proxy that controls access to the Tensorboard web interface. Without that cookie, access to the Tensorboard web interface is refused. And if the cookie is ever lost, users can re-create it by visiting the “My Interactive Sessions” page and clicking the “Connect” button again.
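For anyone wondering what such a cookie check looks like in practice, below is a minimal sketch of the idea in Python with Twisted. It is not the app’s actual authrevproxy.py; the cookie name, secret value, ports, and class name are illustrative assumptions, and the real app generates the secret per session when the job starts.

from twisted.internet import reactor
from twisted.web import proxy, server
from twisted.web.resource import ForbiddenResource  # Twisted >= 12

AUTH_COOKIE = b"tensorboard_auth"      # hypothetical cookie name set by the "Connect" button
SECRET = b"per-session-random-token"   # hypothetical value generated when the job starts
TB_HOST, TB_PORT = "127.0.0.1", 6006   # where TensorBoard itself listens
LISTEN_PORT = 4168                     # port the OnDemand rnode proxy targets

class AuthReverseProxy(proxy.ReverseProxyResource):
    """Forward requests to TensorBoard only when the expected cookie is present."""

    def _authorized(self, request):
        return request.getCookie(AUTH_COOKIE) == SECRET

    def render(self, request):
        # requests for "/" are rendered by the root resource itself
        if not self._authorized(request):
            return ForbiddenResource("Missing or invalid auth cookie").render(request)
        return proxy.ReverseProxyResource.render(self, request)

    def getChild(self, path, request):
        # requests for any sub-path go through getChild on the root resource
        if not self._authorized(request):
            return ForbiddenResource("Missing or invalid auth cookie")
        return proxy.ReverseProxyResource.getChild(self, path, request)

root = AuthReverseProxy(TB_HOST, TB_PORT, b"")
reactor.listenTCP(LISTEN_PORT, server.Site(root))
reactor.run()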

It’s been running in production for some time now on our Sherlock cluster and seems to be working fine for us.

If anyone wants to give it a try, please don’t hesitate to let us know how it goes!

That is a really cool simple solution for an authenticating reverse proxy. We did a similar approach with https://github.com/OSC/bc_osc_example_shiny for launching a Shiny app, but used OpenResty (NGINX) via Singularity and started the Shiny app listening on a Unix socket. The result was far more complex.

Hi Kilian,
This is looking good, thanks. However, I tried to use the twisted pip/conda packages as an alternative to installing RPMs, but I had errors with:

 >>> from twisted.web.error import ForbiddenResource
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: cannot import name 'ForbiddenResource'

I noticed that ErrorPage, NoResource and ForbiddenResource in twisted.web.error were deprecated in Twisted 9.0 and removed in Twisted Web 12 (in 2012), so this will be problematic with recent versions.
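For what it’s worth, a small compatibility shim along the following lines should cover both old and recent Twisted, assuming the app only needs ForbiddenResource and NoResource; in Twisted 12 and later these classes live in twisted.web.resource (still importable from there today, though the newest releases deprecate them in favor of twisted.web.pages):

try:
    # Twisted 12 and later
    from twisted.web.resource import ForbiddenResource, NoResource
except ImportError:
    # very old Twisted, where these classes lived in twisted.web.error
    from twisted.web.error import ForbiddenResource, NoResource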

Hi Kilian,

First, thanks for sharing the TensorBoard OOD app. I am porting it to our clusters per a user request. The app is now working under the YCRC OOD environment, except for one problem: the call graph display area in the ‘GRAPHS’ tab is not showing properly. It is way too small and cannot be enlarged. However, if I log onto the node where the TensorBoard server is running and view it locally on that node, the ‘GRAPHS’ tab displays properly.

I am attaching two screenshots to show the difference. I’m not sure if you have ever seen the same problem.

Best,

Ping

Figure 1: the ‘bad’ graph as viewed from OOD

Figure 2: the ‘good’ graph as viewed from the compute node where the TensorBoard server is running

Hi Sbutcher,

The app uses an older version of Twisted, which is provided with Python 2.

I resolved this using a virtual environment with Python 2. First create a directory called lib in tensorboard/template. Then create a virtual environment in tensorboard/template/lib/.venv. Activate the virtual env and then pip install twisted. Now we need to use this virtual env in before.sh.erb. Simply add this line at the beginning of before.sh.erb:

     source lib/.venv/bin/activate

@kilian, I’m trying this out right now (using OOD 2.0.13) and am having a few issues getting connected through the web interface. I just receive a 403 message saying “Forbidden Resource: Sorry, resource is forbidden”.

It’s quite possible that my changes to some of the code caused this, but I was having an issue similar to the one @sbutcher had with the ‘ForbiddenResource’ import, which I worked around by modifying bin/authrevproxy.py:

#from twisted.web.error import ForbiddenResource
from twisted.web.resource import Resource, ForbiddenResource

When I launch the job through OnDemand, it looks like authrevproxy starts up now, and TensorBoard also starts, so it looks like that part is good to go. I have TensorBoard running on a 127.0.0.1 port (and I can connect to it from the local system). There’s a proxy port as well, and it is listening. The output log doesn’t show anything significant (to me): it shows that the servers are listening on their ports and some CUDA warnings, which I’m not worried about, and the last line is the TensorBoard startup message.

The URL that’s created by ondemand after the app launches is in the format:
https://ondemand/rnode/host01/4168/

I don’t know where to find any other errors as to what’s throwing the forbidden resource. Does anyone have an idea?

Thanks.

I fixed this by swapping the authproxy.py from the OSC app with the one in the Stanford one.

Here is my working version, tweak as needed: https://github.com/mjbludwig/tensorboard_ood

I still get the forbidden resource error. I’m not sure I’ll spend much more time on it. There’s probably some other issue buried somewhere that I can’t find. Thanks though.

Since we upgraded to OOD 2.0.31 from v1.8.x, TensorBoard doesn’t work (403 Forbidden) using the authrevproxy approach. Does anyone have TensorBoard working with OOD 2.x that they can share with us, please?