Dashboard Crashing after attempted modifications made for SSL support, and Cluster Config Files

Following the instructions on the IO for “Add SSL Support” and “Add Cluster Configuration Files”, I ran into issues with the web interface loading the dashboard.

As you can see, it’s an issue with the method “fetch” in the clusters.rb file, which I have not touched myself. Passenger had started successfully, but the web app failed to start, seemingly because of this file.

I tried reverting all the changes I had made, but to no avail. After that I tried following the trail of errors, starting with the one shown for the clusters.rb file. I tried commenting out the problem lines to see if anything would change, but that didn’t fix it. Similar to the solution to my previous issue, I tried deleting the file and following the errors from there, and that didn’t fix it either. I’m left rather confused. Thanks for the help!

It may be a malformed cluster config file.

What do the cluster config files in /etc/ood/config/clusters.d look like?

Also, does your user have read access to these files, and rx on the /etc/, /etc/ood, and /etc/ood/config?
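One way to check this is a short Ruby script run as the affected user. This is only a sketch: the helper name access_problems is made up for illustration, and the directory list assumes the clusters.d location mentioned above.

```ruby
# Hypothetical helper: returns a list of access problems for the current user.
def access_problems(dirs, config_glob)
  problems = []
  dirs.each do |dir|
    if !File.directory?(dir)
      problems << "#{dir}: not a directory (or missing)"
    elsif !(File.readable?(dir) && File.executable?(dir))
      problems << "#{dir}: current user needs r-x"
    end
  end
  # Every cluster config file must be readable too.
  Dir.glob(config_glob).each do |file|
    problems << "#{file}: not readable" unless File.readable?(file)
  end
  problems
end

dirs = %w[/etc /etc/ood /etc/ood/config /etc/ood/config/clusters.d]
found = access_problems(dirs, "/etc/ood/config/clusters.d/*.yml")
puts(found.empty? ? "no access problems found" : found)
```

An empty result only says the permissions look fine; it says nothing about whether the files contain valid YAML.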

I’m not sure yet in what case YAML.safe_load(p.read) would return nil but one case is if p.read returned "".

Just out of curiosity, how would a malformed cluster file happen? If it was out of my control that’s fine, but it would be good to know how to prevent it if possible. There is currently no clusters.d directory: I deleted it because the only file in it was my cluster config file, which was incomplete/empty as a result of these issues.

drwxr-xr-x. 89 root root 8192 Feb 6 09:49 etc
drwxr-xr-x. 3 root root 20 Oct 15 11:46 etc/ood
drwxr-xr-x. 2 root root 140 Feb 5 18:48 config

I came to a similar conclusion about the nil return. That usually means literally nothing, correct? Not even a space, " ", because that would still be registered.

So that makes sense. If you had a clusters.d directory with a file that was empty, then you would get the result you saw. The reason is that with an empty file the method Pathname#read would return an empty string "" and YAML.safe_load("") returns nil. We should obviously handle this crash, so I’ll open an issue to track that bug.
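To illustrate, here is a minimal Ruby sketch of that failure mode. It assumes the loader calls fetch directly on whatever YAML.safe_load returns, which is what the error suggests. The load_cluster_config guard is a hypothetical name, not the actual ood_core code or the eventual fix.

```ruby
require "yaml"

# An empty file, and a file where every line is commented out,
# both contain no YAML document, so both parse to nil.
empty_result     = YAML.safe_load("")
commented_result = YAML.safe_load("# v2:\n#   metadata:\n")
puts empty_result.inspect
puts commented_result.inspect

# Calling fetch on nil is what raises the error.
begin
  empty_result.fetch("v2")
rescue NoMethodError => e
  puts "crash: #{e.message}"
end

# Hypothetical guard: fall back to an empty hash when nothing was parsed.
def load_cluster_config(text)
  YAML.safe_load(text) || {}
end

puts load_cluster_config("").fetch("v2", {}).inspect
```

With the guard, an empty file yields an empty config instead of an exception. Whether an empty cluster file should be skipped silently or reported is a separate design decision.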

We do not have any automation to generate the cluster config files, so the only way you would get an empty or malformed config is if someone added it. When I say malformed, I mean the file is empty, or it’s missing the v2: at the beginning, or parts of it are improperly indented, etc.
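Those three cases can be told apart with a small check. The following is a sketch only: cluster_config_problem is a hypothetical helper, and it tests just the conditions listed above, not everything a complete cluster config needs.

```ruby
require "yaml"

# Hypothetical helper: returns nil when the text passes these basic checks,
# otherwise a short description of the first problem found.
def cluster_config_problem(text)
  data = YAML.safe_load(text)
  return "file is empty (or contains only comments)" if data.nil?
  return "top level is not a mapping" unless data.is_a?(Hash)
  return "missing the v2: key at the beginning" unless data.key?("v2")
  nil
rescue Psych::SyntaxError => e
  # Bad indentation and similar mistakes usually end up here.
  "YAML syntax error: #{e.message}"
end

puts cluster_config_problem("").inspect
puts cluster_config_problem("metadata:\n  title: \"My Cluster\"\n").inspect
```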

Just to clarify: before clusters.d/cluster.yml was deleted, I’m very sure it looked similar to this,

# /etc/ood/config/clusters.d/my_cluster.yml
---
v2:
  metadata:
    title: "My Cluster"
  login:
    host: "my_cluster.my_center.edu"

(but with my cluster’s IP info)

That is what the IO shows. When that didn’t work, I believe I commented it out, which didn’t work either. I then deleted it (as stated in my previous post) since it was evidently erroneous in that state.

I then attempted to fix the issue by removing both (seemingly erroneous) cluster files,

/etc/ood/config/clusters.d/cluster.yml (as stated in previous post)

/var/www/ood/apps/sys/dashboard/vendor/bundle/ruby/2.4.0/gems/ood_core/clusters.rb

It did not in fact fix the issue, and it led to a trail of more errors starting with

Then,

With the last one resulting in a 500 Internal Server Error. Is it in my best interest to reinstall? There wasn’t much done on this install since I never got past the SSL step in the IO procedure due to the adding of the cluster configuration files resulting in the crashing of the dashboard.

You would need to restore /var/www/ood/apps/sys/dashboard/vendor/bundle/ruby/2.4.0/gems/ood_core/clusters.rb since that defines a class that the rest of the apps use.

That cluster config looks good, I’d add that back and debug further. Do you see the same error as above when that cluster config file exists?

After running “yum remove ondemand-*” for a clean uninstall and then following the IO for a clean install, I reinstated the cluster config exactly as stated above and still got a 500 Internal Server Error.

The /var/log is similar as before but a little different (I’m guessing because this error is off a clean install without me doing any debugging modifications yet)

@efranz is any more log data needed to help with the troubleshooting? From what I’ve found there isn’t much informative data to gather, unless I’m missing some places where more error logs could be. The error message I posted was the only substantial difference from the logs when it was functioning without the cluster config file. As for the error page on the site with the cluster config file added in, it’s just a plain text 500 Internal Server Error, which is far less informative than the Passenger error pages.

I took a peek at the mentioned ticket #150 and then at ticket #152 mentioned on that one, and looked around elsewhere on the git and Discourse. I didn’t gather much from them, though it seemed like a similar situation. I’ll try fleshing out the config file beyond the bare minimum I mentioned above to see if that changes anything, since the error and the crashing initially came from an empty file and a near-empty file. For example, I’ll add the job section for job mapping, and then attempt to add LDAP support after I add SSL support, since I just got my certs. I’ve been hesitant to do so because the dashboard crashes with just the metadata and login content in the cluster config file.

That error message is caused by bad YAML parsing, though I can only guess where it’s coming from (your clusters.d/ files). How it’s propagating up to this level I can’t say, and that is kind of worrisome. Do you have any applications you’re also deploying? The manifest.yml files of those apps could also be the cause.

Can you validate your cluster yml files (and/or manifest.yml)? You can try at the link below. If they are invalid - can you keep them and supply them here (keeping the current formatting). I’d very much like to be able to reproduce these issues.
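If pasting the files into an online validator is awkward, a similar syntax check can be run on the server itself. This is a rough sketch, assuming Ruby is available there. The helper name yaml_syntax_report is made up, and the glob is just the clusters.d path from earlier in this thread.

```ruby
require "yaml"

# Hypothetical helper: maps each matching file to nil (it parses)
# or to the parser's error message (it does not).
def yaml_syntax_report(glob)
  Dir.glob(glob).sort.each_with_object({}) do |path, report|
    begin
      YAML.safe_load(File.read(path))
      report[path] = nil
    rescue Psych::SyntaxError => e
      # The message includes the line and column where parsing failed.
      report[path] = e.message
    end
  end
end

yaml_syntax_report("/etc/ood/config/clusters.d/*.yml").each do |path, error|
  puts "#{path}: #{error || 'OK'}"
end
```

This only checks that each file is syntactically valid YAML. A file can parse cleanly and still be an unusable cluster config, for example if it is empty.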

This is the Cluster Config file validation. This made me realize there may have been some discrepancy copying from the IO and pasting into VIM.

I’m currently not deploying any other apps on this cluster except for OOD. So it makes sense that there is no manifest.yml among my clusters.d/ files, correct? I did a which and a find on manifest.yml and nothing came up.

Yea manifest.yml are in applications (in /var/www/ood/apps/sys/) so as long as you’re not modifying the apps that ship with it or adding new apps then you shouldn’t be running into the problem.

I was able to confirm that “mapping values are not allowed in this context” is a YAML error, and I was able to reproduce it by removing a colon.

Something like this would trigger such an error.

foo:
  missing_colon
  but_this_is_ok:
  - array_entry_1
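
The snippet above can be fed straight to Ruby’s YAML parser to see the message. This is a sketch for reproduction only. The constant and helper names are made up, and the “fixed” variant is just one guess at what the intended YAML might be.

```ruby
require "yaml"

BROKEN_YAML = <<~EOS
  foo:
    missing_colon
    but_this_is_ok:
    - array_entry_1
EOS

# One possible correction: restore the colon after missing_colon.
FIXED_YAML = <<~EOS
  foo:
    missing_colon:
    but_this_is_ok:
    - array_entry_1
EOS

# Hypothetical helper: returns the parser's message, or nil if the text parses.
def syntax_error_for(text)
  YAML.safe_load(text)
  nil
rescue Psych::SyntaxError => e
  e.message
end

puts syntax_error_for(BROKEN_YAML)
puts YAML.safe_load(FIXED_YAML).inspect
```

The parser treats missing_colon and the following line as one multi-line value, so the colon after but_this_is_ok shows up where no mapping key is allowed.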

After correcting the cluster config with the YAML validation I was able to add the cluster config file without a 500 Internal Server Error! Thanks for the tip on validating. I obviously do not know YAML syntax too well.