I’m loving Seahorse, a GUI frontend for Spark by deepsense.io. The interface is simple, elegant, and beautiful, and its drag-and-drop nature has the potential to significantly speed up development on a machine learning workflow. Thus far I haven’t run into any major bugs that affect the results, so naturally that shoots it near the top of my list for “useful things.”
That said, the lab cluster here at Brilliant Data is a highly available Cloudera CDH installation, and at first getting Seahorse to work was a frustrating experience. It turns out this is due to a bug in how Cloudera handles configuration files in HA environments (see this link for details).
The fix is simple enough, but not all that intuitive.
When you first install Seahorse, the instructions ask you to place copies of your Hadoop configuration files (yarn-site.xml, core-site.xml, etc.) into a directory called “MyYARN” in your Seahorse home folder. Many people would think to copy those from the cluster using scp or something similar, but this can backfire if you pull them from the wrong directory (and yes, there is a wrong directory in a Cloudera HA cluster).
First, open Cloudera Manager and go to the download page for the version of the config files that Cloudera uses by clicking on the menu next to the cluster name and selecting “View Client Configuration URLs.”
That will pop up a list of the options. We want the YARN config files, so select those; by default the zip file will download to the Downloads folder on the client machine.
Use your favorite method to transfer the zip file to the machine running Seahorse, unzip the file, and transfer only the .xml files to /<seahorse_home>/data/MyYARN. Restart the Docker container for Seahorse and everything should work fine now.
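If you would rather script the unzip-and-copy step than do it by hand, here is a minimal sketch in Python. It assumes the zip is already on the Seahorse machine; the function name extract_xml_configs and the example paths are mine, not anything Seahorse or Cloudera provides, so adjust them to your setup.

```python
import zipfile
from pathlib import Path


def extract_xml_configs(zip_path, target_dir):
    """Copy only the .xml files from a client-config zip into target_dir.

    Hypothetical helper: any folder structure inside the zip is flattened,
    so every .xml file ends up directly inside target_dir.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.infolist():
            name = Path(member.filename).name
            # Skip directories and anything that is not an XML config file.
            if member.is_dir() or not name.lower().endswith(".xml"):
                continue
            (target / name).write_bytes(archive.read(member))
            copied.append(name)
    return sorted(copied)
```

Calling something like extract_xml_configs("yarn-clientconfig.zip", "/<seahorse_home>/data/MyYARN") (the zip filename here is a guess, use whatever Cloudera Manager gave you) returns the list of files it copied, which makes it easy to confirm that yarn-site.xml and core-site.xml made it across before you restart the container.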
Have fun with your new GUI for Spark.