It’s 4am local time and I can’t sleep. I’ve been tossing and turning thinking about how I’m going to crunch some “Big Data” I have on my small cluster of 3 machines (+a dedicated NAS).
Ideally I’d like to deploy Mesosphere DCOS and run Spark with Scala. However, Mesosphere DCOS isn’t available to the public yet outside of the “Enterprise Edition”, which the sales team said they were only releasing to “Select Customers”.
With that in mind, I settled on running Mesos on Ubuntu Server for now, mounting the NAS on each node to give them access to the data.
However, I realized that configuring 3 servers by hand would be time consuming and there had to be a better way.
Enter MAAS: MAAS is Canonical’s system for managing physical servers as if they were VMs. This sounded perfect for my needs, as it would allow me to quickly set up and blow away servers whenever I wanted to try a different configuration.
For reference, the network topology is (Internet) -> (Router/Switch Combo) -> (Port 1) -> NAS, (Port 2) -> (Switch) Ports (2,3,4) -> discovery-(1,2,3), and a 5th port to a random desktop.
Getting started was surprisingly easy. Since the NAS runs Ubuntu Server 14.04 already, I just needed to run apt-get install maas to install the MAAS region manager (which provides the API) and cluster manager (which allows the region manager to manage the physical servers).
The first minor surprise was that the URL is case sensitive: instead of http://<ip>/maas it has to be http://<ip>/MAAS
OK, with that hurdle crossed, I was ready to rock.
Next was configuring my home router (an Archer C7 v2 running OpenWRT). Since it manages DHCP for the home (vs. letting MAAS handle that), I had to edit the /etc/config/dhcp file to point to the NAS like so:
config boot linux
	option filename 'pxelinux.0'
	option serveraddress '192.168.1.227'
	option servername 'ubuntunas'
After updating the config, I restarted dnsmasq, but trying to boot discovery-1 via PXE failed due to an incorrect filename. It turns out you have to install some boot images on MAAS before you can PXE boot from it.
With that done, I could boot discovery-1 via PXE. The process for enrolling the server was a little unintuitive: first I had to PXE boot it, and then, after showing some output on the screen, it turned itself off. It turns out that this first stage just collects some data that is then shown in the MAAS dashboard.
OK, now there was a new machine called maas-enlist.lan. Clicking it gave me a bunch of options, the interesting one being “Commission”. Clicking Commission powered on discovery-1, automatically booting it via PXE, and, similar to the first PXE step, it shut off automatically after showing a bit of scrolling information on the screen. However, now the dashboard knew a bit more about maas-enlist.lan, such as the number of CPU cores, the amount of RAM and the size of the hard drive. Clicking maas-enlist.lan again now gave me two new options: “Commission” had been replaced with “Acquire” and “Acquire and Start”.
Having read a little about MAAS, I assumed that Acquire meant “assign this machine to me”. Clicking Acquire and Start reminded me to configure an SSH key so I could access the machine once it started. After doing so, clicking the button again kicked off the Ubuntu install.
Great, 1 machine down, ~1 hour spent so far. I went on to configure discovery-2 and 3, and that’s when I ran into my second surprise: discovery-2 “failed to enrol” after the first PXE boot.
I noticed that the scrolling information about the machine had mentioned its hostname was maas-enlist.lan, which was the same as discovery-1’s hostname. Assuming that the issue was the hostname conflict, I set out to give each machine a static lease so that DHCP would set the hostname. Looking at the same OpenWRT DHCP config as before, it was pretty easy. I just added this to the dhcp config file:
config host
	option ip '192.168.1.208'
	option mac 'd0:67:e5:xx:xx:xx'
	option name 'discovery-1'

config host
	option ip '192.168.1.212'
	option mac 'd0:67:e5:xx:xx:xx'
	option name 'discovery-2'

config host
	option ip '192.168.1.229'
	option mac 'd0:67:e5:xx:xx:xx'
	option name 'discovery-3'
(I just kept the IPs that had been assigned via DHCP already)
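With more than a handful of machines, typing these stanzas by hand gets tedious. Here is a minimal Python sketch that renders them from a list. The helper names (render_host, render_all) are my own invention for illustration, and the MAC addresses are left as the same redacted placeholders used above, so you would need to fill in real ones before pasting the output into /etc/config/dhcp.

```python
# Hypothetical helper (not part of OpenWRT or MAAS): renders `config host`
# stanzas in the same shape as the static leases shown above.

# (name, ip, mac) tuples; the MACs are redacted placeholders, not real values.
HOSTS = [
    ("discovery-1", "192.168.1.208", "d0:67:e5:xx:xx:xx"),
    ("discovery-2", "192.168.1.212", "d0:67:e5:xx:xx:xx"),
    ("discovery-3", "192.168.1.229", "d0:67:e5:xx:xx:xx"),
]


def render_host(name, ip, mac):
    """Return one `config host` stanza as a string."""
    return "\n".join([
        "config host",
        f"\toption ip '{ip}'",
        f"\toption mac '{mac}'",
        f"\toption name '{name}'",
    ])


def render_all(hosts):
    """Return all stanzas, separated by blank lines."""
    return "\n\n".join(render_host(*host) for host in hosts) + "\n"


if __name__ == "__main__":
    print(render_all(HOSTS), end="")
```

The output still has to be appended to the router's dhcp config and dnsmasq restarted by hand; the sketch only saves the typing.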
After doing so, restarting dnsmasq again and booting discovery-2 via PXE again, it got past the enrolment stage. Sweet!
Unfortunately, I hit another snag here. Hitting “Commission” didn’t boot discovery-2 like it had for discovery-1, and after a lengthy timeout it reported it was unable to query the BMC. I clicked discovery-2 to see if it told me anything interesting, and it complained it wasn’t able to connect to the cluster manager. Long story short, the cluster had picked up the wrong network interfaces: it was configured to use Docker’s virtual network and QEMU’s virtual network in addition to the real physical network. I deleted those from the cluster, saved, and it connected again. discovery-1 was still deploying at this point, which seemed ridiculously long, so I switched the monitor to it to see how it was going. It was stuck in a loop trying to retrieve the install image from http://172.42.17.1 (Docker’s virtual address).
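In hindsight, a quick sanity check would have caught this: any interface whose address isn’t on the home LAN has no business serving install images to the nodes. Below is a rough Python sketch of that check. Note the assumptions: I’m treating the LAN as 192.168.1.0/24, and the interface names and the virbr0 address are illustrative stand-ins rather than values copied from my cluster config. Only the NAS address and the Docker address come from this post.

```python
import ipaddress

# Assumption: the home LAN is a /24; adjust to match your router.
LAN = ipaddress.ip_network("192.168.1.0/24")

# Illustrative interface list. The names and the virbr0 address are made up
# for the example; 192.168.1.227 (NAS) and 172.42.17.1 (Docker) are from the post.
INTERFACES = {
    "eth0": "192.168.1.227",
    "docker0": "172.42.17.1",
    "virbr0": "192.168.122.1",
}


def on_lan(addr, lan=LAN):
    """True if the address belongs to the physical LAN subnet."""
    return ipaddress.ip_address(addr) in lan


def split_interfaces(interfaces, lan=LAN):
    """Split interfaces into (keep, drop) based on LAN membership."""
    keep = {name: addr for name, addr in interfaces.items() if on_lan(addr, lan)}
    drop = {name: addr for name, addr in interfaces.items() if not on_lan(addr, lan)}
    return keep, drop


if __name__ == "__main__":
    keep, drop = split_interfaces(INTERFACES)
    print("keep in cluster:", keep)
    print("remove from cluster:", drop)
```

This only tells you which interfaces to remove; the removal itself still happens on the cluster's settings page in MAAS.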
Fast forwarding a little here, I deleted discovery-1 from MAAS, did the whole PXE boot, commission and acquire dance again and this time it looked like it was working.
Great, back to discovery-2. I tried to enrol it again and got the same result. Poking around on the edit instance page, I noticed MAAS had supposedly configured the BMC on my behalf, and it said the BMC’s IP address was 192.168.1.120. Nothing odd about that, I thought; it was the same IP address I saw when booting this node. Checking discovery-1’s instance config, its BMC had an IP ending in .238, which again seemed OK. But then I realized that I didn’t see a .120 entry in DHCP, while I did see a .238 entry (with the hostname idrac), so for some reason the BMC on discovery-2 hadn’t been configured. I rebooted it, jumped into the config and yep: not only was it set to a static IP, it had “IPMI over LAN” set to “off”. I changed both of those values, did the dance, and discovery-2 was ready for acquisition. After making the same change to discovery-3 and doing the dance, it was ready as well.
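The clue that cracked this was a BMC address with no matching DHCP lease. That comparison is easy to script. The following Python sketch assumes the router’s leases are in dnsmasq’s lease-file format (one lease per line: expiry, MAC, IP, hostname, client ID). The sample lease lines, the MACs in them, and the full 192.168.1.238 address are assumptions for illustration (I only noted that the address ended in .238), and the function names are hypothetical.

```python
# Made-up sample in dnsmasq lease-file format:
#   <expiry> <mac> <ip> <hostname> <client-id>
# The expiry times and MACs are placeholders, not real values.
SAMPLE_LEASES = """\
1430000000 00:00:00:00:00:01 192.168.1.238 idrac *
1430000000 00:00:00:00:00:02 192.168.1.208 discovery-1 *
"""

# BMC addresses as reported by MAAS. The .238 address is assumed to be
# on 192.168.1.x like everything else in this setup.
BMC_IPS = {
    "discovery-1": "192.168.1.238",
    "discovery-2": "192.168.1.120",
}


def leased_ips(leases_text):
    """Collect the IP column (third field) from each lease line."""
    ips = set()
    for line in leases_text.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            ips.add(fields[2])
    return ips


def bmcs_without_lease(bmc_ips, leases_text):
    """Return the nodes whose BMC address has no DHCP lease."""
    leased = leased_ips(leases_text)
    return {node: ip for node, ip in bmc_ips.items() if ip not in leased}


if __name__ == "__main__":
    for node, ip in bmcs_without_lease(BMC_IPS, SAMPLE_LEASES).items():
        print(f"{node}: BMC {ip} has no DHCP lease, check its IPMI settings")
```

A missing lease doesn't prove the BMC is misconfigured (a deliberately static address would look the same), but it is a cheap hint about where to look first.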
I hit “Acquire and Start” on both machines, waited about 4 minutes, and they had both been deployed. SSHing into discovery-{1,2,3} gave me the familiar “Welcome to Ubuntu” login prompt.
And with that, I now have 3 machines I can do what I want with and easily re-provision as needed.
If I were doing this again, I’d probably connect the second interface on UbuntuNAS to the switch the servers are on and use the automatic enrolment tool, since anything on that switch/interface is a server I’d want available for deployments. This would remove a bit of friction, especially if I were doing a large number of machines.
In the next instalment, I’ll talk about provisioning each machine with Mesos, Spark and HDFS.