Opposable Thumbs

Running Cloudera Hadoop VM on VirtualBox

So, I’ve been keeping an eye on Hadoop for a while now. The whole idea of MapReduce is pretty cool, and the fact that there is a free implementation of it is even cooler. Of course, as a busy developer who also ends up playing sysadmin, build engineer, etc. on a daily basis, I haven’t had time to really dig into it.

Then Hadoop got even more attention this past week when Amazon announced their new Elastic MapReduce offering. Basically, they give you a nice API for quickly spinning up new EC2 instances running Hadoop. You upload your data to S3 and point your jobs at it. It’s brilliant really – I’m still trying to find an excuse to play with it.

Anyway… I was getting to something here… what was it? Oh yes, Cloudera. The Cloudera folks have very generously produced this excellent series of tutorials on running and developing for Hadoop. As a part of all this, they also created a rather spiffy prebuilt Hadoop VM to play with along with the tutorials. (The VM, by the way, is available for download alongside the tutorials – I’m not going to link directly to it here in case my link goes out of date.)

There’s one problem with all this – they built the VM as a VMware image. I don’t really care for VMware much these days (long story, nothing personal, just my own preferences). I run KVM on my workstation and VirtualBox on my laptop. Both are really excellent options and I highly recommend them. But in order to work through their examples, I really wanted to spin up this VM and see what it was all about.

Thankfully, VirtualBox is very good at running VMware virtual machines. Here’s how to get it working:

  1. Obviously, the first step is to download and extract the archive somewhere.
  2. Once you’ve got the VM somewhere convenient, start up VirtualBox and go to the ‘File > Virtual Media Manager’ menu. From here, you can click ‘Add’ and point it at the file ‘cloudera-training-0.2.vmdk’ in the folder where you extracted the Cloudera archive. Then close that window.
  3. Back in the main VirtualBox interface, click the ‘New’ button to create a new VM. Select the OS as ‘Linux’ and version as ‘Ubuntu’. Set the memory to 512MB. When you get to the boot disk screen, it should have auto-selected the new drive you added earlier. If not, select it yourself. Then hit ‘Next’, then ‘Finish’.
  4. Select the VM you just created and click the ‘Settings’ button. Select ‘Network’ in the settings control panel and go to the MAC Address field. You’ll need to open the file ‘cloudera-training-0.2.vmx’ back in the folder you extracted earlier. Find the line labelled ‘ethernet0.generatedAddress’ and grab the MAC address that’s given there. Remove the colons and paste the result into the MAC Address field. This might not be strictly necessary, but it makes sure the Linux image still has the MAC address it expects.
  5. Next select the ‘General’ control panel and then the ‘Advanced’ tab. Check the ‘Enable PAE/NX’ box and then click the ‘OK’ button.
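
If you’d rather not hand-edit the MAC address from step 4, here’s a minimal Python sketch that pulls it out of the .vmx text and strips the colons. The function name ‘mac_from_vmx’ and the sample address are made up for illustration (they’re not from the Cloudera files), and it assumes the usual key = "value" layout of a .vmx file.

```python
import re


def mac_from_vmx(vmx_text):
    """Return the ethernet0.generatedAddress value without colons, or None if absent."""
    for line in vmx_text.splitlines():
        key, sep, value = line.partition("=")
        # Match the key exactly so that similar keys such as
        # ethernet0.generatedAddressOffset are not picked up by mistake.
        if sep and key.strip().lower() == "ethernet0.generatedaddress":
            mac = value.strip().strip('"').replace(":", "").upper()
            if not re.fullmatch(r"[0-9A-F]{12}", mac):
                raise ValueError("unexpected MAC address format: " + value.strip())
            return mac
    return None


# Hypothetical sample content; the real file will contain a different address.
sample = (
    'ethernet0.present = "TRUE"\n'
    'ethernet0.generatedAddress = "00:0c:29:12:34:56"\n'
    'ethernet0.generatedAddressOffset = "0"\n'
)
print(mac_from_vmx(sample))  # prints 000C29123456
```

To use it on the real thing, read ‘cloudera-training-0.2.vmx’ into a string and pass that in, then paste the result into the MAC Address field.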

That’s it. You should now be able to run the VM just fine. You might want to install the Guest Additions package, too, but you’ll need to run a quick “sudo apt-get update; sudo apt-get install build-essential linux-headers-2.6.27-11-server” in the VM first so the Guest Additions kernel modules can be built. Good luck and happy MapReducing!