Vagrant –> Docker –> Spark !

Gluster

2014-02-18

Vagrant lets you reproduce a pure Linux box anywhere, automatically, but at the cost of provisioning a brand new VM, which takes a while. Docker, on the other hand, gives you a lightweight framework for running containerized, layered software stacks on top of Linux containers. Using Vagrant to spin up your Docker instances gives you a completely reproducible way of rapidly spinning up software stacks.

Spark?

I’ve been wanting to create a reproducible, hackable Spark environment for a while now. However, given the complexity of maintaining Hadoop distros, I figured it would be overkill to spend all day reading the Spark documentation. I’m also probably going to try to put Gluster underneath it :), so I want the setup to be hackable.

Normally I’d try this in Vagrant, but I’ve already been cheating on Vagrant lately with libvirt. So I might as well go all the way off the deep end and try Docker…

– I love Vagrant, but I don’t have the Packer chops to create boxes easily (yet).
– Anyways, all the cool folks are using Docker nowadays.
– I want to test larger clusters of 5-10 nodes. I doubt my laptop can handle that with full VM setups.

The folks at Berkeley’s AMPLab have been maintaining Docker recipes. I’m on a Mac… how can I spin up Linux containers on my Mac?

With Vagrant !

Here’s how this will work:

So Spark will run in hosted Linux containers, which are created via Docker, running on a CentOS or Fedora VM, which itself runs as a guest on my Mac OS X box.

Vagrant is like the giving tree that keeps on giving, even after I abandoned it 🙂 … So here are the steps.

– First, we will spin up a local CentOS or Fedora box using “vagrant up”. That will be our “host”, at least from Docker’s perspective (it is a guest from the Mac’s).
– Then, we will install Docker and do a couple of minor things, like starting the Docker service.
– Then, we will pull down the GitHub repo containing Docker utilities for setting up Spark.
– Finally, we will run the deploy script, which will download all the Docker images and create the Spark containers for us.

Let’s get started.

PART 1: Spin up a VM that has LXC support.

I’m on my Mac, so I need a base Linux box to run LXCs on. So here goes. You can use “vagrant box list” to list all your base boxes and see if any of them fits the bill (if not, you can add a box yourself using ‘vagrant box add’, from http://vagrantbox.es/).

jays-MacBook-Pro:sandbox_docker Jpeerindex$ vagrant box list
base                     (virtualbox)
base-hadoop              (virtualbox)
centos-6-vbox            (virtualbox)
fedora-19C               (virtualbox)
vagrant-fedora19B        (virtualbox)
vagrant-fedora19B2       (virtualbox)
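Eyeballing the list works fine for a handful of boxes. If you want to script the check, here is a minimal sketch: has_box is a hypothetical helper of mine (not a vagrant subcommand), and it assumes the box name is the first whitespace-separated column of the output, as in the listing above.

```shell
# has_box NAME: succeed if NAME appears as a box name in `vagrant box list`
# output supplied on stdin. Hypothetical helper, not part of vagrant.
has_box() {
    # Box names are the first whitespace-separated field on each line;
    # require an exact match so "centos" does not match "centos-6-vbox".
    awk -v name="$1" '$1 == name { found = 1 } END { exit !found }'
}
```

You would use it like this: vagrant box list | has_box centos-6-vbox || echo "no such box, add one with vagrant box add"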

Alright! Now we’ve got a centos-6-vbox instance sitting around. Let’s try it out.

vagrant init

# Now, modify the configure section in Vagrantfile to look like this:
Vagrant.configure(VAGRANTFILE_API_VERSION) do |config|
  config.vm.box = "centos-6-vbox"
end
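If you would rather not hand-edit the generated file, the same Vagrantfile can be written from the shell. This is only a sketch: write_vagrantfile is a hypothetical helper, it overwrites any existing Vagrantfile in the target directory, and it assumes API version "2", which is what "vagrant init" generated for me.

```shell
# write_vagrantfile BOX [DIR]: write a minimal Vagrantfile for BOX into DIR
# (default: current directory). Hypothetical helper; overwrites an existing file.
write_vagrantfile() {
    box=$1
    dir=${2:-.}
    # Unquoted heredoc so that $box is expanded into the Ruby source.
    cat > "$dir/Vagrantfile" <<EOF
# -*- mode: ruby -*-
VAGRANTFILE_API_VERSION = "2"

Vagrant.configure(VAGRANTFILE_API_VERSION) do |config|
  config.vm.box = "$box"
end
EOF
}
```

For example: write_vagrantfile centos-6-vbox . would replace the "vagrant init" plus edit step above.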

And now, bring up your box:

vagrant up
vagrant ssh

PART 2: Set DOCKER up on your VM

Once the box is spun up, run “vagrant ssh”, become root with su (password is vagrant), and do this:

[root@localhost vagrant]# yum install http://ftp.riken.jp/Linux/fedora/epel/6/i386/epel-release-6-8.noarch.rpm

[root@localhost vagrant]# yum -y install docker-io git

[root@localhost vagrant]# service docker start

Okay! Now we’ve got a centos-6-vbox instance with the Docker service running. Let’s do something useful with it… LIKE SETTING UP A SPARK CLUSTER. So let’s see if we can pull in the bits described at http://www.boardmad.com/2013/11/08/got-a-minute-spin-up-a-spark-cluster-on-your-laptop-with-docker/ …

    git clone https://github.com/amplab/docker-scripts.git
    cd docker-scripts
    # The blog post above used the -blogpost- branch for this Spark recipe,
    # but that failed for me, so I just used master… just being explicit here.
    git checkout master
    # The Spark nameserver script needs the "dig" command, which bind-utils provides.
    yum install bind-utils
    # Start up docker before we try to deploy…
    service docker start
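Since a missing tool only shows up halfway through the deploy, a quick preflight check can save a round trip. A minimal sketch, where missing_cmds is a hypothetical helper that simply looks each name up on the PATH:

```shell
# missing_cmds CMD...: print each CMD that is not on PATH, one per line.
# Returns non-zero if anything is missing. Hypothetical helper.
missing_cmds() {
    status=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            printf '%s\n' "$cmd"
            status=1
        fi
    done
    return "$status"
}
```

Running missing_cmds git docker dig should print nothing if the packages above installed cleanly. Note that this only checks that the binaries exist, not that the docker service is actually running.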

Okay. Now we do the magic install command:

    # The "dig" command is required by the Spark nameserver setup script
    # (skip this if you already installed bind-utils above).
    yum install bind-utils
    # Run the deploy via sudo. If root is not allowed to use sudo,
    # edit sudoers (with visudo) to allow it.
    sudo ./deploy/deploy.sh -i amplab/spark:0.8.0 -c

And we wait. I had to run this a couple of times to get it to work. For example, networking went down in my VM during the process, and that gave me some cryptic errors… Also, the “dig” command was not found (but that is solved above by installing bind-utils with yum). You should now see something like this.
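If you get tired of re-running a flaky step by hand, a small retry wrapper can do it for you. This is a sketch under assumptions: retry and RETRY_DELAY are names I made up, and it presumes the wrapped command is safe to run again after a failure (re-running the deploy worked for me, but I have not checked what it leaves behind).

```shell
# retry MAX CMD [ARGS...]: run CMD until it succeeds, at most MAX times,
# pausing RETRY_DELAY seconds (default 5) between attempts. Hypothetical helper.
retry() {
    max=$1
    shift
    attempt=1
    while :; do
        # Stop as soon as the command exits with status 0.
        "$@" && return 0
        if [ "$attempt" -ge "$max" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "${RETRY_DELAY:-5}"
    done
}
```

For example: retry 3 sudo ./deploy/deploy.sh -i amplab/spark:0.8.0 -c. Keep in mind that with -c a successful run ends in an interactive shell, so retry only returns once you leave that shell.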

21fa12d13aa7: Pulling dependent layers
8dbd9e392a96: Download complete
1346666e8c33: Download complete
db12ca675b07: Download complete
9deb2dda52f2: Download complete
e58fb78373de: Downloading [===========================> ] 75.04 MB/136.6 MB 3m33s

What’s happening above? Each layer of the Spark shell image is being downloaded from the remote Docker image repository.

After a while, you’ll get dumped right into your Scala shell, where you can run Spark commands:

Using Scala version 2.9.3 (OpenJDK 64-Bit Server VM, Java 1.7.0_03)
Initializing interpreter…
Creating SparkContext…
14/02/18 10:48:32 INFO Slf4jEventHandler: Slf4jEventHandler started
14/02/18 10:48:32 INFO SparkEnv: Registering BlockManagerMaster
14/02/18 10:48:32 INFO MemoryStore: MemoryStore started with capacity 510.4 MB.
14/02/18 10:48:32 INFO DiskStore: Created local directory at /tmp/spark-local-20140218104832-724c
14/02/18 10:48:32 INFO ConnectionManager: Bound socket to port 39507 with id = ConnectionManagerId(shell32123,39507)
14/02/18 10:48:32 INFO BlockManagerMaster: Trying to register BlockManager
14/02/18 10:48:32 INFO BlockManagerMaster: Registered BlockManager
14/02/18 10:48:32 INFO HttpBroadcast: Broadcast server started at http://172.17.0.18:34870
14/02/18 10:48:32 INFO SparkEnv: Registering MapOutputTracker
14/02/18 10:48:32 INFO HttpFileServer: HTTP File server directory is /tmp/spark-c1fd8306-8581-40e9-a5d0-cca27d8b2bca
14/02/18 10:48:33 INFO SparkUI: Started Spark Web UI at http://shell32123:4040
14/02/18 10:48:33 INFO Client$ClientActor: Connecting to master spark://master:7077
Spark context available as sc.
Type in expressions to have them evaluated.
Type :help for more information.

scala> 14/02/18 10:48:34 INFO SparkDeploySchedulerBackend: Connected to Spark cluster with app ID app-20140218104834-0000
14/02/18 10:48:34 INFO Client$ClientActor: Executor added: app-20140218104834-0000/0 on worker-20140218104745-worker1-48080 (worker1:48080) with 1 cores
14/02/18 10:48:34 INFO SparkDeploySchedulerBackend: Granted executor ID app-20140218104834-0000/0 on hostPort worker1:48080 with 1 cores, 800.0 MB RAM
14/02/18 10:48:34 INFO Client$ClientActor: Executor added: app-20140218104834-0000/1 on worker-20140218104749-worker2-59395 (worker2:59395) with 1 cores
14/02/18 10:48:34 INFO SparkDeploySchedulerBackend: Granted executor ID app-20140218104834-0000/1 on hostPort worker2:59395 with 1 cores, 800.0 MB RAM
14/02/18 10:48:39 INFO Client$ClientActor: Executor updated: app-20140218104834-0000/1 is now RUNNING
14/02/18 10:48:40 INFO Client$ClientActor: Executor updated: app-20140218104834-0000/0 is now RUNNING
14/02/18 10:48:42 INFO SparkDeploySchedulerBackend: Registered executor: Actor[akka://sparkExecutor@worker2:36207/user/Executor] with ID 1
14/02/18 10:48:43 INFO BlockManagerMasterActor$BlockManagerInfo: Registering block manager worker2:34632 with 510.4 MB RAM
14/02/18 10:48:50 INFO SparkDeploySchedulerBackend: Registered executor: Actor[akka://sparkExecutor@worker1:57318/user/Executor] with ID 0
14/02/18 10:48:50 INFO BlockManagerMasterActor$BlockManagerInfo: Registering block manager worker1:39919 with 510.4 MB RAM

scala>
