Simplified ecosystem infrastructure with Big Top.

Gluster

2013-07-09

Big Top normalizes your hadoop distributions by unified building, packaging, and integration testing of the latest distributions of tools in the hadoop ecosystem. That way you know that your various ecosystem components play nice together. For example – your Hbase version is compatible with your Zookeeper/Hadoop distributions, etc…

Big Top is a meta-project — analogized to the linux kernel for big data. Big top provides a sandbox VM, which is a pseudo distributed cluster, and also packages smoke tests for your existing hadoop cluster’s functionality with existing tools. Big top is very flexible ~ You can not only use bigtop to test existing HDFS deployments, but also, different hadoop stacks which use HCFS implementations (for example Gluster (GlusterFS), S3, etc…).

To qoute Roman Shaposhnik on the nature of the big top project:

I’ll stick with my Linux distro analogy. We’re a Bigdata management
distro. Our kernel is Hadoop and our userland is all the other projects.
We produce the packages for end users to consume and we also
maintain code that facilitates production of those packages (build,
deploy, test).

A First Look At BigTop …

There is a blog post dedicated just to the very subject of explaining exactly what big top “really” is. As… its an abstract and different sort of project : https://blogs.apache.org/bigtop/entry/bigtop_and_why_should_you.

Note, the urls are a little obsolete, replace the SVN ones with the respective GIT urls for the current bigtop release, which is here: https://github.com/apache/bigtop/tree/master.

What does this mean for us (the hadoop user community)?

It means that someone out there has realized and acted on the fact that “hadoop” is not just HDFS + MapReduce anymore. Its a BIG family of scalable compute tools that give us different performance and semantic specifications for wrangling tera and pedascale data sets. And tons of transitive dependencies –

Hadoop’s transitive dependencies, borrowed from this blog post

This makes thing tricky, for example, we have found some funny salient problems when mixing and matching hadoop components willy-nilly:

MR2 is even more complicated than the original hadoop. No more are the start-all.sh/stop-all.sh scripts all that you need to run a cluster – there are more moving parts in hadoop 2.X.

The moral of the story hadoop ecosystem setups need to be carefully assembled.

Big Top to the rescue?

Big top is a meta project for the hadoop ecosystem that ties it all together. Here is how it works.

1) Big top builds the repos and packages for you

2) Bit top tests the packages for you

3) You can use big top packaged VMs to start out with a hadoop distribution intact and run your code against a psuedo-distributed standard hadoop setup. Although at the moment bigtop doesn’t produce a “full” VM with all of the ecosystem tools pre-installed, the current version does come with a working HDFS+MR2 stack.

How to set it up in 5 minutes on KVM

Virt-install is the linux command line VM construction workhorse. Cobbling it together with a hard link to the big top servers gives you a handy script for spinning up a hadoop VM in seconds:

wget http://bigtop01.cloudera.org:8080/job/Bigtop-VM-matrix/BR=master,KIND=kvm,label=fedora16/lastSuccessfulBuild/artifact/bigtop-vm-kvm-master.tar.gz

tar -xvf bigtop-vm-kvm-master.tar.gz

#Note, you can increase the RAW image size using qemu’s “resize” command, and then follow the instructions here to make the available, formatted FS size equal to the size of the RAW image. Then, you can run large MR jobs or the whole suite of bigtop tests on the VM.

virt-install –import -n vmname -r 2048 –os-type=linux –disk ./bigtop-vm-kvm-master/bigtop_hadoop-sda.raw,device=disk,bus=virtio,size=8,sparse=true,format=raw –vnc –noautoconsole

And now, you can fire up Virtual Machine Manager and log in:

username=root
password=bigtop

Now, you can run a local mapreduce job.

just to play with the hadoop MR2 libraries before actually testing your bigtop instance against HDFS. You can also use your big top VM to test your other custom map reduce jobs before deploying at scale.

Also, this is a good place to start if you want to run MR2 MapReduce jobs against a different (i.e. HCFS) file system.

export JAVA_HOME=/usr/lib/jvm/java-1.6.0/
export HADOOP_CLIENT_OPTS=”-Xmx1G $HADOOP_CLIENT_OPTS”

/usr/lib/hadoop/bin/hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples-2.0.5-alpha.jar pi -Dfs.default.name=file:/// -Dmapreduce.framework.name=local -Dyarn.resourcemanager.address=local 2 10

Now, lets look at the running hadoop processes by running the most important program in the entire world, jps:

Appliance:    bigtop_hadoop appliance 2.0
Hostname:    localhost.localdomain
IP Address:    192.168.122.119

[root@localhost ~]#
[root@localhost ~]#
[root@localhost ~]# jps
2869 NodeManager
1153 DataNode
2985 ResourceManager
28892 Jps
1327 SecondaryNameNode
2786 JobHistoryServer
1213 NameNode

Okay, so we’ve confirmed that MapReduce is installed and run a local job, and also confirmed that the psuedo distribtued hadoop components are working.

** Now… You can run Big Top Smoke tests on a real cluster **

To understand how to run big top on a cluster, you can start here, in your little VM. In any case, the directions are essentially the same:

1) Clone bigtop from github.

2) Build it with maven (this will install jars locally).

3) Customize your bigtop smokes/pom.xml files, if you want to remove some of the tests (i.e. comment out of the ecosystem smoke tests maven submodules)

4) Invoke the smokes/ task in maven, and wait for bigtop to report the output. (This could take a while).

The details…

The idiom for customizing big top tests wasn’t completely obvious at first, at least not to me. There are some preliminaries to understand:

1) iTest is a shell based unit testing framework that makes running command line apps and testing their results from groovy/java a breeze. Big top is based on itest.

2) Maven’s “verify” task, which comes AFTER the maven integration test part of the life cycle, will complete the process of running smoke tests for you. Be careful to run the verify task as stated in the BigTop readme.

The way that big top works is by testing at the “real world” level, that is, by running direct commands from a shell that exersize the components of your ecosystem. This means that the tests themselves don’t have direct compiletime dependencies on the various ecosystem tools. For example, here is a snippet from the smoke tests for pig:

shell.exec(“cat $DATA_DIR/mysql-create-db.sql | $MYSQL_COMMAND -u root ${”.equals(MYSQL_ROOTPW) ?: ‘-p’ + MYSQL_ROOTPW}”);
assertEquals(‘Unable to run mysql-create-db.sql script’, 0, shell.
shell.exec(“cat $DATA_DIR/mysql-load-db.sql | $MYSQL_COMMAND -u root testhbase”);
assertEquals(‘Unable to run mysql-load-db.sql script’, 0, shell.getRet());
println “MySQL database prepared for test”;
shell.exec(“cat $DATA_DIR/drop-table.hxt | $HBASE_HOME/bin/hbase shell”);
shell.exec(“cat $DATA_DIR/create-table.hxt | $HBASE_HOME/bin/hbase shell”)
def out = shell.out.join(‘\n’);

That is – the big top tests will invoke shell commands which pipe command text directly into executable programs in your ecosystem (note that the above dependency on mysql is a current requirement, but may be removed in the near future).

And finally… here’s how you test your cluster.

I’ve recently written up how to Customize your big top smoke testing suite. But of course, first you have to set up the tests.

Here is how that works:

1) To run big top as a smoke test for your cluster, you first grab the source :

git clone https://github.com/apache/bigtop.git

Make sure you have maven installed.

2) In order to run the smoke tests, you follow the README instructions, which are relatively simple – just two steps, first one installs and builds the jars locally using maven, the second runs the tests, again using maven.

First, build all the smoke tests. This will install corresponding jar’s locally:

#set hadoop home first, its required.
export HADOOP_HOME=<your hadoop install location>
cd ./bigtop-tests/test-execution/smokes && mvn install -DskipTests -DskipITs -DperformRelease 

3) Now, you will probably want to customize your bigtop tests before running the whole battery. But to start, you can run them all just to see the way that the bigtop suite works.

And finally, invoke the test execution:

mvn -fae clean verify -Dorg.apache.bigtop.itest.log4j.level=TRACE -f bigtop-tests/test-execution/smokes/pom.xml

4) All the test results will be streamed to standard out at the end (warning: this takes a while). Also note that currently there are a few other dependencies, such as MySQL if running sqoop tests (side note: there is a JIRA currently in to update this to a pure java DB). Finally, to deep-dive into a test failure, you can check the individual smoke test target directories:

cat ./smokes/mahout/target/failsafe-reports/org.apache.bigtop.itest.mahout.smoke.TestMahoutExamples.txt

Simplified ecosystem infrastructure with Big Top.

BLOG

Looking back at 2020 – with g...

Update from the team

Building a longer term focus for Gl...