The Gluster Blog


Hadoop : Gotchas when gluing it all together. Part 1

Gluster
2012-03-19

Today I’m writing a new map/reduce job from scratch, trying to copy as little code as possible from other jobs. The principles behind Hadoop are simple: you map each record, independently, into key->value[] pairs, convert those keys to integers, and distribute the work across machines using those integer ids. This gets you parallelism for free. However, nothing is free. The stateless nature of this process, which provides us with so much scalability, comes at a high cost: it challenges some of the static coddling that Java has given us for so many years. This doesn’t become clear until you start writing your own map/reduce jobs with custom types, or until you get away from the single-mapper->single-reducer paradigm which we’ve all come to know and love from our “hello world” WordCount examples.

If map/reduce is simple, then why is Hadoop tricky?

– Hadoop is dynamic: you can’t always hide behind the static type conventions we have come to love.
– Hadoop data processing flows are abstract: directories, temporary files, and other aspects of its batch processing are handled for you, under the hood. This means that when things go wrong, the problem is not necessarily in your code.
– Hadoop “tries” to assume things. For example, it assumes that you want a reduce phase unless you tell it otherwise. It ‘sort of’ assumes your inputs are of type [Long -> Text] unless you tell it otherwise. Etc., etc…

– Although Hadoop can simulate the “cloud” locally, it’s just a simulation. The libraries and local Java tricks you’re used to (i.e. manually specifying the main class, fat jars, -Xmx2G, …, and of course, file paths) have to be modular enough that they aren’t “lost in translation” when you deploy for real. This also means that there will be aspects of your code which appear meaningless when you’re developing on a local machine.

Context: this map/reduce job simply reads in some JSON text, parses it, and serializes some of the fields into Java beans. There is no reducer; the work done by the mapper is the final output. I have not included the full code here, but it isn’t needed: anyone who has ever read through a simple Hadoop tutorial should be familiar with the snippets below. Of course, if you’re interested in learning more, you can always twitter me for more details.
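To make the gotchas below concrete, here is a minimal sketch of what a mapper along those lines might look like. The author's actual code is not shown: JsonToBeanMapper, the bean's fields, and the JSON field names are invented for illustration (only the name MySerializedBean comes from the post), and the parsing assumes the Jackson 1.x classes that shipped alongside Hadoop at the time.

    // Hedged sketch of a map-only job: parse a line of JSON, copy fields into a bean,
    // and emit the bean directly (no reducer). Class and field names are made up.
    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.codehaus.jackson.JsonNode;
    import org.codehaus.jackson.map.ObjectMapper;

    public class JsonToBeanMapper
            extends Mapper<LongWritable, Text, MySerializedBean, NullWritable> {

        private final ObjectMapper jsonParser = new ObjectMapper();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            JsonNode node = jsonParser.readTree(line.toString());
            MySerializedBean bean = new MySerializedBean();
            bean.name = node.get("name").getTextValue();   // hypothetical field
            bean.count = node.get("count").getIntValue();  // hypothetical field
            // No reducer: whatever the mapper writes is the final output.
            context.write(bean, NullWritable.get());
        }
    }

    // Minimal Writable so Hadoop can serialize the bean as a map output key.
    class MySerializedBean implements Writable {
        String name;
        int count;

        public void write(DataOutput out) throws IOException {
            out.writeUTF(name);
            out.writeInt(count);
        }

        public void readFields(DataInput in) throws IOException {
            name = in.readUTF();
            count = in.readInt();
        }
    }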

…. So here’s where the story begins …. 

So, to start off my Map/Reduce day, I created a directory called “MyOutputDirectory”. I figured I could just write all my files out to that directory while testing my code; then I would easily know where to look after I ran my test jobs. Bad idea.

1) Exception in thread “main” org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory MyOutputDir/ already exists.

This is the first error of the day. In hindsight, it’s pretty simple: when you launch a Hadoop job, you need to point it at a fresh output directory. Hadoop creates this directory FOR YOU. Thus, you can’t run the same job twice with the same output path argument. Second of all, when we use

        FileOutputFormat.setOutputPath(job, out);

we are actually setting an output DIRECTORY. That directory will contain the famous part-r-* files (part-m-* for map-only jobs). So remember: your jobs should either be smart enough to make up new directory names (i.e. using millisecond timestamps, or whatever), or they should delete their output after completion (i.e. in the case that you’re simply writing unit tests or other development-related sandbox tasks).

SOLUTION : HAVE A STRATEGY FOR DEFINING the output paths THAT IS DYNAMIC and SIMPLE – and STICK TO IT ! 
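Below is a minimal sketch of two such strategies, assuming the mapreduce-API Job object; the helper class, method names, and path prefix are made up for illustration.

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class OutputPaths {

        // Strategy 1: never collide -- give every run its own timestamped directory.
        public static Path timestampedOutput(Job job, String prefix) {
            Path out = new Path(prefix + "-" + System.currentTimeMillis());
            FileOutputFormat.setOutputPath(job, out);
            return out;
        }

        // Strategy 2 (development/sandbox only): delete the previous run's output first.
        public static void clobberOutput(Job job, Path out) throws java.io.IOException {
            FileSystem fs = FileSystem.get(job.getConfiguration());
            if (fs.exists(out)) {
                fs.delete(out, true); // recursive delete of the old directory
            }
            FileOutputFormat.setOutputPath(job, out);
        }
    }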

Now that Hadoop was happy that I wasn’t “going over its head” by making my output directory for it, I was able to launch a job.

Okay … so now that all the “glue” was working, I could actually run the whole mapper. 

2) java.io.IOException: wrong key class: a.b.c.MySerializedBean is not LongWritable

The next aggravation I dealt with was regarding these lines, staples of any Map/Reduce job:

        job.setMapOutputKeyClass(MySerializedBean.class);
        job.setMapOutputValueClass(NullWritable.class);

Seems innocuous, right? I’m mapping my serialized beans as output. There is no reducer output. So that’s it.

NOPE. It turns out that if you don’t have a reducer, Hadoop will assume that you’re using the default output key/value classes (LongWritable and Text). Why? Well… I guess because the WordCount example is the benchmark default for Hadoop, so this made sense. Thus, you have to additionally specify:

        job.setOutputKeyClass(MySerializedBean.class);
        job.setOutputValueClass(NullWritable.class);

Then it works. But… guess what else? It turns out that the first two lines can be commented out! That is, in the case that you don’t have a reducer, the output key/value class specification is enough. I’m actually not sure why that is the case; I’ll have to defer to my secret weapon, the datasalt folks, on this one. Any thoughts, Pere? 🙂

3)   java.lang.ClassCastException: class a.b.c.MySerializedBean

The class cast exception is a tricky one that pops up in a lot of places, due to Hadoop’s highly dynamic architecture. In this case, my job has no reducer. This means that Hadoop, while trying to create an OutputKeyComparator (needed to sort keys for the reducer), fails.

SOLUTION: job.setNumReduceTasks(0);
This tells Hadoop not to worry about the reducer at all. I’m not 100% happy with this as a solution: I think it’s a little too imperative.

4) Just to demonstrate the fourth point above (i.e. that you have to take precautions during development so that your code runs normally in the clustered environment), I deleted the following line of code, which is common in map/reduce jobs:

    myJob.setJarByClass(MyClass.class);

Guess what? Nothing broke locally. Guess what else? I bet this would be a huge bug if this deletion got pushed to the cluster 🙂
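Putting the fixes together, a driver for the map-only job might look something like the sketch below. MyClass is used here as a hypothetical driver class, JsonToBeanMapper and MySerializedBean are the stand-ins from the earlier sketch, and the input/output paths are illustrative; comments mark which gotcha each line addresses.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MyClass {
        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "json-to-beans");

            // Gotcha 4: without this the job runs locally, but on a real cluster
            // the task JVMs can't find your classes.
            job.setJarByClass(MyClass.class);

            job.setMapperClass(JsonToBeanMapper.class);

            // Gotcha 3: no reducer at all, so Hadoop never builds an OutputKeyComparator.
            job.setNumReduceTasks(0);

            // Gotcha 2: for a map-only job, the output key/value classes are what count.
            job.setOutputKeyClass(MySerializedBean.class);
            job.setOutputValueClass(NullWritable.class);

            // Gotcha 1: a fresh output directory per run.
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job,
                    new Path(args[1] + "-" + System.currentTimeMillis()));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }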

THIS IS JUST PART 1 – Mischa hasn’t run this in a cluster yet; we’ll see what happens when he gets back to me.

Some things that might be cool to augment Hadoop with in the future:

1) A higher-level framework which forces us, by use of abstract methods and interfaces, to declare key/value classes in a consistent way, and which analyzes our map/reduce jobs at runtime, with the intent of giving us “developer feedback” in an intuitive manner.

2) Default outputs are probably not worth the trouble: I’d rather be thrown a “you forgot to specify a default input type” error than a “class cast exception” at runtime. There is a very low cost to the two lines of code necessary for specifying key/value input/output types, and a high benefit to readability and debuggability. Again, a higher-level framework for writing map/reduce jobs might be in order here (a rough sketch of such a wrapper follows this list).

3) Overall, I think the take-home is that, over time, it will be nice to watch the Hadoop API become less imperative and more declarative. The datasalt folks have already seen some of these issues and created the Pangool API to deal with them.
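For what it’s worth, here is a rough, purely illustrative sketch of the kind of wrapper described in points 1 and 2. This is not an existing Hadoop or Pangool API; it just shows the idea of forcing type declarations up front and failing fast with a readable message.

    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;

    // Illustrative only: a base class that turns "forgot to declare a type" into a
    // readable error instead of a runtime ClassCastException.
    public abstract class DeclarativeJob {

        // Subclasses must state their types; there is no silent default.
        protected abstract Class<? extends Mapper> mapperClass();
        protected abstract Class<?> outputKeyClass();
        protected abstract Class<?> outputValueClass();

        public void configure(Job job) {
            if (outputKeyClass() == null || outputValueClass() == null) {
                throw new IllegalStateException(
                        "Declare output key/value classes before submitting the job");
            }
            job.setJarByClass(getClass());
            job.setMapperClass(mapperClass());
            job.setOutputKeyClass(outputKeyClass());
            job.setOutputValueClass(outputValueClass());
        }
    }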
