The Gluster Blog

The k/v pair salmon run in mapreduce -> hdfs.

Gluster
2013-04-23
The HDFS write path is lonnnng and hairy.  Here’s some imagery of it (somewhat raw and undervalidated, so please comment if something looks funny).

Have you ever seen those little salmon that swim ALL THE WAY up the river from the ocean, just to breed?  Well, that’s kinda how k/v pairs in MapReduce applications work.  They have to go a LONG WAY before they finally get to reside somewhere permanently on local disk.

The fact that MapReduce abstracts “key/value” pairs as an application level nicety makes the write path for a real file very intriguing.

First off – MapReduce distributes your k/v pairs into partitions: 

  • For a given MapReduce job, you typically have several output files.  These are called partitions (part-r-00000, part-r-00001, …).
  • Each file in HDFS is broken into BLOCKS.  
  • The partitions are requested by the MapReduce layer – every time a task that writes job output runs (a reducer, or a mapper in a map-only job), a “part-****” file output stream is created.  This is done by the FileOutputFormat classes.
  • The first “level” of buffering that you control is in the FileOutputFormat, which takes k/v pairs directly.  Although TextOutputFormat doesn’t seem to buffer, other output formats (such as SequenceFileOutputFormat) actually do.
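
To make the buffering difference concrete, here is a toy sketch in Python (not Hadoop code). Both writer classes and the batch_size parameter are made up for illustration. The first hands every k/v pair straight to the stream as a tab-separated line, which is TextOutputFormat's default layout. The second holds pairs in memory and flushes them in batches, which is only the general idea of a buffering writer and not how SequenceFileOutputFormat is actually implemented.

```python
import io


class ToyTextRecordWriter:
    """Hypothetical unbuffered writer: one 'key<TAB>value' line per pair."""

    def __init__(self, stream):
        self.stream = stream

    def write(self, key, value):
        # Each pair is serialized and handed to the stream right away.
        self.stream.write(f"{key}\t{value}\n".encode("utf-8"))


class ToyBufferedRecordWriter:
    """Hypothetical buffering writer: holds pairs until a batch is full."""

    def __init__(self, stream, batch_size=3):
        self.stream = stream
        self.batch_size = batch_size  # made-up knob, not a Hadoop setting
        self.pending = []

    def write(self, key, value):
        self.pending.append((key, value))
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        # Serialize everything held in memory, then empty the buffer.
        for key, value in self.pending:
            self.stream.write(f"{key}\t{value}\n".encode("utf-8"))
        self.pending.clear()

    def close(self):
        self.flush()


out = io.BytesIO()
writer = ToyBufferedRecordWriter(out, batch_size=2)
writer.write("salmon", 1)   # nothing reaches the stream yet
writer.write("river", 2)    # batch is full, both pairs are flushed
writer.close()
```

Either way, the stream underneath only ever sees bytes; the k/v abstraction ends at the writer.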

Note: Partitions are a user-level feature – they are the fundamental mechanism for distributing algorithms over a cluster.  Since each partition corresponds to a single reducer, you need to be careful that you partition your workloads evenly – otherwise you’ll get the “long-tail” problem (example: a web crawler that uses domain roots as keys with default partitioning will be extremely inefficient, because all the pages of the most common sites will be crawled in a single reducer).
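
Here is a rough Python sketch of that long-tail effect. It is a toy model, not Hadoop's partitioner: toy_partition uses a CRC32 hash modulo the reducer count as a stand-in for hash partitioning, and the domain names and counts are invented.

```python
import zlib
from collections import Counter


def toy_partition(key, num_reducers):
    # Assumption: a stable hash of the key, modulo the number of reducers.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def reducer_loads(keys, num_reducers):
    # Count how many records each reducer would receive.
    loads = Counter({r: 0 for r in range(num_reducers)})
    for key in keys:
        loads[toy_partition(key, num_reducers)] += 1
    return loads


# Invented workload: one very popular domain plus many small ones.
keys = ["popular.example"] * 9000 + [f"site{i}.example" for i in range(1000)]
loads = reducer_loads(keys, num_reducers=4)
print(sorted(loads.values(), reverse=True))
```

Every record with the key "popular.example" lands on the same reducer, so that one reducer carries at least 9000 of the 10000 records while the other three split what is left.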

Next: The partition files are broken into blocks, and written to the DFS:

  • Buffering of writes occurs inside of the DataStreamer, which creates a “blockStream” for writing.
  • The main job of the DFSOutputStream class is to translate bytes into packets that can be written and acknowledged reliably.
  • The DFSOutputStream uses its inner DataStreamer class to handle the logic of creating the OutputStreams that write contents directly to a block and acknowledge the progress of those writes.
  • Writing to the DFSOutputStream is fast – there is no synchronous waiting on remote calls.  All acks are handled asynchronously (if you’re reading this post, though, you probably already know that).
  • The DataStreamer picks the packets up off the dataQueue and parks each one on the ackQueue until it is acknowledged.  Once a packet is the “last” one in a block, the block is closed for writing, and a new one is created.

There are thus at least two layers of client side buffering that occur in HDFS – one, the buffering of the output stream which is directly writing over a socket to remote blocks, and two, the buffering which occurs as a natural consequence of the fact that “Packets” accumulate a certain amount of bytes in memory before they are put on the write queue. 
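
The two layers can be sketched as a single-threaded toy model in Python. Nothing here is Hadoop code: ToyOutputStream, its method names, and the tiny PACKET_SIZE and PACKETS_PER_BLOCK values are all made up for illustration, and the real streamer runs on a background thread. The model keeps one queue of packets waiting to be sent and one queue of packets waiting for an ack.

```python
from collections import deque

PACKET_SIZE = 4        # toy value; real packets are far larger
PACKETS_PER_BLOCK = 2  # toy value; real blocks hold many packets


class ToyOutputStream:
    def __init__(self):
        self.current = bytearray()  # layer 1: bytes filling the current packet
        self.data_queue = deque()   # layer 2: full packets waiting to be sent
        self.ack_queue = deque()    # packets sent but not yet acknowledged
        self.blocks = [[]]          # stand-in for the remote blocks
        self.seqno = 0

    def write(self, data):
        # Only touches memory, so the caller never waits on a remote call.
        for byte in data:
            self.current.append(byte)
            if len(self.current) == PACKET_SIZE:
                self._enqueue_packet()

    def _enqueue_packet(self):
        last_in_block = (self.seqno + 1) % PACKETS_PER_BLOCK == 0
        self.data_queue.append((self.seqno, bytes(self.current), last_in_block))
        self.current.clear()
        self.seqno += 1

    def streamer_step(self):
        # Send one queued packet; return False when nothing is waiting.
        if not self.data_queue:
            return False
        packet = self.data_queue.popleft()
        self.ack_queue.append(packet)
        _, payload, last_in_block = packet
        self.blocks[-1].append(payload)
        if last_in_block:
            self.blocks.append([])  # close this block and open a new one
        return True

    def ack(self):
        # Acknowledge the oldest in-flight packet and return its sequence number.
        return self.ack_queue.popleft()[0]

    def close(self):
        if self.current:
            self._enqueue_packet()  # flush the partially filled packet
        while self.streamer_step():
            pass
        while self.ack_queue:
            self.ack()


stream = ToyOutputStream()
stream.write(b"abcdefghij")  # two full packets queued, two bytes still buffered
stream.close()
```

Note that write() returns as soon as the bytes are in memory; sending and acknowledging happen in separate steps, which is the point of the asynchronous design described above.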

Inspect the k/v salmon-run write path for yourself :


There could be some ambiguities or (gasp) inaccuracies in the diagram (its graphviz source is at the end of this post). Please do feel free to validate it and comment.  The class names correspond directly to those used in the nodes of this graph.  The GitHub URLs for the corresponding Hadoop projects are:

  • https://github.com/apache/hadoop-common
  • https://github.com/apache/hadoop-hdfs
  • https://github.com/apache/hadoop-mapred 

Of course, it would behoove you to scour this code using Eclipse, since there are hundreds of relevant classes, and you can easily build Eclipse projects from the subprojects by running “mvn eclipse:eclipse”.

You might also want to run the full build.  In order to do that you’ll have to have protobuf (the protoc compiler) installed: http://stackoverflow.com/questions/15745010/org-apache-maven-plugin-mojoexecutionexception-protoc-failure

Generating the graph:
This graph can be generated in graphviz using the neato layout, or on erdos (http://sandbox.kidstrythisathome.com/erdos/), which can visualize reasonably sized graphviz snippets.

digraph g {
  node [shape=record];
  MapOutputCollector [label="<f1> DirectMapOutputCollector|<f2> MapOutputBuffer"];
  DFSClient -> DFSOutputStream [label="writes to"];
  DFSOutputStream -> Streamer [label="create"];
  DFSOutputStream -> DataQueue [label="puts packets"];
  Streamer -> DataQueue [label="take packets"];
  Streamer -> DataNode [label="write packet"];
  Streamer -> Socket [label="read ack"];
  DataNode -> Socket [label="write ack"];
  DistributedFileSystem -> DFSClient [label="creates a"];
  TaskTracker -> MapTask [label="creates"];
  MapTask -> UserMapper [label="run(context,rReader,rWriter)"];
  UserMapper -> MapOutputCollector [label="forwards (k,v) writes to"];
  MapOutputCollector -> SequenceFileOutputFormat [label="writes (k,v) to"];
  SequenceFileOutputFormat -> SequenceFileOutputFormat_Writer [label="creates inner"];
  SequenceFileOutputFormat_Writer -> FSDataOutputStream [label="writes bytes to"];
  TextOutputFormat_Writer -> FSDataOutputStream [label="writes bytes to"];
  FSDataOutputStream -> DistributedFileSystem [label="connects to"];
}
