The HDFS write path is lonnnng and hairy. Here's some imagery of it (somewhat raw and undervalidated, so please comment if something looks funny).
Have you ever seen those little salmon that swim ALL THE WAY up the river, from the ocean, just to breed? Well, that's kinda how k/v pairs in MapReduce applications work. They have to go a LONG WAY before they finally get to reside somewhere permanently on local disk.
The fact that MapReduce abstracts "key/value" pairs as an application-level nicety makes the write path for a real file very intriguing.
First off – MapReduce distributes your k/v pairs into partitions:
Note: Partitions are a user-level feature, and they are the fundamental mechanism for distributing algorithms over a cluster. Since each partition corresponds to a single reducer, you need to be careful to partition your workload evenly, otherwise you'll get the "long-tail" problem (example: a web crawler that keys records by domain root with default partitioning will be extremely inefficient, because the most popular sites will all be crawled by a single reducer).
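To make that concrete, here is a minimal sketch of a custom partitioner against the org.apache.hadoop.mapreduce API. The class name UrlHashPartitioner and the choice to hash the full URL key (rather than just its domain root) are illustrative assumptions on my part, not anything from the Hadoop codebase:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical partitioner for the crawler example: hashing the full URL
// (instead of the domain root) spreads a hot site's pages across reducers.
public class UrlHashPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        // Mask off the sign bit so the modulo result is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

You would wire it in with job.setPartitionerClass(UrlHashPartitioner.class). Absent that, the default HashPartitioner applies the same modulo trick to whatever key you emit, which is exactly what bites the domain-root crawler.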
Next: The partition files are broken into blocks, and written to the DFS:
There are thus at least two layers of client-side buffering in HDFS: first, the buffering of the output stream, which writes directly over a socket to remote blocks; and second, the buffering that occurs as a natural consequence of the fact that "Packets" accumulate a certain number of bytes in memory before they are put on the write queue.
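For orientation, here is what those layers look like from the client's side, using only the stock org.apache.hadoop.fs API. The path and payload are made up, and the ~64 KB packet figure is just the usual dfs.client-write-packet-size default, so treat this as a rough sketch:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // With fs.defaultFS pointing at hdfs://, this is a DistributedFileSystem.
        FileSystem fs = FileSystem.get(conf);
        FSDataOutputStream out = fs.create(new Path("/tmp/salmon-run.txt"));
        // Layer one: these bytes land in the output stream's client-side buffer.
        out.write("upstream we go\n".getBytes("UTF-8"));
        // Layer two: under the hood, DFSOutputStream accumulates buffered bytes
        // into Packets (roughly 64 KB each by default) before queueing them for
        // the Streamer thread, which pushes them to the DataNode pipeline.
        out.close(); // flushes any half-full packet and blocks until acked
    }
}

Note that DFSOutputStream never appears in the snippet; fs.create() hands back an FSDataOutputStream that wraps it, which is exactly the hand-off the graph below traces.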
Inspect the k/v salmon-run write path for yourself:
There could be some ambiguities or (gasp) inaccuracies in the diagram above. Please do feel free to validate it and comment. The class names correspond directly to those used in the nodes of this graph, and they all live in the Apache Hadoop source on GitHub: https://github.com/apache/hadoop.
Of course, it would behoove you to scour this code in Eclipse, since there are hundreds of relevant classes, and you can easily build Eclipse projects from the sub-projects by running "mvn eclipse:eclipse".
You might also want to run the full build. In order to do that you'll have to have Protocol Buffers (protoc) installed: http://stackoverflow.com/questions/15745010/org-apache-maven-plugin-mojoexecutionexception-protoc-failure.
Generating the graph:
This graph can be generated in Graphviz using the neato layout (for example, save the snippet below as writepath.dot and run "neato -Tpng writepath.dot -o writepath.png"), or on erdos (http://sandbox.kidstrythisathome.com/erdos/), which can visualize reasonably sized Graphviz snippets.
digraph g {
  node [shape=record];
  MapOutputCollector [label="<f1> DirectMapOutputCollector|<f2> MapOutputBuffer"];
  DFSClient -> DFSOutputStream [label="writes to"];
  DFSOutputStream -> Streamer [label="creates"];
  DFSOutputStream -> AckQueue [label="puts packets"];
  Streamer -> AckQueue [label="takes packets"];
  Streamer -> DataNode [label="writes packet"];
  Streamer -> Socket [label="reads ack"];
  DataNode -> Socket [label="writes ack"];
  DistributedFileSystem -> DFSClient [label="creates a"];
  TaskTracker -> MapTask [label="creates"];
  MapTask -> UserMapper [label="run(context,rReader,rWriter)"];
  UserMapper -> MapOutputCollector [label="forwards (k,v) writes to"];
  MapOutputCollector -> SequenceFileOutputFormat [label="writes (k,v) to"];
  SequenceFileOutputFormat -> SequenceFileOutputFormat_Writer [label="creates inner"];
  SequenceFileOutputFormat_Writer -> FSDataOutputStream [label="writes bytes to"];
  TextOutputFormat_Writer -> FSDataOutputStream [label="writes bytes to"];
  FSDataOutputStream -> DistributedFileSystem [label="connects to"];
}