The Gluster Blog

Enabling Apache Hadoop on GlusterFS: glusterfs-hadoop 2.1 released

Gluster
2013-09-05

The Gluster community is pleased to announce a major update to the glusterfs-hadoop project with the release of version 2.1. The glusterfs-hadoop project provides an Apache-licensed Hadoop FileSystem plugin that enables Apache Hadoop 1.x and 2.x to run directly on top of GlusterFS. This release includes a re-architected plugin that now extends Hadoop's existing support for running on local and POSIX file systems.

Overview

Apache Hadoop has a pluggable FileSystem architecture. If you have a filesystem or object store that you would like to use with Hadoop, you can create a Hadoop FileSystem plugin for it, which acts as a mediator between the generic Hadoop FileSystem interface and your filesystem of choice. A popular example is Amazon S3: over a million Hadoop clusters are spun up on Amazon every year, many of which use S3 as the Hadoop FileSystem.
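The mediator idea can be sketched in a few lines. What follows is a toy Python model, not Hadoop's actual Java API: the `FileSystem` and `InMemoryFileSystem` classes, the `register` and `open_uri` helpers, and the `demo` scheme are all hypothetical names chosen for illustration. It only shows the general pattern of dispatching on a URI scheme to whichever backend has been plugged in.

```python
from abc import ABC, abstractmethod
from urllib.parse import urlparse


class FileSystem(ABC):
    """Generic interface the framework codes against (hypothetical, not Hadoop's API)."""

    @abstractmethod
    def read(self, path: str) -> bytes:
        ...


class InMemoryFileSystem(FileSystem):
    """Stand-in backend; a real plugin would talk to GlusterFS, S3, and so on."""

    def __init__(self, files):
        self._files = dict(files)

    def read(self, path):
        return self._files[path]


# Registry mapping a URI scheme to the plugin that handles it.
_PLUGINS = {}


def register(scheme, fs):
    """Plug a backend in under the given URI scheme."""
    _PLUGINS[scheme] = fs


def open_uri(uri):
    """Pick the plugin by URI scheme and delegate the read to it."""
    parsed = urlparse(uri)
    try:
        fs = _PLUGINS[parsed.scheme]
    except KeyError:
        raise ValueError(f"no filesystem plugin registered for scheme {parsed.scheme!r}")
    return fs.read(parsed.path)


register("demo", InMemoryFileSystem({"/input/a.txt": b"hello"}))
print(open_uri("demo:///input/a.txt"))
```

The calling code never names a concrete backend; swapping storage systems only means registering a different implementation under a different scheme.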

The plugin requires a specific deployment configuration. The Hadoop JobTracker and TaskTrackers (or the Hadoop 2.x equivalents) must be installed on servers within the Gluster trusted storage pool for a given Gluster volume. The JobTracker uses the plugin to query the extended attributes of job input files in Gluster to ascertain file placement as well as the distribution of file replicas across the cluster. The TaskTrackers use the plugin to access the data required for their tasks through a local FUSE mount of the Gluster volume. When the JobTracker receives a Hadoop job, it uses the locality information obtained via the plugin to send the job's tasks to TaskTrackers on servers that hold the required data within their local bricks. This ensures that data is read from local disk rather than over the network. The diagram below provides an overview of the entire solution for a Hadoop 1.x deployment.

Figure 1 – Solution Architecture
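The locality-aware placement described above can be illustrated with a small, self-contained sketch. This is not the JobTracker's real scheduling logic or the plugin's code: the `assign_tasks` function, the host names, and the replica map are hypothetical, and the least-loaded tie-break and the remote fallback are assumptions added to keep the example complete. The replica map stands in for the placement information the plugin reads from extended attributes.

```python
def assign_tasks(file_replicas, trackers):
    """Assign each input file's task to a tracker that holds a replica locally.

    file_replicas: dict of file path -> list of hosts holding a replica
                   (hypothetical stand-in for data read from extended attributes).
    trackers:      list of hosts running a TaskTracker.
    Returns a dict of file path -> (host, is_local).
    """
    load = {host: 0 for host in trackers}
    assignment = {}
    for path, hosts in sorted(file_replicas.items()):
        local = [h for h in hosts if h in load]
        # Prefer the least-loaded tracker with a local replica (assumption);
        # otherwise fall back to the least-loaded tracker overall, which
        # would mean a remote read.
        candidates = local or trackers
        chosen = min(candidates, key=lambda h: (load[h], h))
        load[chosen] += 1
        assignment[path] = (chosen, bool(local))
    return assignment


replicas = {
    "/data/part-0": ["server1", "server2"],
    "/data/part-1": ["server2", "server3"],
    "/data/part-2": ["server1", "server3"],
}
plan = assign_tasks(replicas, ["server1", "server2", "server3"])
for path, (host, is_local) in plan.items():
    print(path, "->", host, "(local)" if is_local else "(remote)")
```

With TaskTrackers running on every server that holds a brick, each task in this example lands on a host that already stores its input, which is the property the deployment requirement is meant to guarantee.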

glusterfs-hadoop

The community project, along with the documentation and available releases, is hosted within the Gluster Forge. The glusterfs-hadoop project will also be available within the Fedora 20 release later this year, alongside fellow Fedora newcomer Apache Hadoop and the already available gluster project. The glusterfs-hadoop project team welcomes contributions and participation from the broader community.

Stay tuned for upcoming posts around GlusterFS integration into the Apache Ambari and Fedora projects.
