The Gluster Blog


Do Hadoop Adapters Make Sense?

Gluster
2012-07-24

Daniel Abadi described his blog entry about Hadoop connectors as a “Stonebraker-style rant” and then delivered on the threat. Like everything Stonebraker has written in the last five years, it’s based on a fundamentally flawed premise, which is that HDFS stores unstructured data. This assumption is not clearly stated, but it’s pretty clear from context, e.g.

“primary storage in Hadoop (HDFS) is a file system that is optimized for unstructured data”

Among storage folks, “unstructured” means stuff that is written using any of the millions of applications of the last thirty years that write their data using the storage API that’s built into every operating system – i.e. files. HDFS does not store unstructured data. If it did, there would be no need to import true unstructured data into HDFS. HDFS is not quite a real file system, in the sense of one usable by a significant percentage of applications. It’s designed to support Hadoop itself and that’s pretty much all it does with any degree of competence. Can you extract and build a Linux kernel source tree in HDFS? Even if you could, how well do you think a system designed around managing 128MB chunks would work for that? (BTW yes, I can think of ways to analyze this kind of data using Hadoop, so it’s not an entirely silly example. Use your imagination.)

Going back to Daniel’s argument, RDBMS-to-Hadoop connectors are indeed silly because they incur a migration cost without adding semantic value. Moving from one structured silo to another structured silo really is a waste of time. That is also exactly why filesystem-to-Hadoop connectors do make sense, because they flip that equation on its head – they do add semantic value, and they avoid a migration cost that would otherwise exist when importing data into HDFS. Things like GlusterFS’s UFO or MapR’s “direct access NFS” decrease total time to solution vs. the HDFS baseline. I’d put DataStax’s Brisk into the same category, even though the storage it provides is structured underneath, because it also works “inline” like the native file system alternatives instead of requiring an import phase like HDFS. The fact that you can also use CassandraFS to work with file-based applications is just icing on the cake.
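To make the “total time to solution” comparison concrete, here is a toy cost model. It is only a sketch: the function names and every number in it are made up for illustration, not measurements of HDFS, GlusterFS, MapR, or Brisk. It assumes the import-based path pays a copy cost before each analysis, while the inline path analyzes the data where it already lives.

```python
def import_then_analyze(import_s: float, analyze_s: float) -> float:
    """Hypothetical HDFS-style baseline: copy the data in, then run the job."""
    return import_s + analyze_s


def analyze_in_place(analyze_s: float) -> float:
    """Hypothetical inline connector: run the job against the existing files."""
    return analyze_s


def time_saved(import_s: float, baseline_analyze_s: float, inline_analyze_s: float) -> float:
    """Seconds saved by the inline path; negative if the baseline is faster."""
    return import_then_analyze(import_s, baseline_analyze_s) - analyze_in_place(inline_analyze_s)


if __name__ == "__main__":
    # Illustrative numbers only: a 10-minute import, a 100 s job on the
    # baseline, and a somewhat slower 120 s job when run inline.
    print(time_saved(import_s=600, baseline_analyze_s=100, inline_analyze_s=120))
```

With these made-up figures the inline path comes out ahead even though its analysis step is slower, because it never pays the import cost. Real results depend entirely on the actual import and job times.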

So, are Hadoop adapters a good idea or a bad one? It depends on what kind you’re talking about. Daniel and I can agree that an ETL-style RDBMS connector doesn’t make any sense at all, but I believe that an inline unstructured-data connector is a different story altogether.
