Why Does Cloudera Really Use HDFS?

Gluster

2012-07-26

Apparently, someone in Hadoop-land is getting worried about alternatives to HDFS, and has decided to address that fear via social media instead of code. Two days ago we had Daniel Abadi casting aspersions on Hadoop adapters. Today we have Charles Zedlewski explaining why Cloudera uses HDFS. He mentions a recent GigaOm article listing eight alternatives, and comes up with a couple more, but still manages to miss at least one. (Hint: what site are you reading right now?) He also makes a really weird comparison to the history of Linux, as though HDFS were the only option that’s open and hardware-agnostic. What crap. Practically all of the alternatives share those qualities. He makes some other misleading claims on HDFS’s behalf as well.

“it has excellent … high availability (that’s right folks, drop the SPOF claims, you can download CDH4 here!)”

That’s right, folks. After years of denying that the NameNode SPOF mattered, then more time wasted trying to push the problem off on someone else (e.g. shared SAN or NFS), HDFS finally has its very own high availability. Congratulations on reaching parity with the rest of the world. I’d hold off on “excellent” until more than a couple of real-world users have tried it, though.

“[HDFS offers] Choice – Customers get to work with any leading hardware vendor and let the best possible price / performer win the decision, not whatever the vendor decided to bundle in.”

Um, right. How many of those dozen-or-more alternatives can I not deploy on just as wide a range of hardware and operating systems as HDFS? A couple, maybe? Pure FUD.

“Portability – It is possible for customers running Hadoop distributions based on HDFS to move between those different distributions without having to reformat the cluster or copy massive amounts of data.”

This is also not an HDFS exclusive. Any of the alternatives that were developed outside the Hadoopiverse have this quality as well. If you have data in Cassandra or Ceph you can keep it in Cassandra or Ceph as you go Hadoop-distro shopping. The biggest data-portability wall here is HDFS’s, because it’s one of only two such systems (the other being MapR) that are Hadoop-specific. It doesn’t even try to be a general-purpose filesystem or database.

A tremendous amount of work has gone into several excellent tools to import data into HDFS, but that work wouldn’t even be necessary with some of the alternatives. That’s not just a waste of machine cycles; it’s also a waste of engineer cycles. If they hadn’t been stuck in the computer equivalent of shipping and receiving, the engineers who developed those tools might have created something even more awesome. I know some of them, and they’re certainly capable of it.

Each application can write the data it generates using some set of interfaces. If HDFS isn’t one of those, or if HDFS through that interface is unbearably slow because the HDFS folks treat anything other than their own special snowflake as second class, then you’ll be the one copying massive amounts of data before you can analyze it . . . not just once, but every time.

That brings us to performance, which is also interesting because Charles barely mentions it (only tangentially, in his “choice” point). Isn’t Hadoop supposed to be all about going fast? Why else would they take shortcuts like telling the caller that a write is complete when in fact it has only been buffered to local disk? I’m not going to say HDFS is not fast, but many of the alternatives are provably faster. That’s why they’re the ones pushing data faster than any Hadoop installation ever has, on the world’s biggest systems. I’m sure Hadoop is present on those systems too, and so HDFS probably is as well, but it’s not the thing actual applications are using. Why not?

I suspect that the real reason Cloudera uses HDFS is not anything Charles mentions. I don’t know why they built it originally, since some of the alternatives already existed and a few more were at least well along as new projects. Maybe it’s because they wanted something written in the same language/environment as the rest of Hadoop, so instead of contributing patches to an existing filesystem for the one new feature they needed (data-locality queries) they went and built their own not-quite-filesystem. Maybe it’s because they assumed that whatever GoogleFS did was The Right Thing, so they cloned it and then kept thinking that “one more workaround” would make it all that they had originally hoped. Either way, whether it was NIH syndrome or NIAG (Not Invented At Google) syndrome, it developed its own inertia.

At this point the issue is more likely to be familiarity. It’s what their engineers know, and it’s what their customers know. Both groups know how to tune it, and tune other things to work with it. Moving to something else would not only take a lot of effort but take people out of their comfort zones as well, and it would also open them up to questions about why they expended all those resources on a sub-optimal solution before. Of course they’re going to stick with it, and even double down, because they have no choice. The rest of us do.

UPDATE: Eric Baldeschwieler at Hortonworks has posted their response to the GigaOm article. In my opinion he does a much better job of identifying the qualities that HDFS and any would-be competitors must have, and of measuring the alternatives against those qualities. We might disagree on some points (e.g. sequential I/O patterns are common and generally well optimized, so why the hell should I/O systems design for Hadoop instead of the other way around?) but kudos for a generally excellent response.
