A few nice posts about the distributed parts of Mahout implementations

Gluster
2014-02-24

In this post I’ll try to keep a running tally of the distinctions between distributed and non-distributed algorithm implementations in Mahout, as they can be tricky to keep track of. When using Mahout, it’s important to know which implementation of an algorithm you are using. A little-known fact about Mahout is that it has both distributed and single-machine-optimized implementations of particular algorithms (e.g. naive bayes only runs on clusters, while logistic regression in Mahout is optimized for single machines).

Meanwhile, a lot of Mahout tutorials on the web attempt to explain at least two of the following three very deep topics:

  1. Machine Learning
  2. MapReduce 
  3. Statistics 

all in a matter of a few pages. This is IMPOSSIBLE. It’s also confounding.

For example, if you want to create a distributed recommender, you might follow this (otherwise excellent) tutorial: http://kickstarthadoop.blogspot.com/2011/05/generating-recommendations-with-mahout_26.html. However, even though the blog post is titled “kickstart hadoop”, it actually explains the NON-Hadoop implementation. D’oh!

So, in general, if you’re new to Mahout and you’re planning on processing terabytes (or more) of data, you will want to be careful to use the Hadoop-specific, scalable Mahout APIs when writing your Mahout jobs, and not just the in-memory ones, which are TOTALLY different.

So, captain obvious here will point out some bullets for you. If a Mahout job is going to scale, then it (see the launch sketch after this list):

  1. Obviously will not point to any file:// URLs.
  2. Won’t use much memory per process, and will be broken into several Mapper/Reducer stages.
  3. Will have a reference to hadoop somewhere in its package names.
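
For concreteness, here is a minimal sketch of what such a launch looks like, using the Hadoop RecommenderJob discussed further below. This assumes a Mahout 0.x jar plus a Hadoop classpath, and the HDFS paths are hypothetical placeholders:

    import org.apache.hadoop.util.ToolRunner;
    // Note the ".hadoop." in the package name (bullet 3):
    import org.apache.mahout.cf.taste.hadoop.item.RecommenderJob;

    public class ScalableMahoutLaunch {
      public static void main(String[] args) throws Exception {
        // Spawns a chain of Mapper/Reducer stages (bullet 2), reading and
        // writing HDFS paths rather than file:// URLs (bullet 1).
        ToolRunner.run(new RecommenderJob(), new String[] {
            "--input", "/user/me/ratings",          // userID,itemID,rating lines
            "--output", "/user/me/recommendations",
            "--similarityClassname", "SIMILARITY_COOCCURRENCE"
        });
      }
    }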

Ironically, docs are not quite as easy to find for the distributed Mahout implementations… So, here are some posts on the distributed variety of Mahout jobs that I dug up.

Topic Extraction

Here’s one of the better resources I found for understanding how to do distributed topic extraction: http://odbms.org/download/TamingTextCH06.pdf. This is an excerpt from the Taming Text book. In particular, it describes the fact that you will need to create vectors first. You might want to look into http://stackoverflow.com/questions/13663567/mahout-csv-to-vector-and-running-the-program while you’re at it. I still haven’t found a good library for creating SequenceFile vectors in parallel, i.e. as part of a MapReduce pipeline.
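
As a rough sketch of that vector-creation step, driven from Java rather than the mahout shell script, something like the following should work on a Mahout 0.9-era classpath (where both drivers implement Hadoop’s Tool interface); the directory names are hypothetical:

    import org.apache.hadoop.util.ToolRunner;
    import org.apache.mahout.text.SequenceFilesFromDirectory;
    import org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles;

    public class VectorizeForTopics {
      public static void main(String[] args) throws Exception {
        // Step 1: raw text files -> SequenceFiles of <docID, text>.
        ToolRunner.run(new SequenceFilesFromDirectory(), new String[] {
            "-i", "/user/me/docs", "-o", "/user/me/seqfiles"});
        // Step 2 (seq2sparse, a MapReduce pipeline): SequenceFiles ->
        // sparse TF-IDF vectors, the input that topic modeling expects.
        ToolRunner.run(new SparseVectorsFromSequenceFiles(), new String[] {
            "-i", "/user/me/seqfiles", "-o", "/user/me/vectors",
            "-wt", "tfidf", "--namedVector"});
      }
    }

Note that seq2sparse starts from SequenceFiles of raw text, not CSV, so it only partially addresses the CSV-to-vector question above.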

Classification 

As classification is hot, and algorithmically complex, it’s easy to get lost in articles that describe ROC curves, correlation coefficients, etc. The reality is that most of us already know a little about this kind of thing, and in any case, learning about Mahout and ROC curves at the same time (for those with experience in neither) is likely to cause an aneurysm. If you want to get straight to the distributed part of things, check out
http://chimpler.wordpress.com/2013/06/24/using-the-mahout-naive-bayes-classifier-to-automatically-classify-twitter-messages-part-2-distribute-classification-with-hadoop/, which links back to a first primer article about how to create the training set (non-distributed). Combining the content of those two articles yields a pretty simple classification workflow where the “big” portion is implemented in MapReduce. Creating the model doesn’t really need MapReduce quite as much.
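
For the “big” training portion, the MapReduce trainer lives in org.apache.mahout.classifier.naivebayes.training. Here is a hedged sketch of launching it, with hypothetical paths, assuming vectorized, label-keyed input like the chimpler article produces:

    import org.apache.hadoop.util.ToolRunner;
    import org.apache.mahout.classifier.naivebayes.training.TrainNaiveBayesJob;

    public class TrainNB {
      public static void main(String[] args) throws Exception {
        ToolRunner.run(new TrainNaiveBayesJob(), new String[] {
            "-i", "/user/me/tweet-vectors", // TF-IDF vectors keyed by label
            "-o", "/user/me/nb-model",      // where the model is written
            "-li", "/user/me/labelindex",   // label index output
            "-el",                          // extract labels from the input keys
            "-ow"                           // overwrite any existing output
        });
      }
    }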

Note that the naive bayes implementation DOESN’T exist for single machines, whereas the logistic regression implementation is optimized for single-machine classification.
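
By contrast, the logistic regression (SGD) API is purely sequential and in-memory, with no Hadoop involved. A minimal sketch on a made-up two-feature toy problem:

    import org.apache.mahout.classifier.sgd.L1;
    import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
    import org.apache.mahout.math.DenseVector;
    import org.apache.mahout.math.Vector;

    public class SingleMachineLR {
      public static void main(String[] args) {
        // 2 categories, 2 features, L1 prior - all on the local heap.
        OnlineLogisticRegression lr =
            new OnlineLogisticRegression(2, 2, new L1()).learningRate(0.1);
        // Toy training data: label is 1 iff the first feature is large.
        lr.train(1, new DenseVector(new double[] {5.0, 0.0}));
        lr.train(0, new DenseVector(new double[] {0.1, 1.0}));
        // classifyFull returns a vector of per-category probabilities.
        Vector p = lr.classifyFull(new DenseVector(new double[] {4.0, 0.5}));
        System.out.println("P(class=1) = " + p.get(1));
      }
    }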

Recommendations

There are two ways to do recommendation in Mahout: a distributed and a non-distributed code path. The non-distributed code path is somehow covered ALL OVER THE PLACE, but alas, if we really needed Mahout, we probably wouldn’t want a non-parallel implementation. Thankfully this post: http://ssc.io/deploying-a-massively-scalable-recommender-system-with-apache-mahout/ covers some of the finer details of distributed versus non-distributed recommenders with Hadoop. In particular, it’s the https://builds.apache.org/job/Mahout-Quality/javadoc/org/apache/mahout/cf/taste/hadoop/item/RecommenderJob.html implementation that you want to use.
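
To make the contrast concrete: the in-memory code path looks like the sketch below (the whole data model lives on the local heap, so it’s fine for small data only), while the scalable path is the ToolRunner launch of RecommenderJob shown earlier. The ratings.csv file name is hypothetical; the format is userID,itemID,rating lines.

    import java.io.File;
    import java.util.List;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

    public class InMemoryRecommender {
      public static void main(String[] args) throws Exception {
        // Everything below runs in a single JVM - no Mappers, no Reducers.
        DataModel model = new FileDataModel(new File("ratings.csv"));
        ItemSimilarity similarity = new PearsonCorrelationSimilarity(model);
        GenericItemBasedRecommender recommender =
            new GenericItemBasedRecommender(model, similarity);
        // Top 10 recommendations for user 42.
        List<RecommendedItem> items = recommender.recommend(42L, 10);
        for (RecommendedItem item : items) {
          System.out.println(item.getItemID() + " " + item.getValue());
        }
      }
    }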
