MapReduce in C# using LINQ

I recently remembered reading this article by Dare Obasanjo (Functional Programming in C# 3.0: How Map/Reduce/Filter can Rock your World) a long while ago, which was partly a response to Joel Spolsky’s article (Can Your Programming Language Do This).  In that article, Dare maps the map/reduce/filter functions to the following Linq equivalents:

  • map -> Enumerable.Select
  • reduce -> Enumerable.Aggregate
  • filter -> Enumerable.Where

At the time, I hadn’t really had a business need to play with Linq aside from casual curiosity … all of the projects at work were using other techniques for data access.  But more recently, we’ve been using linq more and more, and I experimented with how a rewrite of a feature might look if it had been written using a functional programming style with LINQ.

My experience with that reminded me of something I had read even longer ago, Google’s famous MapReduce paper (MapReduce: Simplified Data Processing on Large Clusters). Now, most of that paper deals with how they distribute the load across a large cluster of computers.  But near the very beginning, they have some small examples of the types of problems that could easily be solved by mapreduce, and even a sample pseudocode implementation of one of those programs.

map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));

Note the structure of the various inputs and outputs … in particular, the output of map (which serves as the input to reduce).  It’s a key along with an iterator of values that matched that key.  That sounds exactly like what GroupBy does.  Check out the description from this article:

The standard query operators also include the GroupBy operator, which imposes a partitioning over a sequence of values based on a key extraction function.

Using linq, I can implement the mapreduce program described above:

var wordOccurrences = words
                .GroupBy(w => w)
                .Select(intermediate => new
                {
                    Word = intermediate.Key,
                    Frequency = intermediate.Sum(w => 1)
                })
                .Where(w => w.Frequency > 10)
                .OrderBy(w => w.Frequency);

as you can see … based on google’s definition of MapReduce, I think that Dare was incorrect in asserting that select maps to map (no pun intended), and aggregate maps to reduce.  In this example, the GroupBy method is acting as the map, while the Select method does the job of reducing the intermediate results into the final list of results.

The obvious caveat here is that this isn’t distributed, it will run locally in memory.   But even without the massive scaling that google’s mapreduce data center can provide, it can be useful to think of problems in this mindset.  And when the parallel extensions for linq are finally released, you might even be able to easily take advantage of multiple cores.

3 Comments »

  1. George Tsiokos Said,

    February 27, 2009 @ 10:55 am

    Here’s another implementation:

    DryadLINQ
    “the following line of code is a complete implementation of the Map-Reduce computation framework in DryadLINQ: public static IQueryable MapReduce(this IQueryable source, Expression> mapper, Expression> keySelector, Expression, TResult>> reducer) { return source.SelectMany(mapper).GroupBy(keySelector).Select(reducer); } ”

    http://research.microsoft.com/en-us/projects/dryadlinq/default.aspx

    I hope they bring this to Azure…

  2. Maxi Said,

    December 27, 2009 @ 10:47 pm

    I think Linq still has a long way to go to match with MapReduce.
    I don’t think distributed simply means multi-cores but multi-machines.
    I am currently using Linq and try to make it distributed, but its strong connection state will gets in your way since your result set are not independent. I think MapReduce ideas is not only the syntax but the whole architecture.

  3. Joel Martinez Said,

    December 31, 2009 @ 3:13 pm

    yes, you are right that distributed is more than multiple core. However, the point of the article wasn’t to comment on the distributed nature of the popular mapreduce implementations, but to “map” linq operations into the mapreduce paradigm. Map/Reduce is not inherently distributed anyways, that was just google’s implementation :-)

RSS feed for comments on this post

Leave a Comment