MapReduce in C# using LINQ

I recently remembered reading this article by Dare Obasanjo (Functional Programming in C# 3.0: How Map/Reduce/Filter can Rock your World) a long while ago, which was partly a response to Joel Spolsky's article (Can Your Programming Language Do This). In that article, Dare maps the map/reduce/filter functions to the following Linq equivalents:

map -> Enumerable.Select
reduce -> Enumerable.Aggregate
filter -> Enumerable.Where

At the time, I hadn't really had a business need to play with Linq aside from casual curiosity ... all of the projects at work were using other techniques for data access. But more recently, we've been using linq more and more, and I experimented with how a rewrite of a feature might look if it had been written using a functional programming style with LINQ.

My experience with that reminded me of something I had read even longer ago, Google's famous MapReduce paper (MapReduce: Simplified Data Processing on Large Clusters). Now, most of that paper deals with how they distribute the load across a large cluster of computers. But near the very beginning, they have some small examples of the types of problems that could easily be solved by mapreduce, and even a sample pseudocode implementation of one of those programs.

map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));

Note the structure of the various inputs and outputs ... in particular, the output of map (which serves as the input to reduce). It's a key along with an iterator of values that matched that key. That sounds exactly like what GroupBy does. Check out the description from this article:

The standard query operators also include the GroupBy operator, which imposes a partitioning over a sequence of values based on a key extraction function.

Using linq, I can implement the mapreduce program described above:

var wordOccurrences = words
                .GroupBy(w => w)
                .Select(intermediate => new
                {
                    Word = intermediate.Key,
                    Frequency = intermediate.Sum(w => 1)
                })
                .Where(w => w.Frequency > 10)
                .OrderBy(w => w.Frequency);

as you can see ... based on google's definition of MapReduce, I think that Dare was incorrect in asserting that select maps to map (no pun intended), and aggregate maps to reduce. In this example, the GroupBy method is acting as the map, while the Select method does the job of reducing the intermediate results into the final list of results.

The obvious caveat here is that this isn't distributed, it will run locally in memory. But even without the massive scaling that google's mapreduce data center can provide, it can be useful to think of problems in this mindset. And when the parallel extensions for linq are finally released, you might even be able to easily take advantage of multiple cores.