Thursday, March 14, 2013

Couchbase Map/Reduce/Rereduce

MapReduce (http://en.wikipedia.org/wiki/MapReduce) has recently become one of my favorite things to talk about in computing. I am currently in charge of building a system that uses Couchbase to store large amounts of denormalized data, and that data needs to be rebuilt frequently depending on many factors. To help others, I thought I'd post a small code sample here with a couple of brief explanations. Enjoy.

// Map

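function(doc, meta) {
  // Only invoice documents contribute rows to this view.
  if (doc.type === "invoice") {
    // Emit one row per line item on the invoice.
    for (var i = 0; i < doc.items.length; i++) {
      var item = doc.items[i];
      emit(
        // Compound key: year, item id, unit
        [ doc.year, item.id, item.unit ],
        // Value: the figures the reduce step will total
        { cost: item.cost, quantity: item.quantity }
      );
    }
  }
}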

This map emits one row per invoice line item, keyed by year, item id, and unit, with the item's cost and quantity as the value. Because the key is an array, the "group" and "group_level" query parameters let me group by year, by year and id, or by all three parts of the compound key should I need to do so.
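To make the grouping idea concrete, here is a minimal sketch in plain JavaScript that mimics what "group_level" does to a compound key. It does not call Couchbase at all: the sample rows and the groupByLevel helper are hypothetical, invented only for illustration, and the real grouping happens inside the server when the view is queried.

```javascript
// Hypothetical rows, shaped like the output of the map function:
// key = [year, item id, unit], value = { cost, quantity }.
var rows = [
  { key: [2012, "A100", "box"],  value: { cost: 10, quantity: 2 } },
  { key: [2012, "A100", "box"],  value: { cost: 5,  quantity: 1 } },
  { key: [2012, "A100", "each"], value: { cost: 1,  quantity: 4 } },
  { key: [2013, "B200", "box"],  value: { cost: 7,  quantity: 3 } }
];

// Hypothetical helper that mimics group_level: keep only the first
// `level` elements of each key and bucket the values under that prefix.
function groupByLevel(rows, level) {
  var groups = {};
  rows.forEach(function (row) {
    var groupKey = JSON.stringify(row.key.slice(0, level));
    (groups[groupKey] = groups[groupKey] || []).push(row.value);
  });
  return groups;
}

var byYear = groupByLevel(rows, 1); // like group_level=1: one group per year
var byItem = groupByLevel(rows, 2); // like group_level=2: year + item id
var exact  = groupByLevel(rows, 3); // full key, like group=true for this view

console.log(Object.keys(byYear), Object.keys(byItem), Object.keys(exact));
```

Each bucket of values is what the reduce function would then be asked to total for that group.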

The next step is to aggregate totals for cost and quantity. It is important in this step that I group by unit because it's possible I may have the same item come up more than once but have a different unit. In that case I'd need to convert to some base unit before aggregating, but I'll save that for later for the sake of keeping this simple.

// Reduce

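function(key, values, rereduce) {
  // Running totals. Both branches below add to the same fields, so the
  // output of one call can be fed back in as input to a rereduce call.
  var result = {
    TotalCost: 0,
    TotalQuantity: 0,
    ItemCount: 0
  };

  for (var i = 0; i < values.length; i++) {
    if (rereduce) {
      // values[i] is a result object returned by an earlier reduce call
      result.TotalCost += values[i].TotalCost;
      result.TotalQuantity += values[i].TotalQuantity;
      result.ItemCount += values[i].ItemCount;
    } else {
      // values[i] is a { cost, quantity } object emitted by the map
      result.TotalCost += values[i].cost;
      result.TotalQuantity += values[i].quantity;
      result.ItemCount += 1;
    }
  }
  return result;
}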

In this example, I am taking advantage of the rereduce parameter that is managed by Couchbase itself. This is important, and it confused me quite a bit for a little while. Couchbase Server uses internal logic to determine whether rereduce is true or false. This is NOT something you provide; what you control is how your reduce function responds to each case.

For more info please see the following link: http://www.couchbase.com/docs/couchbase-devguide-2.0/understanding-custom-reduce.html.

Think of Reduce/Rereduce as two passes: one through the values coming from your map, and another through the results of the reduce logic itself. The first pass takes the values from your map and turns them into partial results. Those partial results are then passed back into your reduce function, this time with the "rereduce" boolean set to true. The rereduce branch of the if statement above combines the partial totals, and the view returns one row per group based on your grouping. In practice the second pass may run more than once, or not at all for a small set of values, so both branches must return a result of the same shape.
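The two passes can be tried out locally with a small sketch. Everything here is an assumption made for illustration: the sample values, the simulate driver, and the chunk size are hypothetical, the key argument is simply passed as null, and the first-pass branch is assumed to accumulate with += so that several values in one call are all counted. Couchbase decides for itself how values are batched; this only shows why the two branches have to agree.

```javascript
// Reduce logic in the style of the post, named so it can be called locally.
// Assumption: the first-pass branch accumulates with +=.
function reduceFn(key, values, rereduce) {
  var result = { TotalCost: 0, TotalQuantity: 0, ItemCount: 0 };
  for (var i = 0; i < values.length; i++) {
    if (rereduce) {
      // Second pass: values are earlier results
      result.TotalCost += values[i].TotalCost;
      result.TotalQuantity += values[i].TotalQuantity;
      result.ItemCount += values[i].ItemCount;
    } else {
      // First pass: values are what the map emitted
      result.TotalCost += values[i].cost;
      result.TotalQuantity += values[i].quantity;
      result.ItemCount += 1;
    }
  }
  return result;
}

// Hypothetical mapped values that all belong to one group.
var mapped = [
  { cost: 10, quantity: 2 },
  { cost: 5,  quantity: 1 },
  { cost: 1,  quantity: 4 },
  { cost: 7,  quantity: 3 }
];

// Hypothetical driver: reduce the values in chunks (rereduce = false),
// then combine the partial results in one rereduce call (rereduce = true).
function simulate(values, chunkSize) {
  var partials = [];
  for (var i = 0; i < values.length; i += chunkSize) {
    partials.push(reduceFn(null, values.slice(i, i + chunkSize), false));
  }
  return reduceFn(null, partials, true);
}

var chunked = simulate(mapped, 2);          // two partials, then a rereduce
var single  = reduceFn(null, mapped, false); // everything in one first pass

console.log(chunked, single);
```

Whichever way the values are split up, the chunked and single-pass totals should match; if they do not, the two branches of the reduce function disagree.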

I would also suggest looking up the Couchbase documentation on how to convert SQL to Map/Reduce. It proved to be very helpful in my case.


