Monday, September 2, 2013

MapReduce Programming Concepts

MapReduce is a simple programming model that almost always comes up when we talk about Hadoop, a framework for storing, retrieving, processing and manipulating data sets that can run to multiple terabytes. Hadoop uses the MapReduce model to process this vast amount of data in parallel across a large set of commodity hardware nodes known as a cluster. MapReduce is particularly suited to aggregating or segregating large data sets in parallel, so that the processing power of every node in the cluster is put to use and results are produced quickly and efficiently. In this blog post I have tried to explain MapReduce programming in the context of the Hadoop framework.
The fundamental data type on which MapReduce operates is the <Key, Value> pair. These Key-Value pairs are analogous to a HashMap in Java, associative arrays in PHP, a Hash in Ruby and so on. A MapReduce job (program) mainly (though not necessarily) consists of two types of processes working in tandem. The first is Map (referred to as the Mapper) and the other is Reduce (referred to as the Reducer). The input data set, which does not necessarily have to be in <Key, Value> pairs, is fed to the Mapper. The input is broken down into independent chunks, converted into <Key, Value> pairs and fed to the Map tasks. The map function is run exactly once for each input pair, in parallel across the nodes, and may break those pairs down further. In Hadoop, each input <K, V> pair can produce zero or more output <K, V> pairs: a single input pair may generate one output pair, several, or none at all. Once this operation is completed on all nodes running Map tasks, Hadoop sorts the output of the Maps and passes it on to the Reduce task(s), which can also run in parallel on various cluster nodes. While sorting the Map output, the values sharing the same key are collected into a list (an array, to be more specific), forming a new pair of the form <K, [V1, V2, V3...]>. These aggregated pairs are then input to the Reduce tasks, which process them further to produce the final <K, V> pairs.
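To make this flow concrete, here is a minimal single-machine sketch in plain Java (my own illustration, not the Hadoop API; the class and method names are hypothetical) of the map, sort/group and reduce phases described above:

import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

public class MapReduceFlow {

    // Map phase: turn one input record into zero or more <K, V> pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
        }
        return pairs;
    }

    // Sort/shuffle phase: group values sharing a key into <K, [V1, V2, ...]>;
    // a TreeMap keeps the keys in sorted order.
    static SortedMap<String, List<Integer>> group(List<Map.Entry<String, Integer>> pairs) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reduce phase: collapse each <K, [V1, V2, ...]> into a final <K, V>.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String record : new String[] {"a b a", "c b"}) {
            mapped.addAll(map(record));
        }
        // Prints <a, 2>, <b, 2>, <c, 1> in sorted key order.
        group(mapped).forEach((k, vs) -> System.out.println("<" + k + ", " + reduce(vs) + ">"));
    }
}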
Let me explain this with an example. I am taking the small and simple WordCount 1.0 program, which can be found on the Hadoop site. This example will get you started with MapReduce programming and give you a feel for how it works.

The WordCount program simply counts the number of occurrences of each word in a given input set (two text files in this case).

So, let's assume we have two text files:-

File01.txt
Hello World Bye World

File02.txt
Hello Hadoop Goodbye Hadoop

Before these input files are fed to the Map tasks, they are broken into chunks split on newlines, so each call to the map function receives exactly one line at a time to process. It further breaks the line down along whitespace and creates a Key-Value pair of the form <(Word), 1> for each word found (see the mapper sketch after the output below). In our case the output of the map phase will be:-

<Hello, 1>
<World, 1>
<Bye, 1>
<World, 1>
<Hello, 1>
<Hadoop, 1>
<Goodbye, 1>
<Hadoop, 1>
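For reference, here is what such a map function looks like in Hadoop's Java API, a minimal sketch modeled on the WordCount 1.0 example from the Hadoop tutorial (assuming the newer org.apache.hadoop.mapreduce API):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // Called once per input record (here, one line of text); emits
    // a <word, 1> pair for every token found on the line.
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}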

Once all the map tasks are done, these Key-Value pairs are sorted by key, grouped, and pushed to the Reduce tasks. The Reducer aggregates the values for each key, in this case by summing the counts (a sketch of this reduce function appears further below), to emit new Key-Value pairs like this:-

<Bye, 1>
<Goodbye, 1>
<Hadoop, 2>
<Hello, 2>
<World, 2>

Note the output: the keys are lexicographically sorted and their values aggregated. This output can be saved to output text files or, depending on your needs, pushed into a database or another data store. In the context of Hadoop, MapReduce programs can have any number of Mapper and Reducer tasks, and this number is configurable. Reducer tasks are completely optional and can be omitted for certain jobs.
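To round the example off, the matching reduce function, again a minimal sketch along the lines of the WordCount 1.0 reducer, simply sums the list of counts collected for each word:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    // Called once per key with all the values collected for it,
    // e.g. <Hadoop, [1, 1]>; emits the final <Hadoop, 2>.
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}

In a full job, a small driver class creates a Job, sets classes like these as the mapper and reducer, points the job at the input and output paths, and submits it to the cluster.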

I hope this small tutorial on MapReduce programming helped you understand the concept and gets you started. In the meantime, if you have any questions or concerns, you can contact me at developer.vineet@gmail.com or comment here. You can also check out this content at my blog http://www.technologywithvineet.com