MapReduce is a simple programming model that almost always comes up when we talk about Hadoop, a framework for storing, retrieving, processing and
manipulating large data sets that can run into multiple terabytes. Hadoop uses the MapReduce programming model to
process this vast amount of data in parallel, spread across a large set of
commodity hardware nodes known as a cluster. MapReduce
programming is especially useful for aggregating or segregating a large
data set in a parallel fashion, so that the processing power of all the
nodes in the cluster is put to use and, hence, high efficiency and quick
results can be achieved. In this blog post I have tried to explain MapReduce
programming in the context of the Hadoop framework.
The fundamental data type on which MapReduce
programming operates is the <Key, Value> pair. These key-value pairs
are analogous to a HashMap in Java, associative arrays in PHP, a Hash in Ruby
and so on. A MapReduce job (program) usually (though not necessarily) consists of
two types of tasks working in tandem. The first is Map (referred to as the Mapper) and
the other is Reduce (referred to as the Reducer). The input data set, which does not
necessarily have to be in <Key, Value> pairs, is fed to the Mapper. The
input is broken down into independent chunks, turned into <Key,
Value> pairs, and passed to the Map tasks. The map function is run once for
each input pair, in parallel across the nodes, and may break the <Key, Value>
pairs down further. In Hadoop, each input <K, V> pair can produce zero
or more output <K, V> pairs: one input <K, V> pair may
produce one output pair, several, or none at all. Once this operation is completed on all nodes running Map
tasks, Hadoop sorts the output of the Maps and passes it on to the
Reduce task(s), which may also run in parallel on various
cluster nodes. While sorting the map output, the values belonging to the same key
are collected into a list (an array, to be more
specific), forming a new <K, V> pair of the shape <K, [V1, V2,
V3...]>. This aggregated <K, V> pair is then fed to the Reduce
tasks, which process it further to produce the final <K,
V> pairs.
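To make the shape of these two phases concrete, here is a minimal skeleton of a Mapper and a Reducer written against Hadoop's Java API (org.apache.hadoop.mapreduce). The class names MyMapper and MyReducer are just illustrative placeholders; the base classes, method signatures and Context object are Hadoop's own.

import java.io.IOException;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// map: (KEYIN, VALUEIN) -> zero, one or many (KEYOUT, VALUEOUT) pairs
class MyMapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
        extends Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
    @Override
    protected void map(KEYIN key, VALUEIN value, Context context)
            throws IOException, InterruptedException {
        // emit as many output pairs as needed via context.write(outKey, outValue)
    }
}

// reduce: (K, [V1, V2, V3, ...]) -> final (K, V) pairs
class MyReducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
        extends Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
    @Override
    protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context)
            throws IOException, InterruptedException {
        // the framework has already grouped all values for this key into 'values'
    }
}

The key point is that map() is called once per input pair and may call context.write() any number of times, while reduce() is called once per key with all of that key's values already grouped together.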
Let me explain this with an example. I am taking the
small and simple WordCount 1.0 program, which can be found
on the Hadoop site. This example will get you started with MapReduce programming and
give you a feel for how it works.
The WordCount program simply counts the number of occurrences of each word in
a given input set (two text files in this case).
So, let's assume we have two text files:-
File01.txt
Hello World Bye World
File02.txt
Hello Hadoop Goodbye Hadoop
Before
feeding these input files to the Map tasks, the files are broken into
chunks and, within each chunk, split on newlines, so each call to the map
function receives exactly one line at a time to process. It then breaks the line
down along the whitespace and creates a <Key, Value> pair for each word found,
of the form <(word), 1>. In our case the output of the map tasks will be like:-
<Hello, 1>
<World, 1>
<Bye, 1>
<World, 1>
<Hello, 1>
<Hadoop, 1>
<Goodbye, 1>
<Hadoop, 1>
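For reference, the Mapper from the WordCount 1.0 example looks roughly like this (the official tutorial defines it as a static nested class inside WordCount; it is shown stand-alone here for readability). It tokenizes each line on whitespace and emits <word, 1> for every token, which is exactly what produces the pairs listed above.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Tokenizes each input line and emits <word, 1> for every word found.
class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);   // e.g. <Hello, 1>, <World, 1>, ...
        }
    }
}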
Once all the map tasks are done, these key-value pairs are sorted by key, grouped, and pushed
to the Reduce task(s). The Reducer aggregates these pairs and emits new key-value
pairs like this:-
<Bye, 1>
<Goodbye, 1>
<Hadoop, 2>
<Hello, 2>
<World, 2>
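The matching Reducer, again roughly as in the WordCount 1.0 example, simply sums the list of counts collected for each word and emits <word, total>.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums up the list of counts collected for each word.
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();           // e.g. for Hadoop: 1 + 1 = 2
        }
        result.set(sum);
        context.write(key, result);     // e.g. <Hadoop, 2>
    }
}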
Note
the output: the keys are lexicographically sorted and their values aggregated. This
output can be saved to output text files or, depending on your needs, it can
be pushed into a database or some other data store. In the context of Hadoop,
a MapReduce program can have any number of Mapper and Reducer tasks, and that number is
configurable. Reducer tasks are completely optional and can be omitted for
certain jobs.
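For completeness, here is a sketch of the driver that wires the Mapper and Reducer together, closely following the WordCount 1.0 tutorial. The setNumReduceTasks() call is included only to show where the number of Reducers is configured; setting it to 0 skips the Reduce phase entirely and writes the map output directly.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        // Optional: run the reducer locally on each node's map output as a combiner.
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        // Configurable; 0 would omit the Reduce phase altogether.
        job.setNumReduceTasks(1);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. dir containing File01.txt, File02.txt
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}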
I hope this small tutorial on MapReduce programming
helped you understand the concept and got you started. If you have any questions/concerns, you can contact me at developer.vineet@gmail.com or you can comment here. You can also check out this content at my blog http://www.technologywithvineet.com