One of the main concepts that makes big data work in practice is that the data must be processed where they are stored; this is called “data locality”. The data are therefore processed in parallel on all the DataNodes that hold the data we want to process.
A MapReduce algorithm follows a programming model made of two parts: a mapping part and a reducing part.
The mapping step takes the input data and transforms it into <key, value> pairs. The reducing step takes all the <key, value> pairs that share the same key and processes them together.
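The two-phase model can be sketched in a few lines of plain Python (a toy illustration with made-up city/temperature records, not Hadoop's actual API):

```python
from collections import defaultdict

# Map phase: turn each raw record into a <key, value> pair.
def map_record(record):
    city, temperature = record
    return (city, temperature)

# Reduce phase: process all values that share the same key.
def reduce_values(key, values):
    return (key, max(values))  # here, the maximum temperature per city

records = [("Paris", 21), ("Oslo", 14), ("Paris", 25), ("Oslo", 9)]

# Group the mapped pairs by key, then reduce each group.
groups = defaultdict(list)
for key, value in map(map_record, records):
    groups[key].append(value)

results = dict(reduce_values(k, vs) for k, vs in groups.items())
print(results)  # {'Paris': 25, 'Oslo': 14}
```

In a real cluster the mapping runs in parallel on many DataNodes and the grouping is done by the framework; the logic, however, is exactly this.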
How does it work?
A real-life example
To explain how it works, let’s take an example. Imagine that you want to count the number of occurrences of each word in a 100-page book, and you have one hour to do it. How are you going to proceed?
- You ask 26 of your colleagues to come and help you.
- You give each colleague a page of the book and ask them to copy each word onto a separate page, one word per page.
- When a colleague finishes a page, you give them another one. If a colleague falls ill, you take back their page and give it to someone else.
- When all the pages have been processed, you put 26 boxes on a table. Each box represents a letter of the alphabet, and you ask your colleagues to put all the words that begin with the same letter into the same box.
- You ask everyone to take a box and sort its pages alphabetically.
- When the sorting is done, you ask your colleagues to count the pages that carry the same word and to give you the results.
After these six steps, if everyone has followed the rules, you have the number of occurrences of each word in less than one hour.
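The steps above can be simulated in Python (a toy sketch, with short strings standing in for the book's pages):

```python
from collections import Counter, defaultdict

book_pages = ["the cat sat", "the dog sat", "a cat ran"]

# Steps 1-3: each "colleague" takes a page and writes one word per page.
slips = [word for page in book_pages for word in page.split()]

# Step 4: put every word in the box matching its first letter (26 boxes).
boxes = defaultdict(list)
for word in slips:
    boxes[word[0]].append(word)

# Step 5: sort the contents of each box alphabetically.
for letter in boxes:
    boxes[letter].sort()

# Step 6: count identical words in each box and merge the results.
counts = Counter()
for letter in boxes:
    counts.update(boxes[letter])

print(dict(counts))
```

Grouping by first letter is exactly the partitioning step: it guarantees that all copies of the same word end up with the same counter, so the per-box counts can simply be merged at the end.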
An example in Hadoop with the MapReduce framework
The next figure shows the processes involved in a word count distributed over many hosts.
First of all, the input is a text file that will be split row by row. Each row is “mapped” by a host. The mapping step produces many <key, value> pairs; in this case the key is the word and the value is “1”, because each occurrence counts as one word.
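A word-count mapper along these lines might look like this (a Python sketch of the idea, not Hadoop's actual Java API):

```python
def mapper(line):
    """Emit a <word, 1> pair for every word in one input row."""
    return [(word.lower(), 1) for word in line.split()]

pairs = mapper("Deer Bear River deer")
print(pairs)  # [('deer', 1), ('bear', 1), ('river', 1), ('deer', 1)]
```

Note that the mapper does no counting at all: duplicates like “deer” simply produce two separate pairs, and the later steps are responsible for bringing them together.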
Next, all the key-value pairs produced by all the hosts are sorted by key. This is the most complicated part, but fortunately it is handled by the MapReduce library.
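This shuffle-and-sort step can be imitated with a sort followed by a grouping (a sketch using Python's `itertools.groupby`; the real framework does this across the network):

```python
from itertools import groupby

# <key, value> pairs as they might arrive from several mapper hosts.
mapped = [("deer", 1), ("bear", 1), ("river", 1), ("deer", 1), ("bear", 1)]

# Sort by key, then group: each reducer then sees all values for one key.
shuffled = {key: [v for _, v in group]
            for key, group in groupby(sorted(mapped), key=lambda kv: kv[0])}
print(shuffled)  # {'bear': [1, 1], 'deer': [1, 1], 'river': [1]}
```

The sort is what makes the grouping possible: once equal keys are adjacent, every reducer can be handed a complete group, no matter which host originally produced each pair.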
Finally, all the nodes perform a “reduce” task: they count all the key-value pairs that have the same key. The final result is the number of occurrences of each word in the file.
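A matching reducer just sums the values for each key (again a Python sketch of the idea, not the framework's API):

```python
def reducer(key, values):
    """Sum all the counts emitted for one word."""
    return (key, sum(values))

# Grouped pairs as produced by the shuffle-and-sort step.
grouped = {"bear": [1, 1], "deer": [1, 1], "river": [1]}

word_counts = dict(reducer(k, vs) for k, vs in grouped.items())
print(word_counts)  # {'bear': 2, 'deer': 2, 'river': 1}
```

Because every reducer receives the complete list of values for its keys, the reducers can run in parallel on different nodes without ever needing to coordinate with each other.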
JobTracker & TaskTracker
In a MapReduce environment, there are two kinds of nodes: the JobTracker and the TaskTracker. The JobTracker schedules all the map and reduce tasks and assigns them to TaskTrackers. The next figure shows an example of the interaction between the two when a client asks to execute a MapReduce program.