Map Reduce: What is it all about? How does it work in Hadoop?


Map Reduce is a programming model used to process large data sets in parallel. It can be viewed as a three-stage process: map, shuffle and reduce. One can think of Map Reduce as loosely analogous to merge sort or quick sort, where there is a splitting step followed by a joining step.

Map Reduce is a concept inspired by functional programming languages. Don't be scared, though: you don't need to know a functional language to read this. In functional languages, functions can be passed as arguments to other functions. Let's see what this means using an example of map and reduce.

Suppose we want to double all numbers stored in an array. A simple program would be:

int[] array = {1, 2, 3};

for (int i = 0; i < array.length; i++)
    array[i] = array[i] * 2;

A map-style version of this simple code, the form that Hadoop's Map phase borrows from, would be:

map(function(x) { return x * 2; }, array)

Alright, you may think this only saves a few lines of code, but that is not its only use. Here we have an array with just three elements; imagine a situation in which you have billions of elements to process and thousands of machines available. Because the function is applied to each element independently, every machine can run the map operation on its own slice of the data in parallel and thus save a lot of time.
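As a rough illustration of the same idea on a single machine, here is a small Java sketch (plain Java parallel streams, not Hadoop code) that doubles the elements of an array; the runtime splits the work across CPU cores much as Hadoop splits it across machines.

import java.util.Arrays;

public class ParallelDouble {
    public static void main(String[] args) {
        int[] array = {1, 2, 3, 4, 5, 6};

        // Apply the "map" function (x -> x * 2) to every element in parallel.
        int[] doubled = Arrays.stream(array)
                              .parallel()
                              .map(x -> x * 2)
                              .toArray();

        System.out.println(Arrays.toString(doubled)); // [2, 4, 6, 8, 10, 12]
    }
}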

A reduce operation is similar to a map operation. Let's see a very easy example. Now that you have doubled every element using the map operation, suppose you want to add up all these values. What do you do? You could write a small piece of code:

int sum = 0;

for (int i = 0; i < array.length; i++)
    sum = sum + array[i];

The same piece of code can be written as:

reduce(function(a, b) { return a + b; }, array, 0)

This call says that reduce takes a running total a and the next element b, adds them, and carries the result forward as the new running total; the trailing 0 is the initial value. Now that we have seen what map and reduce functions are, it is time to see how they are implemented in Hadoop. In addition to map and reduce there is the all-important shuffle phase, which we will discuss in detail in another post.
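To give a feel for the Hadoop side, here is a minimal, hedged sketch of a Mapper and a Reducer written against Hadoop's org.apache.hadoop.mapreduce API that mirror the double-and-sum example above; the class names and the single shared "sum" key are illustrative choices of mine, not code from this post.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: read one number per input line, emit (constant key, number * 2).
class DoubleMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final Text KEY = new Text("sum");

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        int value = Integer.parseInt(line.toString().trim());
        context.write(KEY, new IntWritable(value * 2));
    }
}

// Reduce phase: add up all the doubled values that share the same key.
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}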

Let's look at what really happens when a Map Reduce job is submitted to a Hadoop cluster. There are eight steps that take place (a minimal driver sketch that triggers this sequence is shown after the list):

1. The Map Reduce code we write sets up a job client, which submits the job and coordinates running it.

2. The job client asks the job tracker for a new job ID for this job.

3. Next, the job client copies all the resources the job needs, such as the JAR file containing the code we have written, to a shared file system.

4. Once all the required files are available in the shared file system, which is usually HDFS, the job client tells the job tracker that the job is ready to be started.

5. The job tracker then retrieves the input splits (slices of the input data, not code snippets) from the HDFS shared file system and plans one map task per split, scheduling them to increase efficiency and throughput.

6. Remember from the previous post that each task tracker continuously sends heartbeat signals to the job tracker, reporting its status and indicating whether it is ready for new tasks.

7. Now that the job tracker has tasks to hand out, it allocates either a map task or a reduce task to a task tracker, which in turn fetches its share of the input data and the job resources from the HDFS shared file system.

8. After receiving its input split, the task tracker launches a separate child JVM in which the map or reduce task runs, so that a misbehaving task does not bring down the task tracker itself.
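To connect these steps to code, here is a minimal, hedged driver sketch, assuming a Hadoop 2.x style API and the DoubleMapper and SumReducer classes sketched earlier; submitting such a driver is what triggers steps 1 to 4, after which the framework carries out steps 5 to 8.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DoubleAndSumJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "double and sum");

        job.setJarByClass(DoubleAndSumJob.class);   // the JAR copied to the shared file system (step 3)
        job.setMapperClass(DoubleMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // input from which the splits are computed
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submits the job (steps 1 to 4) and waits while the framework runs the tasks (steps 5 to 8).
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}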

[Diagram: sequence flow of the eight job-submission steps, from the job client to the job tracker and task trackers]

Hadoop Architecture


Hadoop's architecture is based on HDFS (the Hadoop Distributed File System) and the Map-reduce paradigm. As already discussed, Map-reduce is a programming model for processing large data sets with distributed, parallel algorithms. Map-reduce consists of two types of transformations: a mapper and a reducer. The map phase is divided into many map tasks that can run simultaneously, and similarly the reduce phase is divided into many reduce tasks that can be executed in parallel.

A detailed discussion of the Map-reduce paradigm will follow in forthcoming posts. Let us focus this post on the essence of the Hadoop architecture. The architecture is broadly divided into HDFS nodes and Map-reduce nodes, classified as follows:

  • HDFS Nodes
  1. Name Nodes: name nodes store all the metadata, including the location/address of every data block.
  2. Data Nodes: data nodes store the blocks of data themselves. The concept of HDFS and its blocks has already been discussed in the previous posts.

  • Map Reduce Nodes
  1. Job Tracker Nodes
  2. Task Tracker Nodes

The job tracker and task tracker nodes are described in more detail below.

A job comprises many tasks. It is therefore fair to say that a job tracker tracks what the task trackers are doing and gets the job done through them.

A task tracker is a node in a Hadoop cluster that accepts tasks (map and reduce operations) from a job tracker. Every task tracker has a fixed number of slots for tasks. When the job tracker receives a job from a client, it looks for a task tracker to run each task, preferring one that resides on the same server as the data node holding the relevant block. If none of that task tracker's slots is free, the job tracker looks for a task tracker on a server in the same rack. This is how Hadoop uses the concept of rack awareness discussed before. Every task tracker periodically sends a heartbeat to the job tracker to confirm it is alive, and in the process also reports its number of empty slots. A simplified sketch of this order of preference is given below.
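This is my own simplified Java model of the same-node, then same-rack, then anywhere preference; the TaskTrackerInfo and LocalityAwareScheduler classes are hypothetical and not Hadoop's real scheduler code.

import java.util.List;

class TaskTrackerInfo {
    String host;       // server the task tracker runs on
    String rack;       // rack that server belongs to
    int freeSlots;     // empty task slots reported in the last heartbeat

    TaskTrackerInfo(String host, String rack, int freeSlots) {
        this.host = host;
        this.rack = rack;
        this.freeSlots = freeSlots;
    }
}

class LocalityAwareScheduler {
    // Pick a task tracker for a split whose data block lives on dataHost / dataRack.
    static TaskTrackerInfo pickTaskTracker(List<TaskTrackerInfo> trackers,
                                           String dataHost, String dataRack) {
        // 1. Prefer a tracker on the same server as the data node.
        for (TaskTrackerInfo t : trackers)
            if (t.freeSlots > 0 && t.host.equals(dataHost)) return t;
        // 2. Otherwise, prefer a tracker on the same rack.
        for (TaskTrackerInfo t : trackers)
            if (t.freeSlots > 0 && t.rack.equals(dataRack)) return t;
        // 3. Otherwise, take any tracker with a free slot.
        for (TaskTrackerInfo t : trackers)
            if (t.freeSlots > 0) return t;
        return null; // no free slots anywhere; the task has to wait
    }
}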

A job tracker accepts jobs from a client. It then talks to the name node to find out where the data that needs processing is located. The job tracker allocates tasks to task trackers, which are responsible for executing them. Many task trackers work simultaneously, executing tasks in parallel. When a piece of work is completed, the job tracker updates its status, and the client is free to query the job tracker about the status of the job.

[Diagram: how the different nodes are positioned in the cluster]

[Diagram: the overall interaction between these nodes]

Hope you are not lost. Forthcoming posts will deal with the Map-reduce paradigm in detail. Continue reading, and please post your valuable comments.

HDFS (Hadoop Distributed File System)


The Hadoop Distributed File System, as the name suggests, is a file system for Hadoop that can run on commodity hardware. HDFS works as a layer above the normal file system of each computer. It divides the data into equal-sized blocks; the block size can be customized but is typically 128 MB. HDFS is fault tolerant and also uses storage space efficiently. For instance, 576 MB of data is divided as 128*4 + 64*1: instead of five full 128 MB blocks it occupies four 128 MB blocks and one 64 MB block, so the final, partially filled block does not waste the remaining space.
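To make this arithmetic concrete, here is a small Java sketch (my own illustrative helper, not HDFS code) that computes the block sizes a file of a given length occupies:

import java.util.ArrayList;
import java.util.List;

public class BlockSplit {
    // Split a file of 'fileSize' bytes into HDFS-style blocks of at most 'blockSize' bytes.
    static List<Long> blockSizes(long fileSize, long blockSize) {
        List<Long> blocks = new ArrayList<>();
        long remaining = fileSize;
        while (remaining > 0) {
            long block = Math.min(remaining, blockSize);
            blocks.add(block);
            remaining -= block;
        }
        return blocks;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // 576 MB with a 128 MB block size -> 128, 128, 128, 128, 64 (MB)
        for (long b : blockSizes(576 * mb, 128 * mb)) {
            System.out.println(b / mb + " MB");
        }
    }
}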

In addition, HDFS is fault tolerant. For every block of data, two other replicas are stored on two different data nodes, so by default each block exists in three copies. This replication factor of 3 is the default and can be modified based on the importance of the data.
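As a hedged illustration, the replication factor can be controlled through the standard dfs.replication property and the FileSystem API; the file path below is a made-up example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Default replication factor for newly created files (cluster-wide it is
        // normally set in hdfs-site.xml via the dfs.replication property).
        conf.setInt("dfs.replication", 3);

        FileSystem fs = FileSystem.get(conf);

        // Raise the replication factor of one important file to 5 copies.
        fs.setReplication(new Path("/data/important-file.txt"), (short) 5);

        fs.close();
    }
}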

HDFS mainly has two types of nodes: name nodes and data nodes. A name node can be thought of as the index of a book: it stores all the metadata, including which blocks make up each file and which data nodes hold them. Data nodes are the ones that actually store the blocks of data, and a single data node may hold many blocks. When a client asks for a section of data, the name node provides the location of the relevant data blocks, and the client then reads the blocks directly from the data nodes. From the client's point of view this is hidden behind the file system API, as sketched below.
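Here is a minimal, hedged sketch of reading a file through Hadoop's FileSystem API; the fs.defaultFS address and the file path are placeholders of mine. The name node lookup described above happens behind the fs.open() call.

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020"); // placeholder name node address

        FileSystem fs = FileSystem.get(conf);

        // Opening the file asks the name node for the block locations;
        // the actual bytes are then streamed from the data nodes.
        try (FSDataInputStream in = fs.open(new Path("/data/input.txt"));
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }

        fs.close();
    }
}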

Hadoop Cluster

Before proceeding further, it is essential to discuss the topology in which these data nodes are arranged.

Many data nodes constitute a rack, and many such racks constitute a Hadoop cluster. Data communication within a rack (intra-rack) is faster than communication between racks (inter-rack). Hadoop exploits this fact, and the concept is called rack awareness.

The Hadoop architecture and Map reduce paradigm will be discussed in the upcoming posts.

Hadoop


Hadoop is a software framework that is extensively used for distributed application processing. Hadoop was created by Doug Cutting and named, interestingly, after his son's toy elephant called Hadoop. Hadoop has two main components: HDFS (the Hadoop Distributed File System) and the Map Reduce paradigm. HDFS is inspired by and based on a paper published by Google on their Google File System (GFS). Similarly, Map-reduce is based on another Google paper describing their MapReduce technology. Both these technologies will be explained and illustrated in the next posts.

Hope you are enjoying reading this.

Big Data and Hadoop — What can we do with it?


Big data tools can't be used for OLTP-style information systems, where random access is required, as in transactional databases. Nor can they be used for OLAP-style information systems, where sections of data need to be constantly analysed for business intelligence. So where can they be used?

Big data and Hadoop fit where batch jobs need to be performed on sequentially read data for analysis. Typically the information from different warehouses is stored in a distributed manner, and this large volume of data is divided into large blocks that are processed individually, each read sequentially.

As can be seen, the main aim when using this kind of paradigm is to limit the number of seek operations and to avoid doing lots of computation on small blocks of data. The concept of Hadoop and how it can be used is covered in the next post.

Hope you are not getting bored. Please do keep reading.

Big Data: What is it all about?


Big Data, as the name suggests, refers to very large volumes of data. So what is it, and why does it get so much attention today? To answer these questions we need to go back in time and trace where it came from.

During the 1980s and early 1990s, data storage was a huge problem. We had magnetic tapes; I remember seeing the big old floppy discs and then the smaller 3.5-inch ones that could carry about a megabyte of data. Then came compact discs (CDs), then DVDs, external hard drives and so on. With the advent of cloud-related technologies, storing data is no longer a challenge or a problem. It is believed that in the coming years we are going to generate and store double the data that we are generating today.

Now that data storage is no longer a problem, companies are creating warehouses where all their historical data is stored compactly and used for analytical purposes. This data comes from a number of sources located at different geographic locations. In addition, much of today's data is not structured, i.e. it cannot be stored in a table-like format: the types of data that need to be stored vary so widely that creating a table with so many attributes is out of the question. Therefore relational databases queried with SQL are being phased out, and unstructured ways of storing data, such as XML, are extensively used.

Analyzing data at such a massive scale, unstructured and spread across servers that are countries apart, is a huge challenge. This is where the concept of Big Data comes in: how can such data be loaded into a warehouse using ETL processes, and how can algorithms be run on it to extract useful information? This question is answered in the next post.

Keep reading.

CUDA BASED PROGRAMMING GUIDE

Programming in CUDA is very simple. Anyone who is comfortable with C or C++ can pick up CUDA within hours. The language we use for CUDA is CUDA C. Since explaining everything here would take too long, I am providing a link to the guide below, which will give you a clear idea of how programming in CUDA is done and how N threads are executed simultaneously.

NVIDIA_CUDA_ProgrammingGuide

Hope this link gives you an idea of how to program in CUDA. Whenever you have any doubts, do not feel shy about contacting me.

WELCOME

Hello everyone, welcome to my blog. This blog is intended to share some interesting facts and to serve as a platform for sharing knowledge and expertise in the emerging fields of data mining, big data and distributed computing. It is also meant to host active discussion of ongoing developments in computer science. Please feel free to comment and ask questions on the posts.