Hadoop distributed file system (HDFS) is a software that plays the role of the file system. It is shared between all the nodes of the cluster and gives the possibility to store huge quantities of data. In theory, there is no limit. In practice many companies have a cluster with several petabytes. The main principle is that all files will be split into several parts of 128Mo, all the parts of the file will be sent to different nodes and each part will be replicated.
The HDFS is based on a master/slave architecture. It has several components that make machinery work well: the name node (the master), the data nodes (slaves), the secondary node and the balancer node. Figure 1, shows us the interaction between some of them.
This role is played by a host which is going to store the data on its local hard disk drives. All the disks have to be configured in the JBOD mode. It means that the OS (Operating Systems) sees all the disks separately without any RAID configuration. The reason is that we do not need a replication of the data on the same host. The replication is done through the different DataNodes (hosts). When a file has to be stored, it is split into several blocks of 128 MB and the blocks are distributed among the data nodes.
It can be interesting to change the size of the blocks to improve the performance of the cluster. There is no instance limit of this role.
This role is unique and played by one host which is the “scheduler” of the architecture and has several responsibilities such as:
- To manage the namespace.
- To maintain the filesystem tree.
- To maintain all the metadata for all files and the localization of each block of files.
- To ensure that each block are replicated on different nodes.
When a client wants to read a file or make any process of any data, the client is going to ask firstly to the name node for the metadata, to know to whitch data node he has to ask the data.
Its job is relatively simple: it looks at the repartition of the data across the cluster and order to move the data if the repartition is not good. It often happens when we had some disk failure or node failure.
In this architecture, if the NameNode fails all the distributed file sytem will be completly blocked! It’s why it’s possible to add a second NameNode that will be in stanby and will take the hand if the actif NameNode has a problem.
All the Hadoop distributions allow you to have only one NameNode at the same time except one! MapR gives you the possibility to have several actif NameNode because they have developed it own distributed file system. It’s completely compatible with the open source HDFS but you have to buy the MapR distribution.