
Hadoop: HDFS Optimizations

The default configuration of a Hadoop distribution is not optimal for every application. Hadoop exposes a large number of settings that modify the behavior of the cluster. All of these settings have a default value provided by Hadoop, but the distributions sometimes change them to get good performance regardless of the application running on your cluster.

In this article I will explain some of the settings that you can customize to improve the performance of HDFS, and the impact of each of them.

Before reading this article, I recommend you read the following article to fully understand this post.

Heap Size of the Namenode and Datanode

 

The NameNode and the DataNode are applications that run in a JVM. For each JVM you have to define the maximum amount of memory that can be allocated to the application. To determine the heap size of the NameNode, you need to know the maximum number of data blocks your cluster will store; the common rule of thumb is 1GB of heap per million data blocks. For the DataNode, it depends on your workload and especially on the amount of data that can be sent at the same time to the same DataNode: the heap acts as a buffer when the disk throughput is not sufficient. Cloudera and Hortonworks let you change these settings directly from the graphical interface. For MapR you need to change them in the following file: /opt/mapr/conf/warden.conf.
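As an illustration, on a plain Apache Hadoop installation these heap sizes are usually set through environment variables in hadoop-env.sh; the 4GB and 1GB values below are only placeholders to adapt to your own block count and workload.

# $HADOOP_HOME/conf/hadoop-env.sh (values are examples only)
export HADOOP_NAMENODE_OPTS="-Xmx4g $HADOOP_NAMENODE_OPTS"
export HADOOP_DATANODE_OPTS="-Xmx1g $HADOOP_DATANODE_OPTS"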

 

Size of the blocks

When a file is stored in HDFS, it is split into blocks of a specific size. This size can have a big impact on performance if you have to read a file several times. For example, suppose you have a block size of 128MB and you want to read a 10GB file from a client outside the cluster. The file has been split into 79 blocks saved across the cluster, so to read it, HDFS needs to fetch 79 blocks from different locations in the cluster to rebuild the file. It also means that there are 79 entries in the metadata table on the NameNode.

If you change the HDFS block size to 256MB, you halve the number of entries in your metadata table and also halve the number of connections needed to get all the blocks.

Why not set the block size to 10GB?

Because it means that when you read the file from an external client, you will only have one connection. If the file is split across more servers, it can be rebuilt faster because the transfers can be done in parallel. That is why, when you set the block size, you need to know the average size of your files: the goal is to avoid too many simultaneous connections to the client while still having enough parallelism to get a file back as fast as possible.

Moreover, you need to remember that in this environment the data is processed where it is located. So with a block size of 10GB, the file will sit on at most 3 servers if you have a block replication factor of 3. The data would then be processed on only three servers, or it would have to be sent across the network to the other servers to be processed, and you would lose a lot of time.

In practice, the common block size is 128MB or 256MB. To change it, edit the following property in the “$HADOOP_HOME/conf/hdfs-site.xml” file (the value is expressed in bytes, so 134217728 corresponds to 128MB).

<property>
<name>dfs.block.size</name>
<value>134217728</value>
</property>
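If the command accepts the generic -D option (most hadoop fs commands parsed through GenericOptionsParser do), you can also override the block size for a single file at upload time; the file name and path below are just placeholders.

hadoop fs -D dfs.block.size=268435456 -put bigfile.csv /data/bigfile.csv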
 

Block replication

I think this is the setting that has the biggest impact on the performance of writing data into HDFS. If the block replication is set higher than 1, then as soon as a block of the file has been written, replication begins and the block is read again to be sent to the other servers. A hard disk cannot read and write at the same time, so you lose write performance.

As you can see in the DFSIO benchmark, the write throughput can be multiplied by two or three depending on the distribution.

The replication factor is controlled by the dfs.replication property (the Hadoop default is 3), which you can change in the “$HADOOP_HOME/conf/hdfs-site.xml” file. This setting applies to all new files unless a different replication factor was specified when the file was created.

<property>
<name>dfs.replication</name>
<value>3</value>
</property>
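You can also change the replication factor of files that already exist with the standard setrep command; the -w flag waits for the replication to complete, and the path below is just a placeholder.

hadoop fs -setrep -w 2 /data/myfile.csv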

 

Compression of the data

If you have a huge quantity of data to write/read, you can compress it to save disk space and to write/read faster, at the cost of more CPU usage. Compression also helps reduce network bandwidth utilization.

There are different compression codecs that HDFS can use; they are listed in Table 2: Compression codecs.

Codec name      Java class
DefaultCodec    org.apache.hadoop.io.compress.DefaultCodec
GzipCodec       org.apache.hadoop.io.compress.GzipCodec
BZip2Codec      org.apache.hadoop.io.compress.BZip2Codec
SnappyCodec     org.apache.hadoop.io.compress.SnappyCodec
LzoCodec        org.apache.hadoop.io.compress.LzoCodec

Table 2: Compression codecs
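Note that some of these codecs (Snappy and LZO in particular) rely on native libraries that are not shipped with every distribution. On recent Hadoop releases you can usually list the native codecs that are actually available with the following command; older versions may not include it.

hadoop checknative -a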

To configure the compression you need to change some settings in the file “$HADOOP_HOME/conf/mapred-site.xml”.

  • Enable compression of the job output:
<property>
<name>mapred.output.compress</name>
<value>true</value>
</property>
 
  • Specify the compression codec:
 <property>
<name>mapred.output.compression.codec</name>
<value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>

 You can change the value to any of the entries in Table 2: Compression codecs.

 

  • Change the compression type (NONE, RECORD or BLOCK; this applies to SequenceFile outputs):
 <property>
<name>mapred.output.compression.type</name>
<value>BLOCK</value>
</property>

 

After any change to the mapred-site.xml file, you need to restart the MapReduce service for the changes to take effect.
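If you only want to compress the output of one specific job, you can usually pass the same properties on the command line through the generic -D options instead of editing mapred-site.xml, as long as the job driver uses ToolRunner/GenericOptionsParser (the bundled examples do); the jar name and paths below are placeholders.

hadoop jar hadoop-examples.jar wordcount \
 -D mapred.output.compress=true \
 -D mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
 /input /output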

 

Balancing the data blocks

 

When a file is saved in HDFS, it is split into multiple data blocks. If you have 6 nodes but the data blocks are spread across only 3 of them, then when you process the data only 3 nodes will have it in their local storage; the others will have to fetch it across the network, which takes more time. You can balance the blocks across the nodes, but also across the local disks, to increase the throughput.

From my experience, all the distributions automatically balance the data blocks when they are created. If you want to trigger a rebalancing of the data blocks, run the following command on the NameNode.

hadoop balancer -threshold 0.2

The threshold option defines the maximum difference allowed between the disk usage of a DataNode and the average usage of the cluster, expressed as a percentage of disk capacity.
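To see how the data is currently spread across the DataNodes, before or after a rebalancing, you can use the standard dfsadmin report, which prints the DFS usage of each node.

hadoop dfsadmin -report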

There is a setting in HDFS that lets you limit the bandwidth that can be used during the rebalancing. You can configure it in “$HADOOP_HOME/conf/hdfs-site.xml” with the following property.

<property>
<name>dfs.balance.bandwidthPerSec</name>
<value>10485760</value>
</property>
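The value is expressed in bytes per second (10485760 = 10MB/s). On most versions you can also change this limit at runtime, without restarting HDFS, with the dfsadmin command; the value below is just an example.

hadoop dfsadmin -setBalancerBandwidth 20971520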

HDFS

The Hadoop Distributed File System (HDFS) is the software that plays the role of the file system. It is shared between all the nodes of the cluster and makes it possible to store huge quantities of data; in theory there is no limit, and in practice many companies run clusters of several petabytes. The main principle is that every file is split into several parts of 128MB, the parts are sent to different nodes, and each part is replicated.

Architecture

HDFS is based on a master/slave architecture. It has several components that make the machinery work well: the NameNode (the master), the DataNodes (the slaves), the secondary NameNode and the balancer node. Figure 1 shows the interaction between some of them.

Figure 1: HDFS architecture

DataNode

This role is played by a host that stores the data on its local hard disk drives. All the disks have to be configured in JBOD mode, which means the OS (operating system) sees each disk separately, without any RAID configuration. The reason is that we do not need replication of the data on the same host: replication is done across the different DataNodes (hosts). When a file has to be stored, it is split into several blocks of 128MB and the blocks are distributed among the DataNodes.

It can be interesting to change the size of the blocks to improve the performance of the cluster. There is no limit on the number of instances of this role.

NameNode

This role is unique and played by one host, which is the “scheduler” of the architecture and has several responsibilities such as:

  •     To manage the namespace.
  •     To maintain the filesystem tree.
  •     To maintain all the metadata for all files and the localization of each block of files.
  •     To ensure that each block is replicated on different nodes.

When a client wants to read a file or process any data, it first asks the NameNode for the metadata, to know which DataNodes it has to ask for the data.
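You can look at this metadata yourself with the standard fsck tool, which asks the NameNode for the list of blocks of a given path and the DataNodes that hold them; the path below is just a placeholder.

hadoop fsck /data/myfile.csv -files -blocks -locations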

Balancer Node

Its job is relatively simple: it looks at how the data is spread across the cluster and orders data to be moved if the distribution is not good. This often happens after a disk failure or a node failure.

High availability

In this architecture, if the NameNode fails, the whole distributed file system is completely blocked! That is why it is possible to add a second NameNode that stays in standby and takes over if the active NameNode has a problem.

All the Hadoop distributions allow you to have only one active NameNode at a time, except one! MapR gives you the possibility to have several active NameNodes because they have developed their own distributed file system. It is completely compatible with the open-source HDFS, but you have to buy the MapR distribution.