
Hadoop: HDFS Optimizations

The default configuration of a Hadoop distribution is not optimal for every application. In Hadoop you can tune a lot of different settings that change the behaviour of the cluster. All of these settings have a default value given by Hadoop, but the distributions sometimes change these defaults to get reasonable performance regardless of the application running on your cluster.

In this article I will explain some of the settings that you can customize to improve the performance of HDFS, and also the impact of each of these settings.

Before reading this article, I recommend reading the following article first, to fully understand this post.

Heap Size of the Namenode and Datanode

 

The NameNode and the DataNode are applications that run in a JVM. For each JVM you have to define the maximum memory that can be allocated to the application. To size the heap of the NameNode, you need to know the maximum number of data blocks your cluster will store: the common rule is 1GB of heap per million data blocks. For the DataNode, it depends on your workload, and especially on the amount of data that can be sent at the same time to the same DataNode: the heap acts as a buffer when the throughput of the disks is not sufficient. Cloudera and Hortonworks let you change these settings directly from the graphical interface. For MapR you need to change them in the following file: /opt/mapr/conf/warden.conf.
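As an illustration, here is how the heap could be set on a plain Apache Hadoop installation through hadoop-env.sh; the values below are only examples, and the exact file and variables differ between distributions (Cloudera, Hortonworks and MapR each have their own mechanism):

# In $HADOOP_HOME/conf/hadoop-env.sh (example values only)
# Roughly 1GB of NameNode heap per million data blocks, so ~4GB for ~4 million blocks
export HADOOP_NAMENODE_OPTS="-Xmx4g -Xms4g $HADOOP_NAMENODE_OPTS"
# DataNode heap, sized according to the amount of data it may have to buffer
export HADOOP_DATANODE_OPTS="-Xmx2g $HADOOP_DATANODE_OPTS"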

 

Size of the blocks

When a file is stored in HDFS, it is split into blocks of a specific size. This size can have a big impact on the performance if you have to read a file several times. For example, say you have a block size of 128MB and you want to read a 10GB file from a client outside the cluster. This file has been split into 79 blocks saved across the cluster, so to read it, HDFS needs to fetch those 79 blocks from different locations in the cluster to rebuild the file. It also means that there are 79 entries in the metadata table on the NameNode.

If you change the HDFS block size to 256MB, you divide by 2 the number of entries in your metadata table, and also the number of connections needed to get all the blocks back.
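If you want to check how many blocks a given file really occupies, you can ask the NameNode with fsck (the path below is just an example):

hadoop fsck /data/myfile.csv -files -blocks -locations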

Why not set the block size to 10GB?

Because it means that when you want to read the file from an external client, you will only have one connection. If the file is split across more servers, it can be rebuilt faster because the transfers can be done in parallel. That is why, when you set the block size, you need to know the average size of your files: the goal is not to open too many simultaneous connections to the client, but enough to get a file back as fast as possible.

Moreover, you need to remember that in this environment the data is processed where it is located. So with a block size of 10GB, the file will sit on at most 3 servers if your block replication is 3. The data will then be processed on only those three servers, or it will have to be sent across the network to the other servers to be processed, and you would lose a lot of time.

In practice, the common block size is 128MB or 256MB. To change it, modify the following property in the “$HADOOP_HOME/conf/hdfs-site.xml” file (the value is expressed in bytes, so 128MB is 134217728).

<property>
<name>dfs.block.size</name>
<value>134217728</value>
</property>
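If you only need a different block size for one specific file, you can also pass the property when you write it; the file and path below are just examples, and on recent versions the property is called dfs.blocksize:

hadoop fs -D dfs.block.size=268435456 -put bigfile.csv /data/bigfile.csv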
 

Block replication

I think this is the setting with the biggest impact on the performance of writing data into HDFS. If the block replication is set higher than 1, each block also has to be sent to the additional DataNodes and written there, and the write is only acknowledged once every replica is in place. Each extra replica costs network traffic and disk writes, so you lose time on every write.

As you can see with the TestDFSIO benchmark, the write throughput can be two to three times higher when the replication factor is reduced, depending on the distribution.
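If you want to run this benchmark yourself, TestDFSIO ships with the Hadoop test jar; the jar name varies between versions and distributions, so the commands below are only an example:

hadoop jar $HADOOP_HOME/hadoop-test-*.jar TestDFSIO -write -nrFiles 10 -fileSize 1000
hadoop jar $HADOOP_HOME/hadoop-test-*.jar TestDFSIO -read -nrFiles 10 -fileSize 1000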

The default value for the replication factor is 3, but you can change it with the following setting in the “$HADOOP_HOME/conf/hdfs-site.xml” file. This setting applies to all new files unless a different replication factor is specified when the file is created.

<property>
<name>dfs.replication</name>
<value>3</value>
</property>
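For files that already exist in HDFS, you can change the replication factor with the setrep command (the path is just an example, and the -w flag waits until the re-replication is finished):

hadoop fs -setrep -w 2 /data/myfile.csv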

 

Compression of the data

If you have a huge quantity of data to write and read, you can compress it to save space on disk and to write and read faster, at the cost of more CPU usage. Compression also helps reduce the network bandwidth utilization.

There are different compression codecs that HDFS can use; they are listed in Table 2: Compression codecs.

Codec name      Java class
DefaultCodec    org.apache.hadoop.io.compress.DefaultCodec
GzipCodec       org.apache.hadoop.io.compress.GzipCodec
BZip2Codec      org.apache.hadoop.io.compress.BZip2Codec
SnappyCodec     org.apache.hadoop.io.compress.SnappyCodec
LzoCodec        org.apache.hadoop.io.compress.LzoCodec

Table 2: Compression codecs

To configure the compression you need to change some settings in the file “$HADOOP_HOME/conf/mapred-site.xml”.

  • Enable the compression of the job output:
<property>
<name>mapred.output.compress</name>
<value>true</value>
</property>
 
  • Specify the compression codec:
 <property>
<name>mapred.output.compression.codec</name>
<value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>

 You can change the value with one of the codecs listed in Table 2: Compression codecs.

 

  • Change the compression type:
 <property>
<name>mapred.output.compression.type</name>
<value>BLOCK</value>
</property>

 

After any change to the mapred-site.xml file, you need to restart the MapReduce service for the changes to take effect.
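Alternatively, if you only want compression for one specific job, you can pass the same properties on the command line when you submit it, provided the job uses ToolRunner; the jar, class and paths below are only placeholders:

hadoop jar myjob.jar MyJob -D mapred.output.compress=true -D mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec /input /output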

 

Balancing the data blocks

 

When a file is saved in HDFS, it is split into multiple data blocks. If you have 6 nodes but the data blocks are distributed across only 3 of them, then when you process the data, only those 3 nodes will have it in their local storage; the others will need to fetch it across the network, which costs time. You can balance the blocks across the nodes, but also across the local disks of a node, to increase the throughput.

From my experience, all the distributions automatically balance the data blocks when they are created. If you still want to launch a rebalancing of the data blocks, run the following command on the NameNode.

hadoop balancer -threshold 0.2

The threshold option is a percentage of the DataNode capacity: it defines how far the disk usage of each DataNode is allowed to deviate from the average usage of the cluster once the balancing is finished.

There is a setting in HDFS that allows you to limit the bandwidth that can be used during the rebalancing. You can configure it in the “$HADOOP_HOME/conf/hdfs-site.xml” file, with the following property (the value is in bytes per second, so 10485760 is 10MB/s).

<property>
<name>dfs.balance.bandwidthPerSec</name>
<value>10485760</value>
</property>
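You can also change this limit at runtime, without editing the configuration file, with the dfsadmin command; the value is again in bytes per second (here 10MB/s):

hadoop dfsadmin -setBalancerBandwidth 10485760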