Category Archives: HBase

Add compression to existing HBase table?

Hello Everyone,

Recently, I had to add compression to a HBase table that already had million of lines. Why adding the compression in HBase? Firstly to reduce the needed space in HDFS and secondly you will reduce the iops needed. If the block saved in HDFS is 4 times smaller then you will need 4 times less iops to get the info. In another hand you are going to increase the CPU consumption to compress/uncompress the data. So you have to keep in mind that it can have a relative big impact on the CPU load.

Here’s a very short How to. Maybe one day you will need the info! šŸ™‚

Firstly, you need to disable the table, in my case the table is “events” in the HBase shell:

disable ‘events’

Then, you can make a “alter table”. you can choose on which column family you want to have compression and you can also choose different kind of algorithm compression (SNAPPY, LZO, GZ or LZ4).

I choose SNAPPY because it’s one than use the least CPU.

alter ‘events’, {NAME=>’metadata’, COMPRESSION=>’SNAPPY’}, {NAME=>’data’, COMPRESSION=>’SNAPPY’}



Once it’s done, you can re-enable the table

enable ‘events’

Now if you have a look at the size (in HDFS) of the table you are not going to see the difference. You need to wait for a major compaction. With the default installation of Hortonworks it’s every 7 days. You can launch manually a major compaction with the following command in the HBase shell:

major_compact ‘events’

Wait a few minutes depending on the size of the table and the CPU resource available on your cluster.

That’sĀ  it, you added snappy compression to your table. šŸ™‚Ā  In my case, it reduce by 4 the size.


HBase architecture

It is a NoSQL (Not Only SQL) database, more particularly a non-relational distributed column oriented database. It was inspired by Google is BigTable and developed by the Apache foundation. It works on top of the HDFS and use the library MapReduce. All the major Hadoop distribution provides HBase as a default package.

When should we use it?

This kind of database has to be used when you have millions or billions of rows in your database. You can not just pass from an RDBMS (Relational DataBase Management System) to an HBase by changing the JDBC. When you do it you have to completely redesign your application because the schema is not fixed. The number of columns of a row to another can be different.

HBase architecture

It is based on the master/slave architecture and is made out of two components. TheHMaster will be the master node and the HRegionservers are the slave nodes.

The HMaster is responsible for all the administrative operations. The data are stored in a table and this table is stored in a HRegion. If the table becomes too big, it splits into several HRegions. A region is stored on a HRegionServer, so the table can be stored all across the cluster.

The HMaster is also responsible for the allocation of the regions to HRegionservers and the control of the load. This means that it ensures that all the HRegions are not stored on the same HRegionservers.

As you can see in the next figure, the HRegionservers stores the HRegion and to do that it has several components.

HBase architecture
HBase architecture
  • HLog: He saves all the necessary data about each region saved on this HRegionserver and all the edits.
  • MemStore: It is an in-memory storage to store the log and the data. When there is enough data it writes it on the HDFS. The reason is that the HDFS allows you to write once a file or block of the file. If you have some edits to do you have to rewrite the complete block and this takes time. It is why the MemStore uses the in-memory processing to improve the performance.
  • StoreFile: It is the HDFS files where the data are saved when the MemStore is full. The management of these can introduce performance consideration that you need to be aware, it will be explained in detail in the next paragraph.

Flush and compactions

When we insert or modify a row in the database, it is firstly written in memory into the memstore. When the memstore reaches a specific size the data will be flushed to a new HDFS file by the intermediate of the storefile. When the HBase region has 3 storefiles and a full memstore, it will perform what we call a compaction. It will create a bigger storefile with the content of the 3 storefiles and the memstore.


HBase compactions
HBase compactions

HBase compactions figure is a chronogram of HBase flushing and compaction, as you can see the next compaction procedure will happen when you have 4 storefiles, it is because the compaction procedure follows the following algorithm:

  1. Parse all the stored files from the oldest to the youngest.
  2. If there are more than 3 storefiles left and the current store file is 20% larger than the sum of all the younger storefiles, and if it is larger than the memstore flush size, then we go to the next younger storefile and repeat step 2.
  3. If one of the conditions in step two is not valid, the storefiles from the current one to the youngest one are the ones that will be merged together. If there is less than the compaction threshold, no merge will be performed. There is also a limit which prevents more than 10 storefiles to be merged in one compaction.

The goal of doing compaction is to limit the number of store files if the data grow to really big size. To limit the number of store implies to reduce the number of entries into the HBase master table and the metadata table of the NameNode.