Tag Archives: compaction

HBase architecture

It is a NoSQL (Not Only SQL) database, more particularly a non-relational distributed column oriented database. It was inspired by Google is BigTable and developed by the Apache foundation. It works on top of the HDFS and use the library MapReduce. All the major Hadoop distribution provides HBase as a default package.

When should we use it?

This kind of database has to be used when you have millions or billions of rows in your database. You can not just pass from an RDBMS (Relational DataBase Management System) to an HBase by changing the JDBC. When you do it you have to completely redesign your application because the schema is not fixed. The number of columns of a row to another can be different.

HBase architecture

It is based on the master/slave architecture and is made out of two components. TheHMaster will be the master node and the HRegionservers are the slave nodes.

The HMaster is responsible for all the administrative operations. The data are stored in a table and this table is stored in a HRegion. If the table becomes too big, it splits into several HRegions. A region is stored on a HRegionServer, so the table can be stored all across the cluster.

The HMaster is also responsible for the allocation of the regions to HRegionservers and the control of the load. This means that it ensures that all the HRegions are not stored on the same HRegionservers.

As you can see in the next figure, the HRegionservers stores the HRegion and to do that it has several components.

HBase architecture
HBase architecture
  • HLog: He saves all the necessary data about each region saved on this HRegionserver and all the edits.
  • MemStore: It is an in-memory storage to store the log and the data. When there is enough data it writes it on the HDFS. The reason is that the HDFS allows you to write once a file or block of the file. If you have some edits to do you have to rewrite the complete block and this takes time. It is why the MemStore uses the in-memory processing to improve the performance.
  • StoreFile: It is the HDFS files where the data are saved when the MemStore is full. The management of these can introduce performance consideration that you need to be aware, it will be explained in detail in the next paragraph.

Flush and compactions

When we insert or modify a row in the database, it is firstly written in memory into the memstore. When the memstore reaches a specific size the data will be flushed to a new HDFS file by the intermediate of the storefile. When the HBase region has 3 storefiles and a full memstore, it will perform what we call a compaction. It will create a bigger storefile with the content of the 3 storefiles and the memstore.


HBase compactions
HBase compactions

HBase compactions figure is a chronogram of HBase flushing and compaction, as you can see the next compaction procedure will happen when you have 4 storefiles, it is because the compaction procedure follows the following algorithm:

  1. Parse all the stored files from the oldest to the youngest.
  2. If there are more than 3 storefiles left and the current store file is 20% larger than the sum of all the younger storefiles, and if it is larger than the memstore flush size, then we go to the next younger storefile and repeat step 2.
  3. If one of the conditions in step two is not valid, the storefiles from the current one to the youngest one are the ones that will be merged together. If there is less than the compaction threshold, no merge will be performed. There is also a limit which prevents more than 10 storefiles to be merged in one compaction.

The goal of doing compaction is to limit the number of store files if the data grow to really big size. To limit the number of store implies to reduce the number of entries into the HBase master table and the metadata table of the NameNode.