Hello Everyone,
Recently, I had to add compression to an HBase table that already had millions of rows. Why add compression in HBase? Firstly, to reduce the space needed in HDFS, and secondly, to reduce the IOPS needed: if the block stored in HDFS is 4 times smaller, you will need roughly 4 times fewer IOPS to read the same data. On the other hand, you are going to increase the CPU consumption to compress/uncompress the data, so keep in mind that it can have a relatively big impact on the CPU load.
Here’s a very short how-to. Maybe one day you will need the info!
First, you need to disable the table in the HBase shell; in my case the table is “events”:
disable 'events'
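If you want to double-check that the table is really offline before altering it, you can ask the shell (it should return true before you continue):

is_disabled 'events'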
Then, you can run an “alter table”. You can choose which column families you want to compress, and you can also choose between different compression algorithms (SNAPPY, LZO, GZ or LZ4).
I chose SNAPPY because it’s the one that uses the least CPU.
alter 'events', {NAME=>'metadata', COMPRESSION=>'SNAPPY'}, {NAME=>'data', COMPRESSION=>'SNAPPY'}
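To confirm the change was taken into account, you can describe the table and check that COMPRESSION => 'SNAPPY' now appears on both column families:

describe 'events'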
Once it’s done, you can re-enable the table:
enable 'events'
Now, if you have a look at the size of the table in HDFS, you are not going to see any difference yet. You need to wait for a major compaction; with the default Hortonworks installation it runs every 7 days. You can manually launch a major compaction with the following command in the HBase shell:
major_compact 'events'
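If you want to follow the progress, recent HBase versions also let you query the compaction state of the table (it reports MAJOR while the major compaction is running and NONE once it’s done); depending on your HBase version this command may not be available:

compaction_state 'events'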
Wait a few minutes, depending on the size of the table and the CPU resources available on your cluster.
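To compare the size before and after, you can check the table directory in HDFS. The path below is the default one on a Hortonworks (HDP) installation, so adapt it to your own setup:

hdfs dfs -du -s -h /apps/hbase/data/data/default/events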
That’s it, you added Snappy compression to your table! In my case, it reduced the size by a factor of 4.