HBase history data cleanup (Quick Read)
How can I delete the history data of a certain period? What exactly happens in the background during the deletion process?
What is compaction?
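Compaction is the process of merging the HFiles of a store into fewer, larger HFiles, to minimize the seeks needed to read data. There are two types: minor compaction, which merges a few smaller HFiles into one, and major compaction, which merges all HFiles of a store into a single HFile.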
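As a rough illustration of why merging helps, the toy model below treats each HFile as a sorted list of (row key, value) pairs and merges them into one sorted list, so a read has one file to search instead of several. The function name and sample data are hypothetical, and this is not HBase code.

```python
import heapq

def compact(hfiles):
    """Merge several sorted 'HFiles' (lists of (row_key, value)) into one sorted list."""
    # heapq.merge performs a streaming k-way merge of already sorted inputs.
    return list(heapq.merge(*hfiles))

hfile_a = [("row1", "a"), ("row4", "d")]
hfile_b = [("row2", "b"), ("row3", "c")]
merged = compact([hfile_a, hfile_b])
print(merged)
```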
Which compaction do we need to run after our deletion process?
During major compaction, all deleted and expired cells are removed, whereas the HFiles written by a minor compaction still contain the deleted ones. So after modifying the TTL value (described below), we need to run a major compaction.
Is it possible to delete the entire history data of a certain period for an HBase table at once?
We can do it only at the column family level of a table, so the process needs to be repeated for every column family in the HBase table.
What is the deletion process?
We need to set the Time To Live (TTL) value for a column family. Cells whose age exceeds the TTL are dropped during major compaction, i.e. the HBase files are rewritten without the expired data. After setting the TTL, run a major compaction on that column family. If the cleanup needs to apply to the entire table, set the TTL on each column family in the table. If this is a one-time activity, we can change the TTL back to its previous value once the major compaction has finished.
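Because the TTL is expressed in seconds of cell age, deleting everything written before a given date means converting that date into an age. The sketch below shows the arithmetic; the function name and the dates are hypothetical. Note that a TTL expires all cells older than the cutoff, so it cannot target a period in the middle of the history, and the effective cutoff keeps moving forward for as long as the TTL stays in place.

```python
from datetime import datetime, timezone

def ttl_seconds_for_cutoff(cutoff, now=None):
    """Return the TTL (in seconds) that expires every cell written before `cutoff`."""
    now = now or datetime.now(timezone.utc)
    ttl = int((now - cutoff).total_seconds())
    if ttl <= 0:
        raise ValueError("cutoff must be in the past")
    return ttl

# Hypothetical example: on 31 Jan 2024, expire everything written before 1 Jan 2024.
now = datetime(2024, 1, 31, tzinfo=timezone.utc)
cutoff = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(ttl_seconds_for_cutoff(cutoff, now))  # 30 days = 2592000 seconds
```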
Recommendation: Run the major compaction when the load on the HBase cluster is low, because disk I/O is high during this process as it rewrites the data in the StoreFiles (a StoreFile is a facade of an HFile).
How does major compaction identify the deleted/expired cells?
A deletion marker, also called a tombstone, is written when an explicit deletion happens. During compaction, these markers are identified and processed accordingly. If deletion happens because of an expired TTL, no tombstone is created; the expired data is simply filtered out and is not written back to the compacted StoreFile.
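The two paths described above can be sketched as a small filter. This is a simplified model under stated assumptions, not HBase's implementation: tombstones are tracked per row key only (real delete markers can also target a column family, a column, or a single version), timestamps are plain seconds, and all names are hypothetical.

```python
def major_compact(cells, tombstones, ttl_seconds, now):
    """Toy model: return the cells a major compaction would write back.

    cells      : list of (row_key, timestamp_seconds, value)
    tombstones : dict of row_key -> delete timestamp; masks cells at or before it
    """
    survivors = []
    for row_key, ts, value in cells:
        if now - ts > ttl_seconds:
            continue  # expired by TTL: no tombstone involved, just not rewritten
        if row_key in tombstones and ts <= tombstones[row_key]:
            continue  # masked by an explicit delete marker
        survivors.append((row_key, ts, value))
    # The tombstones themselves are not written back either.
    return survivors

cells = [
    ("r1", 100, "old"),      # age 900 > TTL 500: expired
    ("r2", 700, "deleted"),  # covered by the tombstone at 800
    ("r2", 900, "new"),      # written after the tombstone: kept
    ("r3", 600, "kept"),     # age 400, no tombstone: kept
]
print(major_compact(cells, {"r2": 800}, ttl_seconds=500, now=1000))
```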
How do we set the TTL and run major compaction?
In the HBase shell:
alter 'table name', NAME => 'columnfamily', TTL => <value in seconds>
major_compact 'table name', 'columnfamily'
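Since the two commands have to be repeated for every column family, a small helper can generate them. The sketch below only builds the shell lines as text, which can then be pasted into the HBase shell; the table name, column family names, and TTL value are hypothetical.

```python
def cleanup_commands(table, families, ttl_seconds):
    """Build the HBase shell lines that set a TTL and major-compact each column family."""
    lines = []
    for cf in families:
        lines.append(f"alter '{table}', NAME => '{cf}', TTL => {ttl_seconds}")
        lines.append(f"major_compact '{table}', '{cf}'")
    return lines

# Hypothetical table 'history' with two column families and a 30-day TTL.
for line in cleanup_commands("history", ["cf1", "cf2"], 2592000):
    print(line)
```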
Even after running major compaction, I still see that the data is not deleted in HBase. What should I do?
Check whether any snapshots of the table exist. If a snapshot still references the old HFiles, major compaction moves those files to the archive instead of removing them, so the deleted data continues to occupy disk space. Delete the snapshots that are no longer needed.
Reference: Apache HBase Reference Guide and Cloudera documentation.
The main objective of this article is to create a lineage of information about HBase data deletion. Watch this space for more such articles related to Hadoop.