Wednesday, October 7, 2015

How to begin with a new HBase Application

Assumption:
- You are clear on reasons of choosing HBase for implementation.
- You have basic knowledge of HBase

Begin with HBase

1. Understand and be clear with the client requirement. Primarily on how they want to retrieve the output.
2. Design HBase schema based on the clients retrieval criteria.
3. Be very very particular about the row key designing.
4. Design schema in such a way that it needs minor or no changes.
5. Since space is not a constraint in HBase, choose to design multiple tables if needed, focussing on the search speed to be within 1ms (if found in cache) to 10ms(if not found in cache)
6. Remember, table value based search needs full table scan across all regions and gives least performance.
7. Give maximum time to designing of schema. Coding will take minimum time, and it would turn out to be a full proof customer delight product.

Tips on Row key design


Thinking of our row key like a funnel. Start with the piece of data we always need to partition on, and narrow it down to the more specific data that we don’t often need to distinguish

Based on access pattern, either uses sequential or random keys. 
ØOften a combination of both is needed.
ØOvercomes architectural limitations. 
ØNeither is necessarily bad. 
ØUse bulk import for sequential keys and reads. 
ØRandom keys are good for random access patterns

Sequential Keys are the ones which has timestamp in the beginning. This leads to hotspotting.  
<timestamp><more key>: {CF: {CQ: {TS : Val}}}
Hotspotting on Regions is considered bad, instead do one of the following:
ØSalting
ØPrefix <timestamp> with distributed value
ØBinning or bucketing rows across regions
ØPrefix row keys to gain spread
ØUse well known or numbered prefixes
ØUse modulo to spread across servers
ØEnforce common data stay close to each other for subsequent scanning or  MapReduce processing
ØKey field swap/promotion
ØMove <more key> before the timestamp (later)
ØRandomization
ØMove <timestamp> out of key