Assumption:
- You are clear on reasons of choosing HBase for implementation.
- You have basic knowledge of HBase
Begin with HBase
1. Understand and be clear with the client requirement. Primarily on how they want to retrieve the output.
2. Design HBase schema based on the clients retrieval criteria.
3. Be very very particular about the row key designing.
4. Design schema in such a way that it needs minor or no changes.
5. Since space is not a constraint in HBase, choose to design multiple tables if needed, focussing on the search speed to be within 1ms (if found in cache) to 10ms(if not found in cache)
6. Remember, table value based search needs full table scan across all regions and gives least performance.
7. Give maximum time to designing of schema. Coding will take minimum time, and it would turn out to be a full proof customer delight product.
Tips on Row key design
- You are clear on reasons of choosing HBase for implementation.
- You have basic knowledge of HBase
Begin with HBase
1. Understand and be clear with the client requirement. Primarily on how they want to retrieve the output.
2. Design HBase schema based on the clients retrieval criteria.
3. Be very very particular about the row key designing.
4. Design schema in such a way that it needs minor or no changes.
5. Since space is not a constraint in HBase, choose to design multiple tables if needed, focussing on the search speed to be within 1ms (if found in cache) to 10ms(if not found in cache)
6. Remember, table value based search needs full table scan across all regions and gives least performance.
7. Give maximum time to designing of schema. Coding will take minimum time, and it would turn out to be a full proof customer delight product.
Tips on Row key design
Thinking of our row key like a funnel. Start with the
piece of data we always need to partition on, and narrow it down to the more
specific data that we don’t often need to distinguish
Based on access pattern, either uses sequential or random
keys.
ØOften a combination of both is needed.
ØOvercomes architectural limitations.
ØNeither is necessarily bad.
ØUse bulk import for sequential keys and reads.
ØRandom keys are good for random access patterns
Sequential Keys are the ones which has timestamp in the
beginning. This leads to hotspotting.
<timestamp><more key>: {CF:
{CQ: {TS : Val}}}
Hotspotting on Regions is considered bad, instead do one of the
following:
ØSalting
ØPrefix <timestamp> with distributed value
ØBinning or bucketing rows across regions
ØPrefix row keys to gain spread
ØUse well known or numbered prefixes
ØUse modulo to spread across servers
ØEnforce common data stay close to each other for subsequent
scanning or MapReduce processing
ØKey field swap/promotion
ØMove <more key> before the timestamp (later)
ØRandomization
ØMove <timestamp> out of key