A star schema is the simplest form of a dimensional
model, in which data is organized into facts and dimensions.
A fact is an event that is counted or measured, such as a sale or login.
A dimension contains reference information about the fact, such as date,
product, or customer. A star schema is diagrammed by placing each fact at the
center, surrounded by its associated dimensions.
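As a minimal sketch, a star schema in HiveQL might look like the following. The table and column names here are illustrative assumptions, not the actual tables from the research described below:

```sql
-- Hypothetical fact table: one row per measured event (a sale).
CREATE TABLE fact_sales (
  date_key     INT,
  product_key  INT,
  customer_key INT,
  sale_amount  DECIMAL(10,2)
);

-- Hypothetical dimension tables: reference data surrounding the fact.
CREATE TABLE dim_date (
  date_key      INT,
  calendar_date DATE,
  year          INT,
  month         INT
);

CREATE TABLE dim_product (
  product_key  INT,
  product_name STRING,
  category     STRING
);
```

Queries then join the fact table to each dimension on its key, which is what gives the diagram its star shape.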
Flat File Representation
Star Schema Representation
Data Load Process
Experiment 1: Flat File vs Star Schema
Experiment 2: Compressed Sequence Files
Structuring data properly in Hive is as important as in an RDBMS. The decision to store data in a sequence file format alone accounted for a performance improvement of more than 1,000%. The judicious use of indexes and partitions resulted in significant performance gains by reducing the amount of data processed.
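A sketch of the kind of table definition the experiments point toward, combining a compressed sequence file format with partitioning; the table name and partition column are assumptions for illustration:

```sql
-- Illustrative only: enable block-level compression for query output
-- (standard Hive settings, shown here as an example configuration).
SET hive.exec.compress.output=true;
SET mapred.output.compression.type=BLOCK;

-- Partitioning by date lets Hive prune whole partitions at query time,
-- reducing the amount of data each MapReduce job must read.
CREATE TABLE fact_sales_seq (
  product_key  INT,
  customer_key INT,
  sale_amount  DECIMAL(10,2)
)
PARTITIONED BY (sale_date STRING)
STORED AS SEQUENCEFILE;
```

A query that filters on `sale_date` will then touch only the matching partition directories instead of scanning the full table.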
For
data architects working in the Hive environment, the good news is that many of
the same techniques, such as indexing, that are used in a traditional RDBMS
environment are applicable. For those who have worked with MPP
databases, the concept of data partitioned across nodes will be very familiar.
The key
takeaway is that we need to understand our data and the underlying technology
in Hadoop to effectively tune our data structures. Simply creating a flat table
or star schema does not result in optimized structures. We need to understand
how our data is distributed, and we need to create data structures that work
well for the access patterns of our environment. Deciphering
MapReduce job logs and running explain plans are key skills for effectively
modeling data in Hive. We also need to be aware that tuning for some queries can
adversely affect others, as we saw with partitioning.
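The explain plans mentioned above are produced in Hive with the EXPLAIN keyword; the table and columns in this sketch are hypothetical:

```sql
-- EXPLAIN shows the stages Hive will run for a query, including
-- whether a partition filter limits the input paths that are scanned.
EXPLAIN
SELECT product_key, SUM(sale_amount)
FROM fact_sales_seq
WHERE sale_date = '2013-01-01'
GROUP BY product_key;
```

Comparing the plan with and without the partition filter is a quick way to confirm that partition pruning is actually taking effect.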
* This holds true only for Hive (not HBase).
(Excerpts from research on data modeling in Hadoop (Hive) with a star schema)
(References: data-modeling-hadoop.pdf, http://searchdatamanagement.techtarget.com)