Sunday, February 14, 2016

Star Schema in Hadoop

A star schema is the simplest form of a dimensional model, in which data is organized into facts and dimensions. A fact is an event that is counted or measured, such as a sale or login. A dimension contains reference information about the fact, such as date, product, or customer. A star schema is diagrammed by surrounding each fact with its associated dimensions.

Flat File Representation
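As a sketch of what this looks like in HiveQL (table and column names here are illustrative, not taken from the original experiment), the flat file representation denormalizes everything into one wide table, with dimension attributes repeated on every row:

```sql
-- Hypothetical flat (denormalized) table: every row carries
-- its own copies of the date and product attributes.
CREATE TABLE sales_flat (
  calendar_date STRING,
  sale_month    INT,
  sale_year     INT,
  product_name  STRING,
  category      STRING,
  quantity      INT,
  sale_amount   DOUBLE
);
```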

Star Schema Representation
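In the star schema representation, the same data splits into a fact table surrounded by dimension tables. A hedged HiveQL sketch (again, names are hypothetical) might look like:

```sql
-- Hypothetical dimensions: reference data about the fact.
CREATE TABLE dim_date (
  date_key      INT,
  calendar_date STRING,
  sale_month    INT,
  sale_year     INT
);

CREATE TABLE dim_product (
  product_key  INT,
  product_name STRING,
  category     STRING
);

-- Hypothetical fact table: the measured event, keyed to its dimensions.
-- (Hive does not enforce foreign keys; the relationship is by convention.)
CREATE TABLE fact_sales (
  date_key    INT,    -- refers to dim_date.date_key
  product_key INT,    -- refers to dim_product.product_key
  quantity    INT,
  sale_amount DOUBLE
);
```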


Data Load Process
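One common load pattern, sketched here as an assumption rather than the post's actual pipeline, is to stage raw delimited files in an external table and then populate the fact table with an INSERT ... SELECT that resolves the dimension keys:

```sql
-- Hypothetical staging table over raw comma-delimited files in HDFS.
CREATE EXTERNAL TABLE sales_staging (
  calendar_date STRING,
  product_name  STRING,
  quantity      INT,
  sale_amount   DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/staging/sales';

-- Populate the fact table, looking up surrogate keys from the dimensions.
INSERT OVERWRITE TABLE fact_sales
SELECT d.date_key, p.product_key, s.quantity, s.sale_amount
FROM sales_staging s
JOIN dim_date d    ON s.calendar_date = d.calendar_date
JOIN dim_product p ON s.product_name  = p.product_name;
```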



Experiment 1: Flat File vs Star Schema
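The shape of the comparison can be sketched as the same aggregate run against both representations (queries are illustrative, not the experiment's actual workload): the flat table answers it with a single scan, while the star schema pays for a join against the dimension:

```sql
-- Flat table: one scan, no joins.
SELECT category, SUM(sale_amount) AS total_sales
FROM sales_flat
GROUP BY category;

-- Star schema: fact joined to its dimension before aggregating.
SELECT p.category, SUM(f.sale_amount) AS total_sales
FROM fact_sales f
JOIN dim_product p ON f.product_key = p.product_key
GROUP BY p.category;
```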


Experiment 2: Compressed Sequence Files
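Converting a table to a compressed sequence file format in Hive of that era could be done with standard configuration settings and a CREATE TABLE ... AS SELECT; a minimal sketch, assuming Snappy block compression is available on the cluster:

```sql
-- Enable block-level compression for query output.
SET hive.exec.compress.output=true;
SET mapred.output.compression.type=BLOCK;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- Rewrite the fact table as a compressed sequence file.
CREATE TABLE fact_sales_seq
STORED AS SEQUENCEFILE
AS SELECT * FROM fact_sales;
```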

Structuring data properly in Hive is as important as in an RDBMS. The decision to store data in a sequence file format alone accounted for a performance improvement of more than 1,000%. The judicious use of indexes and partitions resulted in significant performance gains by reducing the amount of data processed.
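The partitioning and indexing techniques mentioned above might look like the following in HiveQL (a sketch under the assumption that queries commonly filter on year; note that Hive's CREATE INDEX existed in this era but was later removed in Hive 3):

```sql
-- Partition by year so queries filtering on sale_year
-- read only the matching partition directories.
CREATE TABLE fact_sales_part (
  date_key    INT,
  product_key INT,
  quantity    INT,
  sale_amount DOUBLE
)
PARTITIONED BY (sale_year INT)
STORED AS SEQUENCEFILE;

-- Compact index on the join key, then build it.
CREATE INDEX idx_product
ON TABLE fact_sales_part (product_key)
AS 'COMPACT' WITH DEFERRED REBUILD;

ALTER INDEX idx_product ON fact_sales_part REBUILD;
```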

For data architects working in the Hive environment, the good news is that many of the same techniques used in a traditional RDBMS environment, such as indexing, still apply. For those of us coming from MPP databases, the concept of data partitioned across nodes is already well known.

The key takeaway is that we need to understand our data and the underlying technology in Hadoop to effectively tune our data structures. Simply creating a flat table or star schema does not result in optimized structures. We need to understand how our data is distributed, and we need to create data structures that work well for the access patterns of our environment. Being able to decipher MapReduce job logs as well as run explain plans are key skills to effectively model data in Hive. We need to be aware that tuning for some queries might have an adverse impact on other queries as we saw with partitioning.
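Running an explain plan in Hive is a single keyword in front of the query; for example, to see the MapReduce stages generated for a join (query is illustrative):

```sql
-- Show the stage plan Hive generates, without executing the query.
EXPLAIN
SELECT p.category, SUM(f.sale_amount)
FROM fact_sales f
JOIN dim_product p ON f.product_key = p.product_key
GROUP BY p.category;
```

The output lists the stages (map joins, shuffles, reducers), which is where partition pruning or its absence becomes visible.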

* This holds true only for Hive (not HBase)

(Excerpts from research on data modeling in Hadoop (Hive) with a star schema)
(References: data-modeling-hadoop.pdf, http://searchdatamanagement.techtarget.com)
