Wednesday, February 17, 2016

Sqoop - Export (Session 5)

HDFS -> Sqoop -> RDBMS

Before export, prepare the database by creating the target table.

Example: exporting the zip_profits table from Hive to MySQL

mysql> CREATE TABLE ms_zip_profits (zip INT, profit DECIMAL(8,2));
Query OK, 0 rows affected

% sqoop export --connect jdbc:mysql://localhost/hadoopguide -m 1 \
> --table ms_zip_profits --export-dir /user/hive/warehouse/zip_profits \
> --input-fields-terminated-by '\t'

OK

% mysql hadoopguide -e 'SELECT * FROM ms_zip_profits'

How Sqoop Export Works
The architecture of Sqoop export is very similar to that of Sqoop import:

- Sqoop picks a strategy based on the connect string (JDBC or a direct, database-specific mode)
- It then generates a Java class from the target table definition (see the codegen sketch after this list); this class can
            * parse input records from text files
            * insert values of the appropriate types into a table
            * read columns from a ResultSet
- It launches a MapReduce job that
            * reads source data files from HDFS
            * parses records using the generated Java class
            * executes the chosen JDBC strategy
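
The class-generation step can also be run on its own with Sqoop's codegen tool, which is handy for inspecting what the export job will work with. A minimal sketch against the example table above (the class name MsZipProfits is my own choice):

% sqoop codegen --connect jdbc:mysql://localhost/hadoopguide \
> --table ms_zip_profits --class-name MsZipProfits --bindir .

The generated .java and compiled .jar land in the current directory (--bindir .), and the jar can be reused by a later job via --jar-file, as the SequenceFile example below shows.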


Sqoop's JDBC-based strategy builds up batch INSERT statements, each adding multiple records to the target table. Separate threads are used to read from HDFS and to communicate with the database.
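
The batching can be tuned through the sqoop.export.records.per.statement and sqoop.export.statements.per.transaction properties, passed as generic -D arguments right after the tool name. A sketch reusing the earlier export (the values here are illustrative, not recommendations):

% sqoop export -Dsqoop.export.records.per.statement=100 \
> -Dsqoop.export.statements.per.transaction=100 \
> --connect jdbc:mysql://localhost/hadoopguide -m 1 \
> --table ms_zip_profits --export-dir /user/hive/warehouse/zip_profits \
> --input-fields-terminated-by '\t'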

Parallelism
Most MapReduce jobs pick their degree of parallelism based on the number and size of the files to process; Sqoop instead uses CombineFileInputFormat to group the input files into a smaller number of map tasks.
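
The number of map tasks (and hence the number of concurrent connections to the database) can also be capped explicitly with -m/--num-mappers; a sketch based on the earlier export:

% sqoop export --connect jdbc:mysql://localhost/hadoopguide \
> --table ms_zip_profits --export-dir /user/hive/warehouse/zip_profits \
> --input-fields-terminated-by '\t' --num-mappers 4

Each map task writes over its own database connection, so raising the mapper count raises the concurrent load on the target.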

Exports and SequenceFiles
Sqoop can also export records that are not in a Hive table. It can export delimited text files, such as the output of MapReduce jobs, and it can export records stored in SequenceFiles to an output table, though some restrictions apply: the records are read from the value portion of the SequenceFile and must be instances of a Sqoop-generated record class, which is why the export below passes --jar-file.

Example of a SequenceFile import followed by an export:

% sqoop import --connect jdbc:mysql://localhost/hadoopguide \
> --table widgets -m 1 --class-name WidgetHolder --as-sequencefile \
> --target-dir widget_sequence_files --bindir .
...
10/07/05 17:09:13 INFO mapreduce.ImportJobBase: Retrieved 3 records.

% mysql hadoopguide
mysql> CREATE TABLE widgets2(id INT, widget_name VARCHAR(100),
-> price DOUBLE, designed DATE, version INT, notes VARCHAR(200));
Query OK, 0 rows affected (0.03 sec)

mysql> exit;

% sqoop export --connect jdbc:mysql://localhost/hadoopguide \
> --table widgets2 -m 1 --class-name WidgetHolder \
> --jar-file widgets.jar --export-dir widget_sequence_files
...
10/07/05 17:26:44 INFO mapreduce.ExportJobBase: Exported 3 records.

More Examples of Sqoop Export

sqoop list-tables --driver com.teradata.jdbc.TeraDriver --connect connectString \
  --username hue --password hue

sqoop eval --driver com.teradata.jdbc.TeraDriver --connect connectString \
  --username hue --password hue --query "DELETE STG.POS_TABLE ALL"

sqoop export -Dsqoop.export.records.per.statement=5000 \
  --driver com.teradata.jdbc.TeraDriver --connect connectString \
  --username hue --password hue \
  --columns "TERR_CD,POS_BUSN_DT,gbal_id_nu,pos_area_typ_shrt_ds,pos_prd_dlvr_meth_cd,dypt_id_nu,dypt_ds,dy_of_cal_wk_ds,cd,digl_offr_trn_cnt_qt,digl_offr_trn_itm_cnt_qt,digl_offr_net_trn_am,digl_offr_grss_trn_am,non_digl_offr_trn_cnt_qt,non_digl_offr_trn_itm_cnt_qt,non_digl_offr_net_trn_am,non_digl_offr_grss_trn_am,tot_trn_cnt_qt,tot_trn_itm_cnt_qt,tot_net_trn_am,tot_grss_trn_am,digl_offr_tot_ord_sc_qt,non_digl_offr_tot_ord_sc_qt,tot_ord_tm_sc_qt,curn_iso_nu" \
  --table STG.TABLE --input-fields-terminated-by '\t' -m 1 \
  --export-dir /user/hive/warehouse/DLY_REC --batch

The export above was failing with records.per.statement set to 10,000; reducing it to 5,000 worked fine, presumably because very large multi-row INSERT statements exceed the database's request-size limit. Note that sqoop.export.records.per.statement is a Hadoop property, so it must be passed with -D before the tool-specific arguments.
