Using Sqoop

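Introduction:

Sqoop is an open-source Apache tool for transferring data between relational databases (Oracle, MySQL, and so on) and the Hadoop ecosystem (HDFS, Hive, HBase, and so on). It can import data from a relational database into HDFS, and it can export data from HDFS back into a relational database.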

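Versions:

Sqoop currently comes in two versions, Sqoop1 and Sqoop2. Despite sharing a name, the two are completely incompatible with each other. Sqoop1 is used more often in real projects, so this article uses Sqoop1.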

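Architecture and how it works:

Sqoop's architecture is very simple, arguably the simplest of any framework in the Hadoop ecosystem. It works as follows: the Sqoop client connects to Hadoop and translates each Sqoop job into a MapReduce job, which Hadoop then executes.

[Figure: Sqoop architecture diagram]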

Installing Sqoop is not covered in this article. For detailed installation steps, see:

http://blog.csdn.net/u010476994/article/details/72247562

Ways to run Sqoop commands:

1. Directly on the command line, for example:

sqoop list-databases --connect jdbc:mysql://192.168.152.101:3306/ --username root --password 123456

sqoop create-hive-table --connect jdbc:mysql://192.168.152.101:3306/ --username root --password 123456 --table TA_MBL_DVLP_PACKAGE_D --hive-table TA_MBL_DVLP_PACKAGE_DAY_TEST

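This creates a Hive table based on a MySQL table. Only the table structure is created, with no data. --table names the MySQL table and --hive-table names the Hive table.

2. Write the command into a file and have Sqoop read the file. For example:

Create a file named option1 with the following content: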

list-databases
--connect
jdbc:mysql://192.168.152.101:3306/
--username
root
--password
123456

Put each argument on its own line, then save and exit.

Run: sqoop --options-file option1
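
The options-file steps above could be scripted roughly as follows. This is a minimal sketch: the file name option1 and the connection values come from the example above, while the `command -v` guard is an addition so the script does nothing on a machine without Sqoop. The `#` line relies on Sqoop options files accepting comments on their own line.

```shell
#!/bin/sh
# Write the options file: one option or value per line.
cat > option1 <<'EOF'
# Lines starting with '#' are comments in a Sqoop options file.
list-databases
--connect
jdbc:mysql://192.168.152.101:3306/
--username
root
--password
123456
EOF

# Count the non-comment lines that were written.
arg_lines=$(grep -vc '^#' option1)
echo "option1 holds $arg_lines argument lines"

# Run Sqoop only if it is installed on this machine.
if command -v sqoop >/dev/null 2>&1; then
  sqoop --options-file option1
fi
```

Keeping the connection settings in a file like this also makes it easy to reuse them across several commands.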

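Common Sqoop commands and arguments:

Note: import and export are always described from the point of view of HDFS. A Sqoop import moves data into HDFS, and a Sqoop export moves data out of HDFS. The arguments below are taken from the official documentation:

1. import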

Argument	Description
`--append`	Append data to an existing dataset in HDFS
`--as-avrodatafile`	Imports data to Avro Data Files
`--as-sequencefile`	Imports data to SequenceFiles
`--as-textfile`	Imports data as plain text (default)
`--as-parquetfile`	Imports data to Parquet Files
`--boundary-query <statement>`	Boundary query to use for creating splits
`--columns <col,col,col…>`	Columns to import from table
`--delete-target-dir`	Delete the import target directory if it exists
`--direct`	Use direct connector if exists for the database
`--fetch-size <n>`	Number of entries to read from database at once.
`--inline-lob-limit <n>`	Set the maximum size for an inline LOB
`-m,--num-mappers <n>`	Use n map tasks to import in parallel
`-e,--query <statement>`	Import the results of `statement`.
`--split-by <column-name>`	Column of the table used to split work units. Cannot be used with `--autoreset-to-one-mapper` option.
`--autoreset-to-one-mapper`	Import should use one mapper if a table has no primary key and no split-by column is provided. Cannot be used with `--split-by <col>` option.
`--table <table-name>`	Table to read
`--target-dir <dir>`	HDFS destination dir
`--warehouse-dir <dir>`	HDFS parent for table destination
`--where <where clause>`	WHERE clause to use during import
`-z,--compress`	Enable compression
`--compression-codec <c>`	Use Hadoop codec (default gzip)
`--null-string <null-string>`	The string to be written for a null value for string columns
`--null-non-string <null-string>`	The string to be written for a null value for non-string columns

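Commonly used arguments explained:

(1) File formats: Sqoop supports a plain-text format and several binary formats. The binary formats listed in the table above are Avro, SequenceFile, and Parquet, selected with --as-avrodatafile, --as-sequencefile, or --as-parquetfile. Text is the default and also the most commonly used format.

(2) --columns

When importing, this specifies which columns of the relational table are imported into HDFS.

(3) --delete-target-dir

If the target HDFS directory already exists, it is deleted first and then the import runs.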

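(4) -m (--num-mappers)

Specifies how many map tasks run the import in parallel. The default is 4; with -m 1 the import runs in a single task with no parallelism. It is usually combined with --split-by, which names the column used to divide the work. If --split-by is omitted, Sqoop splits on the table's primary key.

For example, -m 3 --split-by id splits the import into three ranges by id and runs them in parallel. Roughly, it works like this:

Sqoop first queries min(id) and max(id) (assuming id is an integer column) to find the range of id values. If min(id) = 1 and max(id) = 15, Sqoop runs three parallel tasks covering roughly 1 to 5, 6 to 10, and 11 to 15.

Sqoop can also split on columns of other types, such as Date, Text, Float, Integer, Boolean, NText, and BigDecimal.

Where possible, choose a split column whose values are evenly distributed, so that each parallel task handles roughly the same amount of data and the import runs as efficiently as it can.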
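
The range arithmetic described above for min(id) = 1, max(id) = 15, and three mappers can be illustrated with plain shell arithmetic. This is only a sketch of the idea of dividing an integer range into contiguous chunks. It is not Sqoop's splitter code, and the variable names are made up for the illustration.

```shell
#!/bin/sh
# Illustration only: divide [min_id, max_id] into num_mappers
# contiguous chunks of (nearly) equal size.
min_id=1
max_id=15
num_mappers=3

total=$((max_id - min_id + 1))
base=$((total / num_mappers))    # minimum chunk size
extra=$((total % num_mappers))   # first 'extra' chunks get one more row

lo=$min_id
i=0
ranges=""
while [ "$i" -lt "$num_mappers" ]; do
  size=$base
  if [ "$i" -lt "$extra" ]; then
    size=$((size + 1))
  fi
  hi=$((lo + size - 1))
  echo "mapper $i: id >= $lo AND id <= $hi"
  ranges="$ranges$lo-$hi "
  lo=$((hi + 1))
  i=$((i + 1))
done
```

If most rows fell into one of these ranges, that mapper would do most of the work, which is why an evenly distributed split column matters.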

(5) --split-by: the column used to split the work.

(6) -e, --query: takes an SQL statement and imports its result set into HDFS.

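One point needs particular attention when using this argument. The official documentation explains:

If you want to import the results of a query in parallel, then each map task will need to execute a copy of the query, with results partitioned by bounding conditions inferred by Sqoop. Your query must include the token $CONDITIONS which each Sqoop process will replace with a unique condition expression.

In other words, when query results are imported in parallel, each map task runs its own copy of the query, and Sqoop replaces the $CONDITIONS token with a different bounding condition for each task, so that each task fetches only its own partition of the results. A query passed to -e or --query must therefore contain a WHERE clause that includes $CONDITIONS.
Also note: if the query is wrapped in double quotes, $CONDITIONS must be escaped with a backslash, that is \$CONDITIONS, so that the shell does not expand it. Inside single quotes no escaping is needed.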
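
The quoting rule can be checked in the shell without running Sqoop at all. The sketch below uses a made-up query against the psn table (the `age > 20` filter is purely illustrative) and shows what string Sqoop would actually receive in each case.

```shell
#!/bin/sh
# Make sure no shell variable named CONDITIONS interferes with the demo.
unset CONDITIONS

# Single quotes: the shell leaves $CONDITIONS untouched.
q_single='select id, name from psn where age > 20 and $CONDITIONS'

# Double quotes: the dollar sign must be escaped to survive.
q_double="select id, name from psn where age > 20 and \$CONDITIONS"

# Double quotes without escaping: the shell expands $CONDITIONS to an
# empty string, so Sqoop would never see the token.
q_broken="select id, name from psn where age > 20 and $CONDITIONS"

echo "single : $q_single"
echo "double : $q_double"
echo "broken : $q_broken"
```

The first two strings are identical, while the third has lost the token.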

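(7) --table: the source table to read from.

(8) --target-dir: the HDFS directory to import the data into.

(9) --where: a filter condition applied during the import.

Examples:

MySQL --> HDFS

Import three columns (id, name, age) of the MySQL table psn into the HDFS directory /sqoop/data, filtering on "age = 25":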

sqoop import --connect jdbc:mysql://192.168.152.101:3306/mysql --username root --password 123456 --table psn --columns id,name,age --delete-target-dir --target-dir /sqoop/data -m 3 --split-by id --where "age=25"

MySQL --> Hive

Import the result of a query on the psn table into the Hive table t_test, using /sqoop/tmp as the target directory on HDFS:

sqoop import --connect jdbc:mysql://192.168.152.101:3306/mysql --username root --password 123456 --as-textfile \
--query 'select id, name, msg from psn where id like "1%" and $CONDITIONS' --delete-target-dir --target-dir /sqoop/tmp \
-m 1 --hive-home /home/hive-1.2.1 \
--hive-import --create-hive-table --hive-table t_test


2. export

Argument	Description
`--columns <col,col,col…>`	Columns to export to table
`--direct`	Use direct export fast path
`--export-dir <dir>`	HDFS source path for the export
`-m,--num-mappers <n>`	Use n map tasks to export in parallel
`--table <table-name>`	Table to populate
`--call <stored-proc-name>`	Stored Procedure to call
`--update-key <col-name>`	Anchor column to use for updates. Use a comma separated list of columns if there are more than one column.
`--update-mode <mode>`	Specify how updates are performed when new rows are found with non-matching keys in database. Legal values for `mode` include `updateonly` (default) and `allowinsert`.
`--input-null-string <null-string>`	The string to be interpreted as null for string columns
`--input-null-non-string <null-string>`	The string to be interpreted as null for non-string columns
`--staging-table <staging-table-name>`	The table in which data will be staged before being inserted into the destination table.
`--clear-staging-table`	Indicates that any data present in the staging table can be deleted.
`--batch`	Use batch mode for underlying statement execution.

Example:

sqoop export --connect jdbc:mysql://192.168.152.101:3306/mysql --username root --password 123456 \
-m 1 --columns id,name,msg --export-dir /sqoop/data --table h_psn
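
A possible variation on the export above uses the --update-key and --update-mode arguments from the table. This is a sketch under assumptions: it assumes h_psn has a unique id column to match rows on, and it swaps --password for -P, which prompts for the password instead of putting it on the command line. How allowinsert behaves can depend on the database connector, so it is worth verifying against your own database before relying on it.

```shell
#!/bin/sh
# Collect the arguments first so they can be inspected before running.
# Assumption: h_psn has a unique 'id' column usable as the update key.
set -- export \
  --connect jdbc:mysql://192.168.152.101:3306/mysql \
  --username root -P \
  --table h_psn --columns id,name,msg \
  --export-dir /sqoop/data \
  --update-key id --update-mode allowinsert \
  -m 1

arg_count=$#
export_cmd="sqoop $*"
echo "$export_cmd"

# Run the export only if Sqoop is installed on this machine.
if command -v sqoop >/dev/null 2>&1; then
  sqoop "$@"
fi
```

With updateonly (the default) rows that do not match an existing key are skipped, whereas allowinsert asks Sqoop to insert them.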
