hadoop16--sqoop

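Big data collaboration frameworks

In the Hadoop ecosystem, the collaboration frameworks fall mainly into the following four:

sqoop: imports data from relational databases into HDFS, Hive, and HBase, and exports it back
flume: log collection framework, mainly collects the files produced on log servers
oozie: job scheduling framework; many jobs are submitted to YARN, and when and how each one runs has to be scheduled
hue: visualization tool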

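Sqoop features and versions

What Sqoop does

Sqoop is a tool that imports data from a relational database into HDFS, or exports data from HDFS back into a database such as MySQL. Under the hood Sqoop runs MapReduce jobs; it is essentially a wrapper around MR programs. The MR jobs that Sqoop generates are map-only: there is map output but no reduce phase. In that sense Sqoop is a data import/export tool built on top of Hadoop.

Sqoop versions

Sqoop comes in two versions, sqoop1 and sqoop2.

sqoop2

Runs as a client together with a server, similar to the relationship between hiveserver2 and beeline. Because both a client and a server are involved, network IO and network stability come into play, and execution crosses nodes, so it is less stable. Not recommended.

sqoop1

sqoop1 is a pure client-side tool. It is fairly stable and is the recommended choice.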

Installing and deploying Sqoop

Unpack the Sqoop tarball

sudo tar -zxvf sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz -C /opt/app/

Rename the Sqoop environment config file (env)

# 0. Fix the file ownership
sudo chown -R hadoop:hadoop sqoop-1.4.6.bin__hadoop-2.0.4-alpha/

# 1. In conf/, rename sqoop-env-template.sh to sqoop-env.sh
mv sqoop-env-template.sh sqoop-env.sh

Edit the Sqoop environment settings

#Set path to where bin/hadoop is available
export HADOOP_COMMON_HOME=/opt/app/hadoop-2.7.2

#Set path to where hadoop-*-core.jar is available
export HADOOP_MAPRED_HOME=/opt/app/hadoop-2.7.2

#Set the path to where bin/hive is available
export HIVE_HOME=/opt/app/apache-hive-1.2.1-bin

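Sqoop moves data in and out of relational databases, so it needs the JDBC driver jar. Copy the MySQL JDBC jar (here taken from the Hive lib directory) into the lib directory of the Sqoop installation.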

cp apache-hive-1.2.1-bin/lib/mysql-connector-java-5.1.31.jar sqoop-1.4.6.bin__hadoop-2.0.4-alpha/lib/

Test whether the configuration works: run bin/sqoop help in the Sqoop directory to list the common commands.

bin/sqoop help
-----------------
Available commands:
  codegen            Generate code to interact with database records
  create-hive-table  Import a table definition into Hive
  eval               Evaluate a SQL statement and display the results
  export             Export an HDFS directory to a database table
  help               List available commands
  import             Import a table from a database to HDFS
  import-all-tables  Import tables from a database to HDFS
  import-mainframe   Import datasets from a mainframe server to HDFS
  job                Work with saved jobs
  list-databases     List available databases on a server
  list-tables        List available tables in a database
  merge              Merge results of incremental imports
  metastore          Run a standalone Sqoop metastore
  version            Display version information

list-databases: list the databases

Connect to the database: use Sqoop to connect to MySQL and list its databases, which also tests that Sqoop can reach the database.

bin/sqoop list-databases --connect jdbc:mysql://localhost:3306 --username root --password root
----------------------------------------------------------
resultset.
information_schema
metastore
mysql
performance_schema
test

Alternative form: a trailing \ breaks the line and means the command continues on the next line.

bin/sqoop list-databases \
--connect jdbc:mysql://localhost:3306 \
--username root \
--password root
---------------------------
resultset.
information_schema
metastore
mysql
performance_schema
test
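
Passing --password on the command line leaves the password in the shell history and in the process list. Sqoop 1.4.x also accepts -P (prompt on the console) and --password-file. The following is a minimal sketch of the password-file route, not a definitive setup: the file location PW_FILE and the helper names are made up for illustration, the file is assumed to be on the local filesystem (hence the file:// prefix), and the command is only printed rather than executed.

```shell
# Sketch: store the MySQL password in a private file instead of typing it
# on the command line. PW_FILE is a hypothetical location.
PW_FILE="${PW_FILE:-$HOME/.sqoop-mysql.pw}"

write_password_file() {
  rm -f "$PW_FILE"
  # printf writes no trailing newline; a newline in the file could end up
  # being treated as part of the password.
  ( umask 077; printf '%s' "$1" > "$PW_FILE" )
  chmod 400 "$PW_FILE"
}

# Print the command rather than run it, so the sketch works without a cluster.
# The file:// prefix assumes the password file is on the local filesystem.
list_databases_cmd() {
  echo "bin/sqoop list-databases" \
       "--connect jdbc:mysql://localhost:3306" \
       "--username root" \
       "--password-file file://$PW_FILE"
}
```

Calling write_password_file with the password and then list_databases_cmd prints the equivalent of the command above with --password replaced by --password-file.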

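Sqoop import commands

To import data from a relational database into HDFS, Sqoop first has to connect to that database.

Start by creating a table in the relational database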

create table tohdfs(
	id int primary key not null,
	name varchar(20) not null
);
Query OK, 0 rows affected (0.15 sec)

mysql> show tables;
+----------------+
| Tables_in_test |
+----------------+
| tohdfs         |
+----------------+
1 row in set (0.00 sec)

Insert some test rows into the table

mysql> insert into tohdfs(id ,name)values(1,"zhangsan");
Query OK, 1 row affected (0.04 sec)

mysql> insert into tohdfs(id ,name)values(2,"lisi");
Query OK, 1 row affected (0.04 sec)

Write the Sqoop command, from MySQL to HDFS

bin/sqoop import \
--connect jdbc:mysql://localhost:3306/test \
--username root \
--password root \
--table  tohdfs

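When no target directory is given, the import lands under the current user's HDFS home directory, in a directory named after the table; here that is /user/hadoop/tohdfs.

Problems with this import

By default Sqoop runs several map tasks in parallel (4 unless told otherwise) and splits the table on its primary key, so for this tiny table each record ends up being imported by its own map task
The data lands in HDFS at a path we did not choose

Fixes

Import to a chosen path: the --target-dir parameter sets the HDFS directory that the relational data is imported into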

bin/sqoop import \
--connect jdbc:mysql://localhost:3306/test \
--username root \
--password root \
--table  tohdfs \
--target-dir /tohdfs

Import with a chosen number of parallel map tasks: --num-mappers sets how many map tasks run in parallel

bin/sqoop import \
--connect jdbc:mysql://localhost:3306/test \
--username root \
--password root \
--table  tohdfs \
--target-dir /tohdfs2 \
--num-mappers 1
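
To picture what the mapper count does, here is a simplified sketch of dividing an integer key range among N mappers. It is an illustration only and not Sqoop's actual splitter code; the function name split_ranges and the sample range 1..10 are made up.

```shell
# split_ranges MIN MAX N: print up to N contiguous id ranges covering MIN..MAX.
# The body runs in a subshell so its variables do not leak into the caller.
split_ranges() (
  min=$1; max=$2; n=$3
  total=$(( max - min + 1 ))
  lo=$min
  i=0
  while [ "$i" -lt "$n" ]; do
    # Spread the remainder over the first ranges so sizes differ by at most one.
    size=$(( total / n + (i < total % n) ))
    if [ "$size" -gt 0 ]; then
      echo "mapper $i: id >= $lo AND id <= $(( lo + size - 1 ))"
      lo=$(( lo + size ))
    fi
    i=$(( i + 1 ))
  done
)

split_ranges 1 10 4
```

For ids 1 to 10 and 4 mappers this prints the ranges 1-3, 4-6, 7-8 and 9-10. With only two rows (split_ranges 1 2 4) just two ranges come out, one per row, which is why a single mapper is enough for a table this small.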

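Both problems are now solved, but every re-run requires deleting the target directory first, because the import fails if the directory already exists. Sqoop has a parameter that deletes the target directory before importing.

--delete-target-dir: delete the target path if it exists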

bin/sqoop import \
--connect jdbc:mysql://localhost:3306/test \
--username root \
--password root \
--table  tohdfs \
--delete-target-dir \
--target-dir /tohdfs2 \
--num-mappers 1

When Sqoop imports into HDFS, the default field separator is a comma (","). A different separator can be specified at import time.

--fields-terminated-by: sets the field separator character

bin/sqoop import \
--connect jdbc:mysql://localhost:3306/test \
--username root \
--password root \
--table  tohdfs \
--fields-terminated-by '\t' \
--delete-target-dir \
--target-dir /tohdfs2 \
--num-mappers 1

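Sqoop incremental import

As a system runs, it produces new data every day. Since the earlier data has already been imported, the new data can simply be added on top of what was imported before; there is no need to import the old data again.

Incremental import parameters, from bin/sqoop import --help: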

Incremental import arguments:
   --check-column <column>        Source column to check for incremental
                                  change
   --incremental <import-type>    Define an incremental import of type
                                  'append' or 'lastmodified'
   --last-value <value>           Last imported value in the incremental
                                  check column

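The incremental import parameters

--check-column: the column to check, usually an indexed column such as the table's primary key.
--incremental: the type of incremental import
1. append: append new rows
2. lastmodified: based on the last-modified timestamp
--last-value: the value of the check column reached by the previous import; the new import picks up the rows whose check column is greater than this value

Using incremental import: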

bin/sqoop import \
--connect jdbc:mysql://localhost:3306/test \
--username root \
--password root \
--table  tohdfs \
--fields-terminated-by '\t' \
--target-dir /tohdfs2 \
--num-mappers 1 \
--check-column id \
--incremental append \
--last-value 2

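Note that the delete-directory parameter used earlier (--delete-target-dir) conflicts with incremental import, so it must not be added to an incremental import.

Because files in HDFS cannot be modified in place, an append import writes the new rows into new files in the target directory.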
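
Append mode can be pictured as a filter "check column > last value". The sketch below simulates that filter with awk on tab-separated sample rows; it does not call Sqoop. Rows 3 and 4 and the helper names are invented for illustration.

```shell
# rows_after LAST_VALUE: read "id<TAB>name" lines on stdin and keep the rows
# an append-mode import would pick up, i.e. rows whose id is greater than LAST_VALUE.
rows_after() {
  awk -F '\t' -v last="$1" '$1 + 0 > last + 0'
}

# next_last_value: print the largest id seen, i.e. the value to pass as
# --last-value on the following run.
next_last_value() {
  awk -F '\t' 'NR == 1 || $1 + 0 > max { max = $1 + 0 } END { print max }'
}

# Sample table contents: the two rows from this post plus two invented ones.
sample() { printf '1\tzhangsan\n2\tlisi\n3\twangwu\n4\tzhaoliu\n'; }

sample | rows_after 2      # keeps the rows with id 3 and 4
sample | next_last_value   # prints 4
```

Sqoop prints the value to use as the next --last-value at the end of an incremental import, and a saved job keeps track of it, so in practice this bookkeeping does not have to be done by hand.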

sqoop job

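A sqoop job can be thought of as a template: the next time an import or export has to run, running the saved template executes the task. It is like wrapping the task in a function and later calling that function by name.

Using jobs

View the job help

bin/sqoop job --help

The help output lists the following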

Job management arguments:
   --create <job-id>            Create a new saved job
   --delete <job-id>            Delete a saved job
   --exec <job-id>              Run a saved job
   --help                       Print usage instructions
   --list                       List saved jobs
   --meta-connect <jdbc-uri>    Specify JDBC connect string for the
                                metastore
   --show <job-id>              Show the parameters for a saved job
   --verbose                    Print more information while working

Create a job

bin/sqoop job \
--create stu_info \
-- \
import \
--connect jdbc:mysql://localhost:3306/test \
--username root \
--password root \
--table  tohdfs \
--fields-terminated-by '\t' \
--target-dir /tohdfs2 \
--num-mappers 1 \
--check-column id \
--incremental append \
--last-value 2

List the jobs created so far (--list)

bin/sqoop job --list
------------------------
Available jobs:
  stu_info

Show the details of a job

bin/sqoop job --show stu_info
----------------------------
Options:
----------------------------
verbose = false
incremental.last.value = 2
db.connect.string = jdbc:mysql://localhost:3306/test
codegen.output.delimiters.escape = 0
codegen.output.delimiters.enclose.re
...	...
codegen.compile.dir = /tmp/sqoop-hadoop/compile/9ce5792bc31f0ba35d2a3823452d6d7a
direct.import = false
hdfs.target.dir = /tohdfs2
hive.fail.table.exists = false
db.batch = false

Run the job (the MySQL password has to be entered for verification)

bin/sqoop job --exec stu_info

Delete the job

bin/sqoop job --delete stu_info

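Importing MySQL data into Hive

Hive data is itself stored on HDFS, so importing into Hive really means importing the data into HDFS first and then moving it from there into the HDFS path of the Hive table.

Prepare the target database and table in Hive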

create database student;
use student;
create table stu_info(id int , name string) row format delimited fields terminated by '\t';

Import into Hive

bin/sqoop import \
--connect jdbc:mysql://localhost:3306/test \
--username root \
--password root \
--table  tohdfs \
--hive-import \
--hive-database student \
--hive-table stu_info \
--target-dir /tohdfs3 \
--fields-terminated-by '\t' \
--num-mappers 1

Sqoop export

Export takes HDFS as the source and writes the data into a relational database.

Prepare the receiving table in MySQL

create table tomysql(
id int primary key not null,
name varchar(20) not null
);

Look up the parameters with bin/sqoop export --help

Common arguments:
   --connect <jdbc-uri>                         Specify JDBC connect
                                                string
   --connection-manager <class-name>            Specify connection manager
                                                class name
   --connection-param-file <properties-file>    Specify connection
                                                parameters file
   --driver <class-name>                        Manually specify JDBC
                                                driver class to use
   --hadoop-home <hdir>                         Override
                                                $HADOOP_MAPRED_HOME_ARG
   --hadoop-mapred-home <dir>                   Override
                                                $HADOOP_MAPRED_HOME_ARG
   --help                                       Print usage instructions
... ...
   --hive-partition-value <partition-value>         Sets the partition
                                                    value to use when
                                                    importing to hive
   --map-column-hive <arg>                          Override mapping for
                                                    specific column to
                                                    hive types.

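Export the data to MySQL. No delimiter option is needed here: /tohdfs was imported with the default comma separator, which is also what export expects by default.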

bin/sqoop export \
--connect jdbc:mysql://localhost:3306/test \
--username root \
--password root \
--table tomysql \
--export-dir /tohdfs \
--num-mappers 1

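Log analysis

Analyze a website's logs to produce two metrics, pv (page views) and uv (unique visitors), and save the results in MySQL.

Steps

Create a Hive database for the site metrics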

create database log;

Create the source table in that database

use log;
create table log_source(
id              string,
url             string,
referer         string,
keyword         string,
type            string,
guid            string,
pageId          string,
moduleId        string,
linkId          string,
attachedInfo    string,
sessionId       string,
trackerU        string,
trackerType     string,
ip              string,
trackerSrc      string,
cookie          string,
orderCode       string,
trackTime       string,
endUserId       string,
firstLink       string,
sessionViewNo   string,
productId       string,
curMerchantId   string,
provinceId      string,
cityId          string,
fee             string,
edmActivity     string,
edmEmail        string,
edmJobId        string,
ieVersion       string,
platform        string,
internalKeyword string,
resultSum       string,
currentPage     string,
linkPosition    string,
buttonPosition  string
)
row format delimited fields terminated by '\t'
stored as textfile;

Load the data into the source table

load data local inpath '/home/hadoop/data/201808.txt' into table log_source;
load data local inpath '/home/hadoop/data/201809.txt' into table log_source;

Once the data is loaded, create an intermediate (cleaning) table that holds only the fields needed for the analysis

create table log_qingxi
(
 id string ,
 url string,
 guid string,
 day string,
 hour string
)
row format delimited fields terminated by '\t';

Then populate the cleaning table from the source table

insert into table log_qingxi select id,url,guid,substring(trackTime,9,2) day,substring(trackTime,12,2) hour from log_source;
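
Hive's substring takes a 1-based start position and a length. The sketch below reproduces the same character spans with cut, which is also 1-based. The sample trackTime value is an assumption (the post does not show a raw value); the 'yyyy-MM-dd HH:mm:ss' layout was chosen because it is consistent with the day='28' and hour='18' filters used later.

```shell
# Assumed sample value; the source does not show a raw trackTime.
track_time='2015-08-28 18:10:00'

# Characters 9-10, the same span as substring(trackTime, 9, 2)
day=$(printf '%s' "$track_time" | cut -c 9-10)
# Characters 12-13, the same span as substring(trackTime, 12, 2)
hour=$(printf '%s' "$track_time" | cut -c 12-13)

echo "day=$day hour=$hour"
```

Under that assumed layout the extracted day is the day of the month only (28), which is why the later queries filter on day='28' while the partition is labelled with the full date.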

Move the cleaned data into a partitioned staging table for analysis. First create the partitioned table

create table log_part1
(
 id string ,
 url string,
 guid string
)
partitioned by(day string,hour string)
row format delimited fields terminated by '\t';

Load data into the partitioned table

insert into table log_part1 partition(day='20150828',hour='18')
select id,url,guid from log_qingxi where day='28' and hour='18';

insert into table log_part1 partition(day='20150828',hour='19')
select id,url,guid from log_qingxi where day='28' and hour='19';
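
With more hours to load, writing each statement by hand gets tedious. As a rough sketch, the statements can be generated by a small shell loop; the function name gen_partition_inserts is hypothetical, and the output is only printed so that it can be reviewed, saved to a file, and run with hive -f.

```shell
# gen_partition_inserts PARTITION_DAY DAY_OF_MONTH HOUR...
# Print one insert statement per hour, in the same shape as the statements above.
gen_partition_inserts() {
  day_partition=$1; day_of_month=$2; shift 2
  for hour in "$@"; do
    printf "insert into table log_part1 partition(day='%s',hour='%s')\n" "$day_partition" "$hour"
    printf "select id,url,guid from log_qingxi where day='%s' and hour='%s';\n" "$day_of_month" "$hour"
  done
}

gen_partition_inserts 20150828 28 18 19
```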

Compute pv and uv

-- pv
select day,hour,count(url) pv from log_part1 where not url="" group by day,hour ;
 OK
day     hour    pv
20150828        18      35880
20150828        19      33317

-- uv
select day,hour,count(distinct guid) uv from log_part1 group by day,hour;
OK
day     hour    uv
20150828        18      23938
20150828        19      22330

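Create the result table to be exported. Its uv is lower than in the standalone uv query because this query also excludes the rows with an empty url.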

create table result as select day,hour,count(url) pv ,count(distinct guid) uv from log_part1 where not url="" group by day,hour;
OK
result.day      result.hour     result.pv       result.uv
20150828        18      35880   16068
20150828        19      33317   14590
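
To make the two metrics concrete: pv counts the rows with a non-empty url, and uv counts the distinct guid values among those rows. The sketch below computes both with awk over a few invented rows (url and guid, tab-separated); it only illustrates the counting logic and is not a replacement for the Hive query.

```shell
# Input columns (tab-separated): url, guid. The sample rows are invented.
pv_uv() {
  awk -F '\t' '
    $1 != "" {                  # same filter as: where not url=""
      pv++
      if (!($2 in seen)) { seen[$2] = 1; uv++ }
    }
    END { printf "pv=%d uv=%d\n", pv, uv }
  '
}

printf '/home\tg1\n/cart\tg1\n\tg2\n/home\tg3\n' | pv_uv
```

Of the four sample rows, three have a url (pv=3) and they come from two distinct visitors, g1 and g3 (uv=2); g2 is dropped by the empty-url filter, just as in the result table.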

Create the receiving table in MySQL

create table save(
day varchar(30) not null,
hour varchar(30) not null,
pv varchar(30) not null,
uv varchar(30) not null,
primary key(day,hour)
);

Export to MySQL with Sqoop

bin/sqoop export \
--connect jdbc:mysql://hadoop:3306/test \
--username root \
--password root \
--table save \
--export-dir /user/hive/warehouse/log.db/result \
--num-mappers 1 \
--input-fields-terminated-by '\001'
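
The '\001' is needed because the result table was created with "create table ... as select" and no row format clause, so Hive wrote it with its default field delimiter, the non-printing Ctrl-A character (\001). The sketch below builds one such line by hand and makes the delimiter visible by turning it into a comma; the line is reconstructed from the result shown above rather than read from HDFS.

```shell
# One line as Hive would store it for the result table: fields joined by \001.
line=$(printf '20150828\00118\00135880\00116068')

# Replace the invisible delimiter with a comma to show the four fields.
printf '%s\n' "$line" | tr '\001' ','
```

The four fields map, in order, to the day, hour, pv and uv columns of the save table.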

Check the result in MySQL

mysql> select * from save;
+----------+------+-------+-------+
| day      | hour | pv    | uv    |
+----------+------+-------+-------+
| 20150828 | 18   | 35880 | 16068 |
| 20150828 | 19   | 33317 | 14590 |
+----------+------+-------+-------+