Importing Data from MySQL into a Hive Table with Sqoop 1.4.4

Guiding questions:

         1. What do --hive-import and --hive-overwrite do?

         2. How do you handle string fields in the relational database that contain the delimiter characters?

         3. When using --hive-import, what are the default field delimiter and record delimiter?

         4. How are NULL values handled? What do --null-string and --null-non-string do?

         5. What does --hive-partition-key do? What type must the partition value be?

         6. What does --compression-codec do?

         7. When Sqoop imports relational data into Hive, which database and which path does it use by default? How do you target a specific database?

I. How It Works and a Brief Introduction to Key Parameters

        1. The primary function of Sqoop import is to bring data into files in HDFS. But if you have a Hive metastore associated with your Hadoop cluster, Sqoop can also import the data into Hive, generating a CREATE TABLE statement to define the data's layout in Hive. Importing data into Hive is as simple as adding the --hive-import option to your Sqoop command line.
        2. If the Hive table already exists, you can specify --hive-overwrite to indicate that the existing table in Hive must be replaced. After your data is imported into HDFS, or if this step is omitted, Sqoop generates a Hive script containing a CREATE TABLE operation that defines your columns using Hive's types, and a LOAD DATA INPATH statement to move the data files into Hive's warehouse directory.
        3. Although Hive supports escape characters, it does not handle the escaping of newline characters. Hive will have problems with Sqoop-imported data if your rows contain string fields with Hive's default row delimiters (\n or \r) or column delimiter (\01) embedded in them. You can use --hive-drop-import-delims to drop those characters on import and produce Hive-compatible text data, or --hive-delims-replacement to replace them with a user-defined string. These options should only be used if you rely on Hive's default delimiters; do not use them when different delimiters have been specified.
      If you use --hive-import without specifying delimiters, the default field delimiter is ^A and the default record delimiter is \n.
      4. By default, Sqoop imports NULL values as the string null, whereas Hive uses the string \N to denote NULL; as a result, NULL predicates in Hive (such as IS NULL) will not work correctly. You can pass --null-string and --null-non-string on import, or --input-null-string and --input-null-non-string on export, to preserve NULL values. Because Sqoop uses these parameters in generated code, \N must be escaped as \\N, e.g.: $ sqoop import ... --null-string '\\N' --null-non-string '\\N'.
      5. Use --hive-table to change the name of the output table.
      6. Hive can place data into partitions for more efficient queries. You can tell a Sqoop job to import into a particular Hive partition with the --hive-partition-key and --hive-partition-value arguments. The partition value must be a string.
      7. Use --compression-codec, for example followed by "com.hadoop.compression.lzo.LzopCodec", to enable compression. (A sketch combining several of these options appears right after this list.)
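
      To make these options concrete, here is a hedged sketch that combines several of them in one command. The host (mysqlhost), database (mydb), and table names (logs, testdb.logs_hive) are placeholders of mine, not values from this article:

$ sqoop import \
    --connect jdbc:mysql://mysqlhost:3306/mydb \
    --username hive -P \
    --table logs \
    --hive-import \
    --hive-overwrite \
    --hive-table testdb.logs_hive \
    --hive-drop-import-delims \
    --hive-partition-key dt \
    --hive-partition-value '2015-01-18' \
    --null-string '\\N' \
    --null-non-string '\\N'

      Here --hive-drop-import-delims strips any \n, \r, or \01 characters from string fields, the partition pair loads everything into the dt='2015-01-18' partition (note the value is a string), and -P prompts for the password rather than exposing it on the command line.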

II. Starting the Hive Metastore Service

       Installing Hive is not covered here. I am using Hive 0.13.1 on Hadoop 2.2.0.

[hadoopUser@secondmgt ~]$ hive --service metastore
Starting Hive Metastore Server
15/01/18 20:13:52 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
15/01/18 20:13:52 INFO Configuration.deprecation: mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
15/01/18 20:13:52 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
15/01/18 20:13:52 INFO Configuration.deprecation: mapred.min.split.size.per.node is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.node
15/01/18 20:13:52 INFO Configuration.deprecation: mapred.input.dir.recursive is deprecated. Instead, use mapreduce.input.fileinputformat.input.dir.recursive
15/01/18 20:13:52 INFO Configuration.deprecation: mapred.min.split.size.per.rack is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.rack
15/01/18 20:13:52 INFO Configuration.deprecation: mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
15/01/18 20:13:52 INFO Configuration.deprecation: mapred.committer.job.setup.cleanup.needed is deprecated. Instead, use mapreduce.job.committer.setup.cleanup.needed
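
       The metastore service above runs in the foreground and stops when the terminal session ends. A common way to keep it running in the background (my suggestion, not part of the original run) is:

[hadoopUser@secondmgt ~]$ nohup hive --service metastore > metastore.log 2>&1 &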

III. The Source Database

       View the contents of the source table:

mysql> select * from users;
+----+-------------+----------+-----+-----------+------------+-------+------+
| id | username    | password | sex | content   | datetime   | vm_id | isad |
+----+-------------+----------+-----+-----------+------------+-------+------+
| 56 | hua         | sqoop    | 男  | 開通      | 2013-12-02 |     0 |    1 |
| 58 | feng        | 123456   | 男  | 開通      | 2013-11-22 |     0 |    0 |
| 59 | test        | 123456   | 男  | 開通      | 2014-03-05 |    58 |    0 |
| 60 | user1       | 123456   | 男  | 開通      | 2014-06-26 |    66 |    0 |
| 61 | user2       | 123      | 男  | 開通      | 2013-12-13 |    56 |    0 |
| 62 | user3       | 123456   | 男  | 開通      | 2013-12-14 |     0 |    0 |
| 64 | kai.zhou    | 123456   | ?   | ??        | 2014-03-05 |    65 |    0 |
| 65 | test1       | 111      | 男  | 未開通    | NULL       |     0 |    0 |
| 66 | test2       | 111      | 男  | 未開通    | NULL       |     0 |    0 |
| 67 | test3       | 113      | 男  | 未開通    | NULL       |     0 |    0 |
| 68 | sqoopincr01 | 113      | 男  | 未開通    | NULL       |     0 |    0 |
| 69 | sqoopincr02 | 113      | 男  | 未開通    | NULL       |     0 |    0 |
| 70 | sqoopincr03 | 113      | 男  | 未開通    | NULL       |     0 |    0 |
+----+-------------+----------+-----+-----------+------------+-------+------+
13 rows in set (0.00 sec)
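
       For reference, a plausible DDL for this source table, inferred from the data above and from the Hive schema shown later (the exact MySQL column types are my assumption):

mysql> CREATE TABLE users (
    ->     id       INT PRIMARY KEY,
    ->     username VARCHAR(64),
    ->     password VARCHAR(64),
    ->     sex      VARCHAR(8),
    ->     content  VARCHAR(64),
    ->     datetime DATE,
    ->     vm_id    INT,
    ->     isad     INT
    -> );
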
IV. Performing a Basic Import

[hadoopUser@secondmgt ~]$ sqoop import --hive-import --connect jdbc:mysql://secondmgt:3306/spice --username hive  --password hive --table users --hive-table hiveusers
Warning: /usr/lib/hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
15/01/18 20:22:19 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
15/01/18 20:22:19 INFO tool.BaseSqoopTool: Using Hive-specific delimiters for output. You can override
15/01/18 20:22:19 INFO tool.BaseSqoopTool: delimiters with --fields-terminated-by, etc.
15/01/18 20:22:19 WARN tool.BaseSqoopTool: It seems that you've specified at least one of following:
15/01/18 20:22:19 WARN tool.BaseSqoopTool:      --hive-home
15/01/18 20:22:19 WARN tool.BaseSqoopTool:      --hive-overwrite
15/01/18 20:22:19 WARN tool.BaseSqoopTool:      --create-hive-table
15/01/18 20:22:19 WARN tool.BaseSqoopTool:      --hive-table
15/01/18 20:22:19 WARN tool.BaseSqoopTool:      --hive-partition-key
15/01/18 20:22:19 WARN tool.BaseSqoopTool:      --hive-partition-value
15/01/18 20:22:19 WARN tool.BaseSqoopTool:      --map-column-hive
15/01/18 20:22:19 WARN tool.BaseSqoopTool: Without specifying parameter --hive-import. Please note that
15/01/18 20:22:19 WARN tool.BaseSqoopTool: those arguments will not be used in this session. Either
15/01/18 20:22:19 WARN tool.BaseSqoopTool: specify --hive-import to apply them correctly or remove them
15/01/18 20:22:19 WARN tool.BaseSqoopTool: from command line to remove this warning.
15/01/18 20:22:19 INFO tool.BaseSqoopTool: Please note that --hive-home, --hive-partition-key,
15/01/18 20:22:19 INFO tool.BaseSqoopTool:       hive-partition-value and --map-column-hive options are
15/01/18 20:22:19 INFO tool.BaseSqoopTool:       are also valid for HCatalog imports and exports
15/01/18 20:22:19 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
15/01/18 20:22:19 INFO tool.CodeGenTool: Beginning code generation
15/01/18 20:22:19 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `users` AS t LIMIT 1
15/01/18 20:22:19 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `users` AS t LIMIT 1
15/01/18 20:22:19 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /home/hadoopUser/cloud/hadoop/programs/hadoop-2.2.0
Note: /tmp/sqoop-hadoopUser/compile/e5d129e2de5bcdea0f7e1db4beb24461/users.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
15/01/18 20:22:21 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hadoopUser/compile/e5d129e2de5bcdea0f7e1db4beb24461/users.jar
15/01/18 20:22:21 WARN manager.MySQLManager: It looks like you are importing from mysql.
15/01/18 20:22:21 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
15/01/18 20:22:21 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
15/01/18 20:22:21 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
15/01/18 20:22:21 INFO mapreduce.ImportJobBase: Beginning import of users
15/01/18 20:22:21 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hadoopUser/cloud/hadoop/programs/hadoop-2.2.0/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hadoopUser/cloud/hbase/hbase-0.96.2-hadoop2/lib/slf4j-log4j12-1.6.4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
15/01/18 20:22:21 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
15/01/18 20:22:21 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
15/01/18 20:22:22 INFO client.RMProxy: Connecting to ResourceManager at secondmgt/192.168.2.133:8032
15/01/18 20:22:23 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(`id`), MAX(`id`) FROM `users`
15/01/18 20:22:23 INFO mapreduce.JobSubmitter: number of splits:4
15/01/18 20:22:23 INFO Configuration.deprecation: mapred.job.classpath.files is deprecated. Instead, use mapreduce.job.classpath.files
15/01/18 20:22:23 INFO Configuration.deprecation: user.name is deprecated. Instead, use mapreduce.job.user.name
15/01/18 20:22:23 INFO Configuration.deprecation: mapred.cache.files.filesizes is deprecated. Instead, use mapreduce.job.cache.files.filesizes
15/01/18 20:22:23 INFO Configuration.deprecation: mapred.cache.files is deprecated. Instead, use mapreduce.job.cache.files
15/01/18 20:22:23 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
15/01/18 20:22:23 INFO Configuration.deprecation: mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class
15/01/18 20:22:23 INFO Configuration.deprecation: mapreduce.map.class is deprecated. Instead, use mapreduce.job.map.class
15/01/18 20:22:23 INFO Configuration.deprecation: mapred.job.name is deprecated. Instead, use mapreduce.job.name
15/01/18 20:22:23 INFO Configuration.deprecation: mapreduce.inputformat.class is deprecated. Instead, use mapreduce.job.inputformat.class
15/01/18 20:22:23 INFO Configuration.deprecation: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
15/01/18 20:22:23 INFO Configuration.deprecation: mapreduce.outputformat.class is deprecated. Instead, use mapreduce.job.outputformat.class
15/01/18 20:22:23 INFO Configuration.deprecation: mapred.cache.files.timestamps is deprecated. Instead, use mapreduce.job.cache.files.timestamps
15/01/18 20:22:23 INFO Configuration.deprecation: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class
15/01/18 20:22:23 INFO Configuration.deprecation: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
15/01/18 20:22:23 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1421373857783_0022
15/01/18 20:22:23 INFO impl.YarnClientImpl: Submitted application application_1421373857783_0022 to ResourceManager at secondmgt/192.168.2.133:8032
15/01/18 20:22:23 INFO mapreduce.Job: The url to track the job: http://secondmgt:8088/proxy/application_1421373857783_0022/
15/01/18 20:22:23 INFO mapreduce.Job: Running job: job_1421373857783_0022
15/01/18 20:22:36 INFO mapreduce.Job: Job job_1421373857783_0022 running in uber mode : false
15/01/18 20:22:36 INFO mapreduce.Job:  map 0% reduce 0%
15/01/18 20:22:46 INFO mapreduce.Job:  map 25% reduce 0%
15/01/18 20:22:50 INFO mapreduce.Job:  map 75% reduce 0%
15/01/18 20:22:56 INFO mapreduce.Job:  map 100% reduce 0%
15/01/18 20:22:56 INFO mapreduce.Job: Job job_1421373857783_0022 completed successfully
15/01/18 20:22:57 INFO mapreduce.Job: Counters: 27
        File System Counters
                FILE: Number of bytes read=0
                FILE: Number of bytes written=368076
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=401
                HDFS: Number of bytes written=521
                HDFS: Number of read operations=16
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=8
        Job Counters
                Launched map tasks=4
                Other local map tasks=4
                Total time spent by all maps in occupied slots (ms)=165592
                Total time spent by all reduces in occupied slots (ms)=0
        Map-Reduce Framework
                Map input records=13
                Map output records=13
                Input split bytes=401
                Spilled Records=0
                Failed Shuffles=0
                Merged Map outputs=0
                GC time elapsed (ms)=192
                CPU time spent (ms)=10500
                Physical memory (bytes) snapshot=588156928
                Virtual memory (bytes) snapshot=3525619712
                Total committed heap usage (bytes)=337641472
        File Input Format Counters
                Bytes Read=0
        File Output Format Counters
                Bytes Written=521
15/01/18 20:22:57 INFO mapreduce.ImportJobBase: Transferred 521 bytes in 35.2437 seconds (14.7828 bytes/sec)
15/01/18 20:22:57 INFO mapreduce.ImportJobBase: Retrieved 13 records.
15/01/18 20:22:57 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `users` AS t LIMIT 1
15/01/18 20:22:57 WARN hive.TableDefWriter: Column datetime had to be cast to a less precise type in Hive
15/01/18 20:22:57 INFO hive.HiveImport: Loading uploaded data into Hive
15/01/18 20:23:03 INFO hive.HiveImport: 15/01/18 20:23:03 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
15/01/18 20:23:03 INFO hive.HiveImport: 15/01/18 20:23:03 INFO Configuration.deprecation: mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
15/01/18 20:23:03 INFO hive.HiveImport: 15/01/18 20:23:03 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
15/01/18 20:23:03 INFO hive.HiveImport: 15/01/18 20:23:03 INFO Configuration.deprecation: mapred.min.split.size.per.node is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.node
15/01/18 20:23:03 INFO hive.HiveImport: 15/01/18 20:23:03 INFO Configuration.deprecation: mapred.input.dir.recursive is deprecated. Instead, use mapreduce.input.fileinputformat.input.dir.recursive
15/01/18 20:23:03 INFO hive.HiveImport: 15/01/18 20:23:03 INFO Configuration.deprecation: mapred.min.split.size.per.rack is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.rack
15/01/18 20:23:03 INFO hive.HiveImport: 15/01/18 20:23:03 INFO Configuration.deprecation: mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
15/01/18 20:23:03 INFO hive.HiveImport: 15/01/18 20:23:03 INFO Configuration.deprecation: mapred.committer.job.setup.cleanup.needed is deprecated. Instead, use mapreduce.job.committer.setup.cleanup.needed
15/01/18 20:23:03 INFO hive.HiveImport:
15/01/18 20:23:03 INFO hive.HiveImport: Logging initialized using configuration in file:/home/hadoopUser/cloud/hive/apache-hive-0.13.1-bin/conf/hive-log4j.properties
15/01/18 20:23:03 INFO hive.HiveImport: SLF4J: Class path contains multiple SLF4J bindings.
15/01/18 20:23:03 INFO hive.HiveImport: SLF4J: Found binding in [jar:file:/home/hadoopUser/cloud/hadoop/programs/hadoop-2.2.0/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
15/01/18 20:23:03 INFO hive.HiveImport: SLF4J: Found binding in [jar:file:/home/hadoopUser/cloud/hbase/hbase-0.96.2-hadoop2/lib/slf4j-log4j12-1.6.4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
15/01/18 20:23:03 INFO hive.HiveImport: SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
15/01/18 20:23:03 INFO hive.HiveImport: SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
15/01/18 20:23:06 INFO hive.HiveImport: OK
15/01/18 20:23:06 INFO hive.HiveImport: Time taken: 1.542 seconds
15/01/18 20:23:06 INFO hive.HiveImport: Loading data to table default.hiveusers
15/01/18 20:23:06 INFO hive.HiveImport: Table default.hiveusers stats: [numFiles=5, numRows=0, totalSize=521, rawDataSize=0]
15/01/18 20:23:06 INFO hive.HiveImport: OK
15/01/18 20:23:06 INFO hive.HiveImport: Time taken: 0.668 seconds
15/01/18 20:23:06 INFO hive.HiveImport: Hive import complete.
15/01/18 20:23:06 INFO hive.HiveImport: Export directory is empty, removing it.
       The above is the complete run. Internally it executes in three steps: 1. import the data into HDFS (you can find the corresponding directory on HDFS); 2. create a Hive table with the same name; 3. load the HDFS data into the Hive table. If you do not specify a database, the data is imported into the default database; to import into a table in a specific database, use the "database.table" form. The default storage path is the warehouse/<table name>/ directory.
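
       Steps 2 and 3 correspond to the Hive script that Sqoop generates. Conceptually it looks roughly like the following (a sketch, not the literal generated file; the INPATH assumes Sqoop's default staging directory of /user/<user>/<table>):

hive> CREATE TABLE IF NOT EXISTS `hiveusers` ( `id` INT, `username` STRING, `password` STRING, `sex` STRING, `content` STRING, `datetime` STRING, `vm_id` INT, `isad` INT)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001' LINES TERMINATED BY '\n' STORED AS TEXTFILE;
hive> LOAD DATA INPATH '/user/hadoopUser/users' INTO TABLE `hiveusers`;

       You can also confirm step 3 directly on HDFS: with the default database, the files end up under the warehouse directory (the numFiles=5 in the log above is consistent with four part-m files plus the _SUCCESS marker):

[hadoopUser@secondmgt ~]$ hadoop fs -ls /user/hive/warehouse/hiveusers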

      Check the result:

hive> select * from hiveusers;
OK
56      hua     sqoop   男      開通    2013-12-02      0       1
58      feng    123456  男      開通    2013-11-22      0       0
59      test    123456  男      開通    2014-03-05      58      0
60      user1   123456  男      開通    2014-06-26      66      0
61      user2   123     男      開通    2013-12-13      56      0
62      user3   123456  男      開通    2013-12-14      0       0
64      kai.zhou        123456  ?       ??      2014-03-05      65      0
65      test1   111     男      未開通  null    0       0
66      test2   111     男      未開通  null    0       0
67      test3   113     男      未開通  null    0       0
68      sqoopincr01     113     男      未開通  null    0       0
69      sqoopincr02     113     男      未開通  null    0       0
70      sqoopincr03     113     男      未開通  null    0       0
Time taken: 0.46 seconds, Fetched: 13 row(s)

You can view the table's schema with desc (an abbreviation of describe):
hive> desc hiveusers;
OK
id                      int
username                string
password                string
sex                     string
content                 string
datetime                string
vm_id                   int
isad                    int
Time taken: 0.133 seconds, Fetched: 8 row(s)

       From the query results above, you can see that NULL was treated as a string and imported into the Hive table as the literal null, so NULL predicates such as IS NULL will not match these rows.

V. Handling NULL Values

[hadoopUser@secondmgt ~]$ sqoop import  --connect jdbc:mysql://secondmgt:3306/spice --username hive  --password hive --table users --hive-table test.hiveusers --hive-import  --hive-overwrite  --null-string '\\N' --null-non-string '\\N'
        Check the result again:

hive> select * from hiveusers;
OK
56      hua     sqoop   男      開通    2013-12-02      0       1
58      feng    123456  男      開通    2013-11-22      0       0
59      test    123456  男      開通    2014-03-05      58      0
60      user1   123456  男      開通    2014-06-26      66      0
61      user2   123     男      開通    2013-12-13      56      0
62      user3   123456  男      開通    2013-12-14      0       0
64      kai.zhou        123456  ?       ??      2014-03-05      65      0
65      test1   111     男      未開通  NULL    0       0
66      test2   111     男      未開通  NULL    0       0
67      test3   113     男      未開通  NULL    0       0
68      sqoopincr01     113     男      未開通  NULL    0       0
69      sqoopincr02     113     男      未開通  NULL    0       0
70      sqoopincr03     113     男      未開通  NULL    0       0
Time taken: 0.064 seconds, Fetched: 13 row(s)
       The NULL values are now handled correctly.
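
       Because \N is Hive's native representation of NULL, IS NULL predicates should now behave as expected. A quick check (this query is my addition, not part of the original run):

hive> select id, username from hiveusers where datetime IS NULL;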

VI. A Complete Example

      Finally, here is a complete example. I have not verified the compression part; if you are interested, feel free to check whether it works and share your findings with me:

[hadoopUser@secondmgt ~]$ sqoop import  --connect jdbc:mysql://secondmgt:3306/spice --username hive  --password hive --table users --hive-table test.hiveusers --hive-import  --hive-overwrite  --null-string '\\N' --null-non-string '\\N' --input-fields-terminated-by '\t' --input-lines-terminated-by '\n' --compression-codec "com.hadoop.compression.lzo.LzopCodec"
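
      Two notes on this command. First, for an import, output delimiters are normally set with --fields-terminated-by and --lines-terminated-by; the --input-* variants describe how to parse already-delimited input (as in an export), so they may not have the intended effect here. Second, one way to verify the compression, which I have not tested, is to list the table's files and look for the codec's extension (.lzo), assuming the default warehouse location:

[hadoopUser@secondmgt ~]$ hadoop fs -ls /user/hive/warehouse/test.db/hiveusers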


Recommended reading:

        Previous: Exporting a dataset from HDFS into a MySQL table with Sqoop 1.4.4


        Next: Importing data from a MySQL table into an HBase table with Sqoop 1.4.4

