Sqoop Chinese documentation: User Guide, part 4, validation

10.validation

10.1.Purpose

Validate the data copied, whether by import or export, by comparing the row counts of the source and the target after the copy.

10.2.Introduction

There are 3 basic interfaces:

ValidationThreshold - Determines whether the error margin between the source and the target is acceptable: Absolute, PercentageTolerant, etc. The default implementation is AbsoluteValidationThreshold, which ensures that the row counts of the source and the target are the same.

ValidationFailureHandler - Responsible for handling failures: log an error/warning, abort, etc. The default implementation is LogOnFailureHandler, which logs a warning message to the configured logger.

Validator - Drives the validation logic by delegating the decision to ValidationThreshold and the failure handling to ValidationFailureHandler. The default implementation is RowCountValidator, which validates the row counts of the source and the target.
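The division of labour among the three interfaces can be sketched in plain shell. This only illustrates the delegation pattern and is not Sqoop's Java code: the function names absolute_threshold, log_on_failure, and row_count_validator are hypothetical stand-ins for AbsoluteValidationThreshold, LogOnFailureHandler, and RowCountValidator.

```shell
#!/bin/sh
# Hypothetical stand-ins for the three roles; not Sqoop's actual implementation.

# Threshold: succeeds only when source and target row counts are identical.
absolute_threshold() {
    [ "$1" -eq "$2" ]
}

# Failure handler: logs a warning to stderr and leaves further action to the caller.
log_on_failure() {
    echo "WARN: validation failed, source=$1 target=$2" >&2
}

# Validator: delegates the decision to the threshold function ($3)
# and the failure handling to the handler function ($4).
# Usage: row_count_validator SOURCE_ROWS TARGET_ROWS THRESHOLD_FN HANDLER_FN
row_count_validator() {
    if "$3" "$1" "$2"; then
        echo "validation passed"
    else
        "$4" "$1" "$2"
        return 1
    fi
}

row_count_validator 100 100 absolute_threshold log_on_failure
```

Because the validator receives the threshold and the handler as arguments, either can be swapped without touching the validator, which is the same reason the framework keeps the three roles separate.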

10.3.Syntax

$ sqoop import (generic-args) (import-args)
$ sqoop export (generic-args) (export-args)

Validation arguments are part of the import and export arguments.

10.4.Configuration

The validation framework is extensible and pluggable. It comes with default implementations, but the interfaces can be implemented by custom classes, which are passed as command line arguments as described below.

Validator.

Property:         validator
Description:      Driver for validation,
                  must implement org.apache.sqoop.validation.Validator
Supported values: The value has to be a fully qualified class name.
Default value:    org.apache.sqoop.validation.RowCountValidator

Validation Threshold.

Property:         validation-threshold
Description:      Drives the decision based on the validation meeting the
                  threshold or not. Must implement
                  org.apache.sqoop.validation.ValidationThreshold
Supported values: The value has to be a fully qualified class name.
Default value:    org.apache.sqoop.validation.AbsoluteValidationThreshold

Validation Failure Handler.

Property:         validation-failurehandler
Description:      Responsible for handling failures, must implement
                  org.apache.sqoop.validation.ValidationFailureHandler
Supported values: The value has to be a fully qualified class name.
Default value:    org.apache.sqoop.validation.LogOnFailureHandler
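The properties above allow a custom threshold to replace the default one. As a rough idea of the decision a percentage-tolerant threshold might make, here is a minimal shell sketch. It is not Sqoop code: a real replacement would be a Java class implementing org.apache.sqoop.validation.ValidationThreshold, and the function name and its tolerance parameter are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical percentage-tolerant check; the name and the tolerance
# parameter are assumptions, not part of Sqoop.
# Usage: percentage_tolerant_threshold SOURCE_ROWS TARGET_ROWS TOLERANCE_PERCENT
percentage_tolerant_threshold() {
    source_rows=$1
    target_rows=$2
    tolerance=$3
    # Absolute difference between the two row counts.
    if [ "$source_rows" -ge "$target_rows" ]; then
        diff=$((source_rows - target_rows))
    else
        diff=$((target_rows - source_rows))
    fi
    # Integer form of: diff / source_rows <= tolerance / 100
    [ $((diff * 100)) -le $((tolerance * source_rows)) ]
}

if percentage_tolerant_threshold 1000 990 2; then
    echo "within tolerance"
else
    echo "outside tolerance"
fi
```

The comparison is cross-multiplied so that it stays in integer arithmetic and does not divide by zero when the source is empty.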

10.5.Limitations

Validation currently only validates data copied from a single table into HDFS. The current implementation has the following limitations, that is, validation cannot be used with:

  • all-tables option

  • free-form query option

  • Data imported into Hive or HBase

  • table import with --where argument

  • incremental imports
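The limitations above can be turned into a small pre-flight check before adding --validate to a command. The helper below is a hypothetical sketch and not part of Sqoop; it assumes that each limitation maps to the tool name import-all-tables or to one of the flags --query (or -e), --hive-import, --hbase-table, --where, and --incremental, and it only looks at argument names, not their values.

```shell
#!/bin/sh
# Hypothetical pre-flight helper (not part of Sqoop): prints the reason and
# returns 1 when the arguments fall under one of the limitations above.
# Usage: validation_supported TOOL [ARGS...]
validation_supported() {
    if [ "$1" = "import-all-tables" ]; then
        echo "unsupported: all-tables import"
        return 1
    fi
    # Drop the tool name, if any, and scan the remaining arguments.
    if [ "$#" -gt 0 ]; then
        shift
    fi
    for arg in "$@"; do
        case "$arg" in
            --query|-e)
                echo "unsupported: free-form query"; return 1 ;;
            --hive-import|--hbase-table)
                echo "unsupported: Hive or HBase target"; return 1 ;;
            --where)
                echo "unsupported: --where argument"; return 1 ;;
            --incremental)
                echo "unsupported: incremental import"; return 1 ;;
        esac
    done
    echo "supported"
}

validation_supported import --connect jdbc:mysql://db.foo.com/corp \
    --table EMPLOYEES --validate
```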

10.6.Example Invocations

A basic import of a table named EMPLOYEES in the corp database that uses validation to check the row counts:

$ sqoop import --connect jdbc:mysql://db.foo.com/corp  \
    --table EMPLOYEES --validate

A basic export to populate a table named bar with validation enabled:

$ sqoop export --connect jdbc:mysql://db.example.com/foo --table bar  \
    --export-dir /results/bar_data --validate

Another example that sets the validation arguments explicitly (here to the same classes as the defaults):

$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --validate --validator org.apache.sqoop.validation.RowCountValidator \
    --validation-threshold \
          org.apache.sqoop.validation.AbsoluteValidationThreshold \
    --validation-failurehandler \
          org.apache.sqoop.validation.LogOnFailureHandler

11.Saved Jobs

Imports and exports can be performed repeatedly by issuing the same command multiple times. This is an expected scenario, especially when using the incremental import capability.

Sqoop allows you to define saved jobs, which make this process easier. A saved job records the configuration information required to execute a Sqoop command at a later time. The section on the sqoop-job tool describes how to create and work with saved jobs.

By default, job descriptions are saved to a private repository stored in $HOME/.sqoop/. You can configure Sqoop to use a shared metastore instead, which makes saved jobs available to multiple users across a shared cluster. Starting the metastore is covered in the section on the sqoop-metastore tool.

