10. Validation
Validate the data copied, whether by import or export, by comparing the row counts of the source and the target after the copy.
There are 3 basic interfaces:
ValidationThreshold - Determines whether the error margin between the source and the target is acceptable: Absolute, Percentage Tolerant, etc. The default implementation is AbsoluteValidationThreshold, which ensures that the row counts from source and target are the same.
ValidationFailureHandler - Responsible for handling failures: log an error/warning, abort, etc. The default implementation is LogOnFailureHandler, which logs a warning message to the configured logger.
Validator - Drives the validation logic by delegating the decision to ValidationThreshold and delegating failure handling to ValidationFailureHandler. The default implementation is RowCountValidator, which validates the row counts from the source and the target.
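The division of labour between the three interfaces can be sketched as a toy model in shell. This is not Sqoop's Java code: the function names below are hypothetical and only mirror the roles of the default implementations (an absolute threshold, a handler that logs a warning, and a validator that delegates to both).

```shell
# Toy model of the three roles. All function names are hypothetical;
# the real implementations are Java classes in org.apache.sqoop.validation.

# Threshold role: the absolute variant accepts only identical row counts.
absolute_threshold() {
    [ "$1" -eq "$2" ]
}

# Failure handler role: log a warning (here, to stderr) and carry on.
log_on_failure() {
    echo "WARN: validation failed: source=$1 target=$2" >&2
}

# Validator role: delegate the decision to the threshold and
# any failure to the handler.
row_count_validator() {
    if absolute_threshold "$1" "$2"; then
        echo "validation passed"
    else
        log_on_failure "$1" "$2"
        return 1
    fi
}
```

Calling row_count_validator 1000 1000 prints "validation passed", while row_count_validator 1000 998 writes the warning to stderr and returns a non-zero status. Swapping in a different threshold or handler function changes the behaviour without touching the validator, which is the point of keeping the three roles separate.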
$ sqoop import (generic-args) (import-args)
$ sqoop export (generic-args) (export-args)
Validation arguments are part of the import and export arguments.
The validation framework is extensible and pluggable. It comes with default implementations, but you can supply custom implementations of the interfaces by passing their class names as command line arguments, as described below.
Validator.
Property: validator
Description: Driver for validation, must implement org.apache.sqoop.validation.Validator
Supported values: The value has to be a fully qualified class name.
Default value: org.apache.sqoop.validation.RowCountValidator

Validation Threshold.
Property: validation-threshold
Description: Drives the decision based on whether the validation meets the threshold. Must implement org.apache.sqoop.validation.ValidationThreshold
Supported values: The value has to be a fully qualified class name.
Default value: org.apache.sqoop.validation.AbsoluteValidationThreshold

Validation Failure Handler.
Property: validation-failurehandler
Description: Responsible for handling failures, must implement org.apache.sqoop.validation.ValidationFailureHandler
Supported values: The value has to be a fully qualified class name.
Default value: org.apache.sqoop.validation.LogOnFailureHandler
Validation currently only validates data copied from a single table into HDFS. The current implementation cannot be used in the following cases:
all-tables option
free-form query option
Data imported into Hive or HBase
table import with --where argument
incremental imports
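A pre-flight check along these lines could catch an unsupported combination before a job is launched. This is only a sketch: validate_supported is a hypothetical helper, not part of Sqoop, and the mapping from each limitation to the import-all-tables tool and the --query/-e, --hive-import, --hbase-table, --where and --incremental options is an assumption about which arguments trigger each case.

```shell
# Hypothetical helper: scan a Sqoop argument list for a tool or option
# that the limitations above say validation does not cover.
# Assumes options and values are passed as separate words (--where "x > 1").
validate_supported() {
    for arg in "$@"; do
        case "$arg" in
            import-all-tables) echo "unsupported: all-tables";         return 1 ;;
            --query|-e)        echo "unsupported: free-form query";    return 1 ;;
            --hive-import)     echo "unsupported: Hive import";        return 1 ;;
            --hbase-table)     echo "unsupported: HBase import";       return 1 ;;
            --where)           echo "unsupported: --where";            return 1 ;;
            --incremental)     echo "unsupported: incremental import"; return 1 ;;
        esac
    done
    echo "supported"
}
```

For example, validate_supported import --table EMPLOYEES --validate prints "supported", whereas adding --where "id > 10" to the same argument list prints "unsupported: --where" and returns a non-zero status.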
A basic import of a table named EMPLOYEES in the corp database that uses validation to check the row counts:
$ sqoop import --connect jdbc:mysql://db.foo.com/corp \
    --table EMPLOYEES --validate
A basic export to populate a table named bar with validation enabled:
$ sqoop export --connect jdbc:mysql://db.example.com/foo --table bar \
    --export-dir /results/bar_data --validate
Another example that sets the validation arguments explicitly (here to their default values):
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --validate --validator org.apache.sqoop.validation.RowCountValidator \
    --validation-threshold \
        org.apache.sqoop.validation.AbsoluteValidationThreshold \
    --validation-failurehandler \
        org.apache.sqoop.validation.LogOnFailureHandler
Imports and exports can be performed repeatedly by issuing the same command multiple times. This is an expected scenario, especially when using the incremental import capability.
Sqoop allows you to define saved jobs, which make this process easier. A saved job records the configuration information required to execute a Sqoop command at a later time. The section on the sqoop-job tool describes how to create and work with saved jobs.
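As a rough illustration, a saved job for the EMPLOYEES import shown earlier might be created, listed and executed as below. The function save_and_run_job, the SQOOP variable and the job name are hypothetical; only the sqoop job --create, --list and --exec options come from the sqoop-job tool. The sketch runs in dry-run mode by default, so it prints the commands instead of contacting a database.

```shell
# Dry run by default: commands are only printed.
# Set SQOOP=sqoop in the environment to actually run them.
# Relies on sh/bash word splitting of the unquoted $SQOOP.
SQOOP="${SQOOP:-echo sqoop}"

# Hypothetical helper: $1 is the name to give the saved job.
save_and_run_job() {
    # Record the import configuration under the given job name.
    $SQOOP job --create "$1" -- import \
        --connect jdbc:mysql://db.foo.com/corp \
        --table EMPLOYEES --validate
    # Show the saved jobs, then execute the one just created.
    $SQOOP job --list
    $SQOOP job --exec "$1"
}
```

Running save_and_run_job validated-import in dry-run mode prints the three sqoop job commands in order; the connect string and table name are the ones from the import example and would need to match a real database.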
By default, job descriptions are saved to a private repository stored in $HOME/.sqoop/. You can configure Sqoop to use a shared metastore instead, which makes saved jobs available to multiple users across a shared cluster. Starting the metastore is covered by the section on the sqoop-metastore tool.