Preface
DataX is an offline synchronization tool for heterogeneous data sources that is widely used inside Alibaba Group. It provides stable and efficient data synchronization between a wide range of heterogeneous sources, including relational databases (MySQL, Oracle, etc.), HDFS, Hive, MaxCompute (formerly ODPS), HBase, and FTP.
As an offline data synchronization framework, DataX is built on a Framework + plugin architecture: reading from and writing to data sources are abstracted into Reader/Writer plugins that plug into the synchronization framework. A fairly complete plugin ecosystem already exists; mainstream RDBMS databases, NoSQL stores, and big-data computing systems are all supported.
Design Philosophy
To solve the problem of synchronizing heterogeneous data sources, DataX turns the complex mesh of point-to-point synchronization links into a star-shaped topology, with DataX acting as the central transport hub connecting the data sources. When a new data source needs to be supported, it only has to be connected to DataX, after which it can synchronize seamlessly with every data source that is already supported.
Figure 1: DataX star-shaped data link (image omitted)
Framework Design
As an offline data synchronization framework, DataX is built on a Framework + plugin architecture: reading from and writing to data sources are abstracted into Reader/Writer plugins that plug into the synchronization framework.
- Reader: the data collection module; it reads data from the source and hands it to the Framework.
- Writer: the data writing module; it continuously fetches data from the Framework and writes it to the destination.
- Framework: connects the Reader and the Writer, serves as the data transfer channel between them, and handles the core concerns of buffering, flow control, concurrency, and data conversion.
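In configuration terms, the Reader and Writer pair up inside a job's content element, while Framework-level behaviour (speed, error tolerance, and so on) is configured under setting. A minimal skeleton of a job file might look like the following; the plugin names are taken from the complete examples later in this article, and the empty parameter blocks are placeholders to be filled in:

{
    "job": {
        "content": [
            {
                "reader": { "name": "mysqlreader", "parameter": { } },
                "writer": { "name": "hdfswriter", "parameter": { } }
            }
        ],
        "setting": {
            "speed": { "channel": 1 }
        }
    }
}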
Advantages
1. Reliable data quality monitoring (data is transferred to the destination complete and without loss).
2. Rich data transformation capabilities.
3. Precise speed control: DataX 3.0 provides three flow-control modes, by channel (concurrency), by record rate, and by byte rate, so a job can be tuned to the best synchronization speed the databases involved can sustain.
4. Strong synchronization performance: every reader plugin has one or more splitting strategies that divide a job into multiple Tasks executed in parallel, and the single-machine, multi-threaded execution model lets throughput grow roughly linearly with concurrency.
5. Robust fault tolerance (multi-level local and global retries).
6. Minimal usage effort: download and run, with detailed log output.
JSON job configurations (following the official documentation)
json(mysql ==> mysql):
{
"job": {
"content": [
{
"reader": {
"name": "mysqlreader",
"parameter": {
"username": "root",
"password": "123456",
"column": ["name","age"],
"where": "age<100",
"connection": [
{
"table": [
"person"
],
"jdbcUrl": [
"jdbc:mysql://127.0.0.1:3306/test?characterEncoding=utf8"
]
}
]
}
},
"writer": {
"name": "mysqlwriter",
"parameter": {
"username": "root",
"password": "123456",
"column": ["name","age_true"],
"connection": [
{
"table": [
"person"
],
"jdbcUrl":"jdbc:mysql://127.0.0.1:3306/test?characterEncoding=utf8"
}
]
}
}
}
],
"setting": {
"speed": {
"channel": 1,
"byte": 104857600
},
"errorLimit": {
"record": 10,
"percentage": 0.05
}
}
    }
}
Components:
A job configuration consists of three parts: the reader, the writer, and the common settings.
Reader section
Writer section
setting section
job.setting.speed (flow control)
A job supports user-defined speed control: the channel value sets the number of concurrent channels used during synchronization, and the byte value caps the transfer rate.
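For example, the snippet below (the same values as in the mysql ==> mysql job above) runs the job on a single channel and caps the transfer at 104857600 bytes (100 MiB) per second; a record value can be added alongside them to limit the record rate, as the mysql ==> hdfs job below does:

"setting": {
    "speed": {
        "channel": 1,
        "byte": 104857600
    }
}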
job.setting.errorLimit (dirty data control)
A job supports user-defined monitoring and alerting thresholds for dirty data, expressed as a maximum dirty-record count (record) and/or a maximum dirty-record ratio (percentage). If the dirty data produced during the transfer exceeds the configured count or percentage, the DataX job fails with an error and exits.
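Using the values from the mysql ==> mysql job above, the snippet below tolerates at most 10 dirty records or a dirty-record ratio of 5% before the job is aborted:

"errorLimit": {
    "record": 10,
    "percentage": 0.05
}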
json(mysql ==> hdfs):
{
"job": {
"setting": {
"speed": {
"channel": 10,
"byte": 1000000,
"record": 100000
}
},
"content": [
{
"reader": {
"name": "mysqlreader",
"parameter": {
"username": "root",
"password": "root",
"connection": [
{
"querySql": [
"select db_id,on_line_flag from xxxx where db_id < 10;"
],
"jdbcUrl": [
"jdbc:mysql://127.0.0.1:3306/database"
]
}
                ]
}
},
"writer": {
"name": "hdfswriter",
"parameter": {
"defaultFS": "hdfs://shdc/",
"fileType": "text",
"path": "/user/hive/warehouse/ods_db.db/ods_zbt_ots_coin_account_record/dt=20191126/",
"fileName": "ods_zbt_ots_coin_account_record",
"column": [
{
"name": "id",
"type": "bigint"
},
{
"name": "userId",
"type": "bigint"
},
{
"name": "createTime",
"type": "bigint"
},
{
"name": "channel",
"type": "STRING"
},
{
"name": "coinType",
"type": "bigint"
},
{
"name": "uuid",
"type": "STRING"
},
{
"name": "bizCode",
"type": "string"
},
{
"name": "coinChangeNum",
"type": "bigint"
},
{
"name": "source",
"type": "STRING"
},
{
"name": "coinLeftNum",
"type": "bigint"
},
{
"name": "ext",
"type": "STRING"
},
{
"name": "updateTime",
"type": "bigint"
}
],
"writeMode": "append",
"fieldDelimiter": "\t",
"hadoopConfig": {
"dfs.nameservices": "shdc",
"dfs.ha.namenodes.shdc": "nn1,nn2",
"dfs.namenode.rpc-address.shdc.nn1": "dc-nn-01:8020",
"dfs.namenode.rpc-address.shdc.nn2": "dc-nn-02:8020",
"dfs.client.failover.proxy.provider.shdc": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
}
}
}
}]
}
}
json(hdfs ==> mysql):
{
"job": {
"setting": {
"speed": {
"channel": 3
}
},
"content": [{
"reader": {
"name": "hdfsreader",
"parameter": {
"path": "/user/hive/warehouse/ods_db.db/rpt_zbt_act_coins/dt=${vardate}/*",
"defaultFS": "hdfs://xxxx/",
"column": ["*"],
"fileType": "text",
"encoding": "UTF-8",
"fieldDelimiter": "\t",
"nullFormat": "\\N",
"hadoopConfig": {
"dfs.nameservices": "xxxx",
"dfs.ha.namenodes.xxxx": "nn1,nn2",
"dfs.namenode.rpc-address.xxxx.nn1": "dc-nn-01:8020",
"dfs.namenode.rpc-address.xxxx.nn2": "dc-nn-02:8020",
"dfs.client.failover.proxy.provider.xxxx": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
            }
}
},
"writer": {
"name": "mysqlwriter",
"parameter": {
"writeMode": "insert",
"username": "name",
"password": "password",
"batchSize":"1024",
"column": ["logdt","act_code","act_name","coins","uv"],
"session": [],
"connection": [{
"table": ["xxxx"],
"jdbcUrl": "jdbc:mysql://xxx.xxx.xxx.xxx:3306/xx"
}]
}
}
}]
}
}
json (hive => hbase):
{
"job": {
"setting": {
"speed": {
"channel": 5
}
},
"content": [
{
"reader": {
"name": "hdfsreader",
"parameter": {
"path": "/user/hive/warehouse/tmp.db/zbt_open/*",
"defaultFS": "hdfs://shdct",
"column": [
{
"index": 0,
"type": "String"
},
{
"index": 1,
"type": "String"
},
{
"index": 2,
"type": "String"
}
],
"fileType": "text",
"encoding": "UTF-8",
"fieldDelimiter": "\t",
"nullFormat": "\\N",
"hadoopConfig": {
"dfs.nameservices": "shdct",
"dfs.ha.namenodes.shdct": "nn1,nn2",
"dfs.namenode.rpc-address.shdct.nn1": "test-dc-nn-01:8020",
"dfs.namenode.rpc-address.shdct.nn2": "test-dc-nn-02:8020",
"dfs.client.failover.proxy.provider.shdct": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
            }
}
},
"writer": {
"name": "hbase11xwriter",
"parameter": {
"hbaseConfig": {
"hbase.rootdir": "hdfs://shdct/hbase",
"hbase.cluster.distributed": "true",
"hbase.zookeeper.quorum": "test-dc-dn-01:2181,test-dc-dn-02:2181,test-dc-dn-03:2181"
},
"table": "zbt_open",
"mode": "normal",
"rowkeyColumn": [
{
"index":1,
"type":"string"
}
],
"column": [
{
"index":0,
"name": "data:uid",
"type": "string"
},
{
"index":1,
"name": "data:device",
"type": "string"
},
{
"index":2,
"name": "data:qid",
"type": "string"
}
],
"encoding": "utf-8"
}
}
}
]
}
}
(Note: hive 3.1.2, hbase 2.1.7, hadoop 3.1.2, DataX)
Launch:
Example:
python /home/hadoop/datax/bin/datax.py /home/hadoop/Jerry/job.json/test.json
Variables such as ${vardate} referenced in the job files above have to be supplied when the job is submitted; datax.py's -p option (for example -p"-Dvardate=20191126") is the usual way to pass them in.