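Gravity: A MySQL Database Synchronization Tool

Original repository: https://github.com/moiot/gravity

Mirror: https://gitee.com/yunwisdoms/gravity

Architecture overview

Single-process architecture

Single-process Gravity uses a plugin-based microkernel architecture. The plugins are built around the system's core.Msg structure and together implement the whole flow from input to output.

Each plugin has its own independent configuration options.

As shown in the architecture diagram, the system is made up of the following plugins:

Input adapts various data sources, such as the MySQL binlog, and generates core.Msg
Filter transforms the data stream produced by Input, for example by filtering out certain data, renaming columns, or encrypting columns
Output writes data to a target such as Kafka or MySQL, using the routing rules defined by the Router
Scheduler schedules the data stream produced by Input and uses Output to write it to the target; the Scheduler defines the consistency guarantees the system supports (the default Scheduler preserves the order of changes to the same row)
Matcher matches the data produced by Input; Filter and Router use Matchers to select data

Developers can implement any of the plugin types above to meet specific requirements.

core.Msg is defined as follows: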

type DDLMsg struct {
	Statement string
}

type DMLMsg struct {
	Operation DMLOp
	Data      map[string]interface{}
	Old       map[string]interface{}
	Pks       map[string]interface{}
	PkColumns []string
}

type Msg struct {
	Type      MsgType
	Host      string
	Database  string
	Table     string
	Timestamp time.Time

	DdlMsg *DDLMsg
	DmlMsg *DMLMsg
	...
}
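
To make the role of core.Msg more concrete, the following is a minimal sketch of the kind of transformation a Filter performs on a DML message, here renaming columns. It assumes a trimmed local copy of DMLMsg so that it compiles on its own; renameColumns is a hypothetical helper written for this example and is not part of Gravity's plugin API.

```go
package main

import "fmt"

// DMLMsg is a trimmed local copy of the struct shown above, so that the
// example compiles on its own. It is not imported from Gravity.
type DMLMsg struct {
	Data map[string]interface{}
	Old  map[string]interface{}
}

// renameColumns is a hypothetical helper: it renames from[i] to to[i] in
// both the new row image (Data) and the old row image (Old).
func renameColumns(m *DMLMsg, from, to []string) error {
	if len(from) != len(to) {
		return fmt.Errorf("from and to must have the same length: %d != %d", len(from), len(to))
	}
	for i, oldName := range from {
		if oldName == to[i] {
			continue // nothing to rename
		}
		for _, row := range []map[string]interface{}{m.Data, m.Old} {
			if v, ok := row[oldName]; ok {
				row[to[i]] = v
				delete(row, oldName)
			}
		}
	}
	return nil
}

func main() {
	msg := &DMLMsg{
		Data: map[string]interface{}{"a": 1, "b": "x"},
		Old:  map[string]interface{}{"a": 0},
	}
	if err := renameColumns(msg, []string{"a", "b"}, []string{"c", "d"}); err != nil {
		panic(err)
	}
	fmt.Println(msg.Data, msg.Old)
}
```

A real Filter would also consult a Matcher first, so that only messages for the configured schema and table are modified.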

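Cluster architecture

The cluster version of Gravity natively supports cluster deployment on Kubernetes; see the project's deployment documentation for details.

The cluster version of Gravity provides a REST API for creating data synchronization tasks and reporting status. It ships with a web interface (Gravity Admin) for managing the tasks.

The following uses synchronization and data subscription on a local MySQL instance to show how to use Gravity.

Preparing the MySQL environment

Prepare the MySQL environment as described in the section on preparing the MySQL environment.

Create the tables to be synchronized on the MySQL source and target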

CREATE TABLE `test`.`test_source_table` (
  `id` int(11),
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

CREATE TABLE `test`.`test_target_table` (
  `id` int(11),
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

Build (optional; you can use Docker directly)

First, set up a Go environment and build:

git clone https://github.com/moiot/gravity.git

cd gravity && make

Gravity uses Go modules. Users in mainland China are advised to set export GOPROXY=https://goproxy.io or another proxy. If you cloned the code under GOPATH, you also need to set export GO111MODULE=on.

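MySQL to MySQL synchronization

Create the following configuration file, config.toml:

# name is required
name = "mysql2mysqlDemo"

# Name of the internal database used to store positions, heartbeats, and so on; defaults to _gravity
internal-db-name = "_gravity"

#
# Input plugin definition; mysql is used here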
#
[input]
type = "mysql"
mode = "stream"
[input.config.source]
host = "127.0.0.1"
username = "root"
password = ""
port = 3306

#
# Output plugin definition; mysql is used here
#
[output]
type = "mysql"
[output.config.target]
host = "127.0.0.1"
username = "root"
password = ""
port = 3306

# Routing rule definition
[[output.config.routes]]
match-schema = "test"
match-table = "test_source_table"
target-schema = "test"
target-table = "test_target_table"

MySQL to Kafka

Create the following configuration file, config.toml:

name = "mysql2kafkaDemo"

#
# Input plugin definition; mysql is used here
#
[input]
type = "mysql"
mode = "stream"
[input.config.source]
host = "127.0.0.1"
username = "root"
password = ""
port = 3306

#
# Output plugin definition; async-kafka is used here
#
[output]
type = "async-kafka"
[output.config.kafka-global-config]
broker-addrs = ["127.0.0.1:9092"]
mode = "async"

# Kafka routing definition
[[output.config.routes]]
match-schema = "test"
match-table = "test_source_table"
dml-topic = "test"

Starting Gravity

From the compiled binary

bin/gravity -config config.toml

From Docker

docker run -v ${PWD}/config.toml:/etc/gravity/config.toml -d --net=host moiot/gravity:latest

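Monitoring

Gravity uses Prometheus and Grafana for monitoring. It exposes the standard Prometheus metrics scraping path /metrics on its HTTP port (8080 by default). Grafana dashboards ready for import are provided under deploy/grafana in the source tree.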

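Configuration file

Single-process Gravity is configured through a configuration file.

Gravity follows a plugin-based microkernel model, and each plugin has its own independent configuration. Gravity currently supports configuration files in TOML and JSON format.

For convenience, this section uses the TOML format throughout to describe the configuration rules.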

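REST API

A Gravity cluster deployment uses a REST API to start tasks. The REST API payload has the same structure as the JSON form of the configuration file.

Cluster deployments provide a web interface for configuration, so this section does not describe the individual REST API options; refer to the description of the TOML configuration file instead.

A configuration file must provide at least the Input and Output configuration.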

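Input configuration

The following Input plugins are currently supported:

mysql uses MySQL as the input source and supports full, incremental, and full + incremental modes
mongo uses the MongoDB oplog as the input source; only incremental mode is currently supported

mysql incremental mode

Preparing the MySQL environment

The mysql plugin requires the following of the source MySQL:

Binlog enabled in GTID mode
A gravity account with replication-related privileges and all privileges on the _gravity database
The corresponding tables already created on the MySQL source and target

The MySQL configuration options are as follows: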

[mysqld]
server_id=4
log_bin=mysql-bin
enforce-gtid-consistency=ON
gtid-mode=ON
binlog_format=ROW

The privileges of the gravity account are as follows:

CREATE USER _gravity IDENTIFIED BY 'xxx';
GRANT SELECT, RELOAD, LOCK TABLES, REPLICATION SLAVE, REPLICATION CLIENT, CREATE, INSERT, UPDATE, DELETE ON *.* TO '_gravity'@'%';
GRANT ALL PRIVILEGES ON _gravity.* TO '_gravity'@'%';

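mysql incremental configuration file

[input]
type = "mysql"
mode = "stream"

[input.config]
# Whether to ignore internal data produced by bidirectional synchronization; defaults to false
ignore-bidirectional-data = false

#
# Connection settings for the source MySQL
# - required
#
[input.config.source]
host = "127.0.0.1"
username = "_gravity"
password = ""
port = 3306
max-open = 20 # optional, maximum number of connections
max-idle = 20 # optional, maximum number of idle connections; recommended to match max-open

#
# Starting position for incremental synchronization.
# - empty by default, meaning synchronization starts from the current GTID position
# - optional
#
[input.config.start-position]
binlog-gtid = "abcd:1-123,egbws:1-234"

#
# Special settings for heartbeat probing of the source MySQL. If the heartbeat probe (write path)
# differs from [input.config.source], configure it here.
# - not configured by default
# - optional
#
[input.config.source-probe-config]
annotation = "/*some_annotation*/"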
[input.config.source-probe-config.mysql]
host = "127.0.0.1"
username = "_gravity"
password = ""
port = 3306

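mysql full mode

If a table has no primary key and no unique index, and its total row count is greater than max-full-dump-count, Gravity reports an error.

You can use ignore-tables to skip such tables.

[input]
type = "mysql"
mode = "batch"
#
# Connection settings for the source MySQL
# - required
#
[input.config.source]
host = "127.0.0.1"
username = "_gravity"
password = ""
port = 3306

#
# Settings for a replica of the source MySQL
# If configured, data is scanned from the replica in preference to the primary
# - not configured by default
#
[input.config.source-slave]
host = "127.0.0.1"
username = "_gravity"
password = ""
port = 3306

#
# Tables to scan
# - required
[[input.config.table-configs]]
schema = "test_1"
table = "test_source_*"

[[input.config.table-configs]]
schema = "test_2"
table = "test_source_*"
# - optional
# Name of the column to scan on. If not specified, the system automatically detects a unique key to use as the scan column.
# Check this setting carefully and make sure the column has a unique index.
scan-column = "id"

# ignore-tables defines the tables to skip during scanning

# Listing a table under ignore-tables avoids the error described above
[[input.config.ignore-tables]]
schema = "test_1"
table = "test_source_1"


[input.config]
# Overall number of concurrent scanner threads
# - defaults to 10, meaning at most 10 tables are scanned at the same time
# - optional
nr-scanner = 10

# Number of rows fetched per scan
# - defaults to 10000, meaning 10000 rows are fetched at a time
# - optional
table-scan-batch = 10000

# Global limit on the number of batches allowed per second
# - defaults to 1
# - optional
#
batch-per-second-limit = 1

# Global limit: when no single-column primary key or unique index is found, the maximum number of rows a table
# may have to be read with a full table scan; larger tables cause an error and exit.
# - defaults to 100,000
# - optional
#
max-full-dump-count = 10000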

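With the default configuration above, at most 10 concurrent threads scan the source database and each thread fetches 10000 rows at a time. In addition, the system as a whole scans no more than 1 batch per second, that is, no more than 10000 rows per second.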
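
The arithmetic behind this limit can be sketched as follows. The function names are hypothetical, and the duration is only a lower bound that ignores query latency and write speed on the target; it is meant as a rough planning aid, not a description of Gravity's internals.

```go
package main

import "fmt"

// maxRowsPerSecond is the upper bound on rows read per second when every
// batch holds tableScanBatch rows and at most batchPerSecondLimit batches
// are allowed per second across all scanner threads.
func maxRowsPerSecond(tableScanBatch, batchPerSecondLimit int) int {
	return tableScanBatch * batchPerSecondLimit
}

// minDumpSeconds estimates the shortest possible time, in seconds, needed
// to scan totalRows under the global limit (rounded up).
func minDumpSeconds(totalRows, tableScanBatch, batchPerSecondLimit int) int {
	rate := maxRowsPerSecond(tableScanBatch, batchPerSecondLimit)
	return (totalRows + rate - 1) / rate
}

func main() {
	fmt.Println(maxRowsPerSecond(10000, 1))        // 10000
	fmt.Println(minDumpSeconds(1000000, 10000, 1)) // 100
}
```

Note that nr-scanner does not appear in the bound: the per-second limit is global, so adding scanner threads does not raise the ceiling.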

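mysql full + incremental

[input]
type = "mysql"
mode = "replication"

The remaining settings are the same as for full mode and incremental mode respectively. The system first saves the starting position and then runs the full synchronization. If a table has not been created yet, its schema is created automatically. When the full synchronization completes, incremental synchronization starts automatically from the saved position.

mongo incremental

#
# Source Mongo connection settings
# - required
#
[input]
type = "mongo"
mode = "stream"

#
# Starting point in the source Mongo oplog; if not set, synchronization starts from the latest oplog entry
# - empty by default
# - optional
#
[input.config]
start-position = 123456

[input.config.source]
host = "127.0.0.1"
port = 27017
username = ""
password = ""

#
# Concurrency settings for the source Mongo oplog
# - defaults are false, 50, 512, and "750ms" respectively
# - optional (to be deprecated)
[input.config.gtm-config]
use-buffer-duration = false
buffer-size = 50
channel-size = 512
buffer-duration-ms = "750ms"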

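Output configuration

The following Output plugins are currently supported:

async-kafka sends Input messages to Kafka asynchronously
mysql writes to MySQL
elasticsearch stores and indexes data in Elasticsearch

The configuration options of each plugin are explained in turn below.

async-kafka

async-kafka guarantees that changes to the same unique key are sent in order to a single partition, but it cannot guarantee ordering when the unique key itself changes.

For example: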

CREATE TABLE IF NOT EXISTS test (
  id int,
  v int,
  PRIMARY KEY (id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

INSERT INTO test (id, v) VALUES (1, 1); -- (1, 1)
UPDATE test set v = 2 WHERE id = 1; -- (1, 1) --> (1, 2)
UPDATE test set id = 2 WHERE id = 1; -- (1, 2) --> (2, 2)

In this case there is no guarantee that (1, 1) --> (1, 2) is delivered before (1, 2) --> (2, 2). The two events may be sent to different partitions.
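
The effect can be illustrated with a small sketch of key-based partitioning. The FNV-1a hash, the partition count, and the choice of which key each event is hashed on are all assumptions made for this example; Gravity's actual partitioner may work differently.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// partitionFor maps a primary-key value to one of n partitions by hashing
// its string form. FNV-1a is an assumption made for this sketch.
func partitionFor(pk interface{}, n uint32) uint32 {
	h := fnv.New32a()
	fmt.Fprintf(h, "%v", pk)
	return h.Sum32() % n
}

func main() {
	const partitions = 8
	// The two UPDATE events from the example above, assuming the first is
	// keyed by id=1 and the second by its new key id=2.
	fmt.Println("UPDATE v=2  keyed by id=1 -> partition", partitionFor(1, partitions))
	fmt.Println("UPDATE id=2 keyed by id=2 -> partition", partitionFor(2, partitions))
}
```

Because Kafka only orders messages within a single partition, a consumer reading both partitions may see the second update before the first.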

#
# Connection settings for the target Kafka
# - required
#
[output]
type = "async-kafka"

#
# Encoding rules for the target: output format and schema version
# - optional
[output.config]
# defaults to json
output-format = "json"
# defaults to version 0.1
schema-version = "0.1"

[output.config.kafka-global-config]
# - required
broker-addrs = ["localhost:9092"]
mode = "async"

# SASL settings for the target Kafka
# - optional
[output.config.kafka-global-config.net.sasl]
enable = false
user = ""
password = ""

#
# Routing settings for the target Kafka
# - required
#
[[output.config.routes]]
match-schema = "test"
match-table = "test_table"
dml-topic = "test.test_table"

The DML JSON written to Kafka has the following format:

{
   "version": "2.0",
   "database":"test",
   "table":"e",
   "type":"update",
   "data":{
      "id":1,
      "m":5.444,
      "c":"2016-10-21 05:33:54.631000+08:00",
      "comment":"I am a creature of light."
   },
   "old":{
      "m":4.2341,
      "c":"2016-10-21 05:33:37.523000"
   }
}

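Here, type is the operation type (insert, update, delete, or ddl); data is the current content of the row; old is the previous content of the row (only set for update).

Time-typed fields are output as strings in RFC 3339 format.

The DDL JSON written to Kafka has the following format: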

{
   "version": "2.0",
   "database":"test",
   "table":"t",
   "type":"ddl",
   "statement": " alter table test.t add column v int"
}
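
On the consumer side, both message shapes can be decoded into one struct. The sketch below uses only the standard library; the Event type and the describe function are names chosen for this example and are not types exported by Gravity.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Event mirrors the JSON fields shown above. The struct is an assumption
// made for this sketch.
type Event struct {
	Version   string                 `json:"version"`
	Database  string                 `json:"database"`
	Table     string                 `json:"table"`
	Type      string                 `json:"type"`
	Data      map[string]interface{} `json:"data,omitempty"`
	Old       map[string]interface{} `json:"old,omitempty"`
	Statement string                 `json:"statement,omitempty"`
}

// describe decodes one message payload and returns a short summary.
func describe(payload []byte) (string, error) {
	var e Event
	if err := json.Unmarshal(payload, &e); err != nil {
		return "", err
	}
	if e.Type == "ddl" {
		return fmt.Sprintf("DDL on %s.%s: %s", e.Database, e.Table, e.Statement), nil
	}
	return fmt.Sprintf("%s on %s.%s, %d column(s) in data, %d in old",
		e.Type, e.Database, e.Table, len(e.Data), len(e.Old)), nil
}

func main() {
	dml := []byte(`{"version":"2.0","database":"test","table":"e","type":"update",
		"data":{"id":1,"m":5.444},"old":{"m":4.2341}}`)
	ddl := []byte(`{"version":"2.0","database":"test","table":"t","type":"ddl",
		"statement":"alter table test.t add column v int"}`)
	for _, p := range [][]byte{dml, ddl} {
		s, err := describe(p)
		if err != nil {
			panic(err)
		}
		fmt.Println(s)
	}
}
```

Branching on the type field first keeps DDL handling separate from row changes, since DDL messages carry a statement instead of data and old.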

mysql

#
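# Connection settings for the target MySQL
# - required
#
[output]
type = "mysql"

[output.config]
enable-ddl = true # create and alter table statements are currently supported; schema and table names are adjusted according to the routing rules

[output.config.target]
host = "127.0.0.1"
username = ""
password = ""
port = 3306
max-open = 20 # optional, maximum number of connections
max-idle = 20 # optional, maximum number of idle connections; recommended to match max-open

#
# Routing settings for the target MySQL; match-schema and match-table support * wildcards
# - required
#
[[output.config.routes]]
match-schema = "test"
match-table = "test_source_table_*"
target-schema = "test"
target-table = "test_target_table"

#
# MySQL execution engine settings
# - optional
#
[output.config.sql-engine-config]
type = "mysql-replace-engine"

[output.config.sql-engine-config.config]
tag-internal-txn = false

In the configuration above, if you set

[output.config.sql-engine-config.config]
# Enable writing the bidirectional synchronization marker
tag-internal-txn = true

Gravity tags its writes to the target MySQL with an internal bidirectional synchronization marker (by wrapping each write in a transaction that also touches an internal table). With ignore-bidirectional-data configured on the source side, Gravity's own write traffic can then be ignored.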

Elasticsearch

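Important:

This plugin is still in beta
Only Elasticsearch 6.x is currently supported

[output]
type = "elasticsearch"

#
# Target settings
# - optional
#
[output.config]
# Ignore 400 (bad request) responses
# Elasticsearch returns a 400 error when an index name is invalid or a request cannot be parsed
# Defaults to false: a failure raises an error that must be handled manually. Set to true to ignore such requests
ignore-bad-request = true

#
# Target Elasticsearch settings
# - required
#
[output.config.server]
# Elasticsearch addresses to connect to, required
urls = ["http://127.0.0.1:9200"]
# Whether to sniff nodes, defaults to false
sniff = false
# Timeout, defaults to 1000ms
timeout = 500

#
# Target authentication settings
# - optional
#
[output.config.server.auth]
username = ""
password = ""

#
# Target routing settings
# - required
#
[[output.config.routes]]
match-schema = "test"
match-table = "test_source_table_*"
target-index = "test-index" # defaults to the table name of each DML message
target-type = "_doc" # defaults to 'doc'
# Whether to ignore tables without a primary key
# The output uses the primary key as the document id, so synchronized tables must have a primary key
# Defaults to false: a DML message without a primary key raises an error. Set to true to ignore such messages
ignore-no-primary-key = false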

EsModel

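Important:

This plugin is still in beta
Elasticsearch 7.x and 6.x are currently supported
Dynamic index creation is supported
One-to-one and one-to-many table relationships are supported
Many-to-many relationships are not yet supported

Configuration example:

[output]
type = "esmodel"

[output.config]
# Ignore 400 (bad request) responses
# Elasticsearch returns a 400 error when an index name is invalid or a request cannot be parsed
# Defaults to false: a failure raises an error that must be handled manually. Set to true to ignore such requests
ignore-bad-request = true

#
# Target Elasticsearch settings
# - required
#
[output.config.server]
# Elasticsearch addresses to connect to, required
urls = ["http://192.168.1.152:9200"]
# Whether to sniff nodes, defaults to false
sniff = false
# Timeout, defaults to 1000ms
timeout = 500
# Number of retries on failure, defaults to 3
retry-count=3

#
# Target authentication settings
# - optional
#
[output.config.server.auth]
username = ""
password = ""


[[output.config.routes]]
match-schema = "test"
# Main table
match-table = "student"
# Index name
index-name="student_index"
# Type name; has no effect on Elasticsearch 7
type-name="student"
# Number of shards
shards-num=1
# Number of replicas
replicas-num=0
# Columns to include, defaults to all
include-column = []
# Columns to exclude, defaults to none
exclude-column = []

# Column name conversion rules
[output.config.routes.convert-column]
name = "studentName"


[[output.config.routes.one-one]]
match-schema = "test"
match-table = "student_detail"
# Foreign key column
fk-column = "student_id"
# Columns to include, defaults to all
include-column = []
# Columns to exclude, defaults to none
exclude-column = []
# Mode: 1 = nested sub-object, 2 = flattened into the index
mode = 2
# Property object name; only used in mode 1
property-name = "studentDetail"
# Property prefix; may be left empty in mode 1
property-pre = "sd_"

[output.config.routes.one-one.convert-column]
introduce = "introduceInfo"

[[output.config.routes.one-one]]
match-schema = "test"
match-table = "student_class"
# Foreign key column
fk-column = "student_id"
# Columns to include, defaults to all
include-column = []
# Columns to exclude, defaults to none
exclude-column = []
# Mode: 1 = nested sub-object, 2 = flattened into the index
mode = 1
# Property object name; only used in mode 1
property-name = "studentClass"
# Property prefix; may be left empty in mode 1
property-pre = ""

[output.config.routes.one-one.convert-column]
name = "className"

[[output.config.routes.one-many]]
match-schema = "test"
match-table = "student_parent"
# Foreign key column
fk-column = "student_id"
# Columns to include, defaults to all
include-column = []
# Columns to exclude, defaults to none
exclude-column = []
# Property object name
property-name = "studentParent"

[output.config.routes.one-many.convert-column]
name = "parentName"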

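Filter configuration

A Filter defines a series of transformations applied to Input messages.

Filters are configured as an array, and the system runs each Filter in order.

The following Filters are currently supported:

reject ignores matching source messages
delete-dml-column deletes certain columns from source DML messages
rename-dml-column renames certain columns in source DML messages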

reject

[[filters]]
type = "reject"

[filters.config]
match-schema = "test"
match-table = "test_table_*"

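The reject Filter rejects every matching Input message. Rejected messages are not passed to the next Filter and are not sent to the Output.

In the example above, all messages whose schema is test and whose table name starts with test_table_ are filtered out.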

[[filters]]
type = "reject"
[filters.config]
match-schema = "test"
match-dml-op = "delete"

In the example above, all delete DML messages whose schema is test are filtered out.

delete-dml-column

[[filters]]
type = "delete-dml-column"
[filters.config]
match-schema = "test"
match-table = "test_table"
columns = ["e", "f"]

The delete-dml-column Filter deletes certain columns from matching Input messages.

In the example above, columns e and f are removed from every DML message whose schema is test and whose table name is test_table.

rename-dml-column

[[filters]]
type = "rename-dml-column"
[filters.config]
match-schema = "test"
match-table = "test_table"
from = ["a", "b"]
to = ["c", "d"]

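grpc-sidecar

The grpc-sidecar Filter downloads a binary that you specify and starts it as a process. Your program must implement a gRPC service. See the original project documentation for a Golang example.

grpc-sidecar currently only supports modifying the contents of core.Msg.DmlMsg.

The gRPC protocol definition is also linked from the original project documentation.

[[filters]]
type = "grpc-sidecar"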
[filters.config]
match-schema = "test"
match-table = "test_table"
binary-url = "binary url that stores the binary"
name = "unique name of this plugin"

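Scheduler configuration

Only one Scheduler plugin is currently supported:

batch-table-scheduler guarantees that, within a table, updates to the same row, as identified by its primary key values, are applied in order

batch-table-scheduler

#
# batch-table-scheduler settings
# - optional
#
[scheduler]
type = "batch-table-scheduler"

[scheduler.config]
# defaults to 100
nr-worker = 100

# defaults to 1
batch-size = 1

# defaults to 1024
queue-size = 1024

# defaults to 10240
sliding-window-size = 10240

batch-table-scheduler calls the interface defined by Output through a worker pool.

nr-worker is the number of workers.

batch-table-scheduler calls Output in batches and guarantees that all messages in a batch have the same core.Msg.Table.

batch-size is the batch size.

batch-table-scheduler uses a sliding window to make sure that Input positions are saved in order.

sliding-window-size is the size of the sliding window.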
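
The idea of saving positions in order while workers finish out of order can be sketched as follows. This is a simplified, single-goroutine illustration under assumed names (window, add, ack); it is not Gravity's implementation.

```go
package main

import "fmt"

// window tracks messages handed to workers and the highest position that
// is safe to persist.
type window struct {
	next      int          // sequence number assigned to the next message
	committed int          // every sequence number below this is finished
	done      map[int]bool // finished sequence numbers >= committed
	size      int          // maximum number of in-flight messages
}

func newWindow(size int) *window {
	return &window{done: make(map[int]bool), size: size}
}

// add registers a new message; ok is false when the window is full.
func (w *window) add() (seq int, ok bool) {
	if w.next-w.committed >= w.size {
		return 0, false
	}
	seq = w.next
	w.next++
	return seq, true
}

// ack marks seq as written and advances committed over any contiguous
// run of finished messages.
func (w *window) ack(seq int) {
	w.done[seq] = true
	for w.done[w.committed] {
		delete(w.done, w.committed)
		w.committed++
	}
}

func main() {
	w := newWindow(4)
	for i := 0; i < 3; i++ {
		w.add()
	}
	w.ack(2)
	fmt.Println(w.committed) // still 0: messages 0 and 1 are outstanding
	w.ack(0)
	w.ack(1)
	fmt.Println(w.committed) // 3: everything up to message 2 is done
}
```

In this sketch a full window makes add fail, which is how a bounded window would apply back-pressure to the Input while slow writes are still outstanding.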

Matcher

Matchers are used in filters and routers. The existing matchers are listed here.

match-schema = "test"
match-table = "test_table_*"
match-table = ["a*", "b*"]
match-table-regex = "^t_\\d+$" # pay attention to `^` and `$`
match-table-regex = ["^a.*$", "^t_\\d+$"]
match-dml-op = "delete" # rejects ddl
match-dml-op = ["insert", "update", "delete"] 
match-ddl-regex = '(?i)^DROP\sTABLE' # rejects dml
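
The difference between the wildcard and regular-expression matchers can be tried out with the Go standard library. This sketch uses path.Match and regexp purely for illustration; matchTable is a hypothetical helper, and Gravity's own matcher may treat patterns differently.

```go
package main

import (
	"fmt"
	"path"
	"regexp"
)

// matchTable reports whether table matches a glob pattern such as
// "test_table_*". Invalid patterns are treated as non-matching.
func matchTable(pattern, table string) bool {
	ok, err := path.Match(pattern, table)
	return err == nil && ok
}

func main() {
	fmt.Println(matchTable("test_table_*", "test_table_1")) // true
	fmt.Println(matchTable("test_table_*", "other_table"))  // false

	// The TOML string "^t_\\d+$" is the regular expression ^t_\d+$.
	// Without ^ and $ it would also match names such as "xt_42x".
	re := regexp.MustCompile(`^t_\d+$`)
	fmt.Println(re.MatchString("t_42"))   // true
	fmt.Println(re.MatchString("xt_42x")) // false

	// (?i) makes the DDL pattern case-insensitive.
	ddl := regexp.MustCompile(`(?i)^DROP\sTABLE`)
	fmt.Println(ddl.MatchString("drop table test.t")) // true
}
```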

Full MySQL-to-MySQL synchronization configuration file

mysql2mysql-full.toml

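# The configuration consists of 4 parts:
# - input: configuration of the input plugin
# - filters: configuration of the filter plugins; filters transform the data stream
# - output: configuration of the output plugin
# - system: system-level configuration
#
# The system defines a number of match functions around core.Msg. They are used in the
# configuration file to select messages for filters and output routes. A filter or output
# route matches only when every one of its match functions matches.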
#
name = "mysql2mysqlDemo"
version = "1.0"

[input]
type = "mysql"
mode = "replication"

[input.config]
ignore-bidirectional-data = true

[input.config.source]
host = "127.0.0.1"
username = "root"
password = ""
port = 3306
max-open = 50 # optional, max connections
max-idle = 50 # optional, suggest to be the same as max-open

[[filters]]
type = "reject"
[filters.config]
match-schema = "test_db"
match-table = "test_table"

[[filters]]
type = "rename-dml-column"
[filters.config]
match-schema = "test"
match-table = "test_table_2"
from = ["b"]
to = ["d"]

[[filters]]
type = "delete-dml-column"
[filters.config]
match-schema = "test"
match-table = "test_table"
columns = ["e", "f"]

[[filters]]
type = "dml-pk-override"
[filters.config]
match-schema = "test"
match-table = "test_table"
id = "another_id"

[output]
type = "mysql"

[output.config]
enable-ddl = true

[output.config.target]
host = "127.0.0.1"
username = "root"
password = ""
port = 3306
max-open = 20 # optional, max connections
max-idle = 20 # optional, suggest to be the same as max-open

[output.config.sql-engine-config]
type = "mysql-replace-engine"

[output.config.sql-engine-config.config]
tag-internal-txn = true

[[output.config.routes]]
match-schema = "test_db"
match-table = "test_table"
target-schema = "test_db"
target-table = "*"

[scheduler]
type = "batch-table-scheduler"
[scheduler.config]
nr-worker = 20
batch-size = 10
queue-size = 1024
sliding-window-size = 1024
