Data Replication Synchronization Techniques


Replica Synchronization Techniques

There are two ways for the master to propagate updates to the slave: state transfer and operation transfer.

  • In state transfer, the master passes its latest state to the slave, which then replaces its current state with the latest state.
  • In operation transfer, the master propagates a sequence of operations to the slave, which then applies the operations to its local state.
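
The difference between the two modes can be illustrated with a minimal sketch. The `Replica` class, its method names, and the "add a delta to a key" operation format are all hypothetical, chosen only for illustration; they are not from any particular system.

```python
# Hypothetical replica whose state is a dictionary of counters.
class Replica:
    def __init__(self):
        self.state = {}

    def receive_state(self, new_state):
        # State transfer: overwrite the local state with the master's copy.
        self.state = dict(new_state)

    def receive_operations(self, operations):
        # Operation transfer: replay each operation, in order, on local state.
        for key, delta in operations:
            self.state[key] = self.state.get(key, 0) + delta


# The master's latest state, and the operations that produced it.
master_state = {"x": 3, "y": 5}
master_ops = [("x", 1), ("y", 5), ("x", 2)]

a = Replica()
a.receive_state(master_state)      # one message carrying the whole state

b = Replica()
b.receive_operations(master_ops)   # a sequence of small messages
```

Both replicas end up in the same state, but the second path only works if every operation arrives and is applied.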

State Transfer Model

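The advantage is robustness: if a message is lost, a later synchronization can make up for it. The drawback is that more data is transferred.
To improve efficiency, the full state is of course not sent every time; only the changed part (the delta change) is sent.
The question is how, in a distributed environment, to find the differences between remote replicas. The simplest approach is to ship one side's entire data over and compare, but that does not reduce the amount transferred.
A common approach is to use a Merkle tree, exchanging digests to locate the delta change while keeping the transfer volume low.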

The state transfer model is more robust against message loss because, as long as a later, more up-to-date message arrives, the replica is still able to advance to the latest state.

Even in state transfer mode, we don't want to send the full object when updating other replicas, because changes typically happen within a small portion of the object. It would waste network bandwidth to send the unchanged portion, so we need a mechanism to detect and send just the delta (the portion that has changed).
One common approach is to break the object into chunks and compute a hash tree of the object (a Merkle tree). The replicas can then compare their hash trees to figure out which chunks of the object have changed and send only those.
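
The chunk comparison can be sketched as follows. This is a minimal illustration, assuming both replicas split the object into the same number of chunks and use SHA-256; the function names `build_tree` and `changed_chunks` are hypothetical. For simplicity both trees are compared in one process, whereas real replicas would exchange only the hashes they need, level by level.

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_tree(chunks):
    """Return a list of levels: levels[0] holds leaf hashes, levels[-1] the root."""
    level = [h(c) for c in chunks]
    levels = [level]
    while len(level) > 1:
        # Pair up nodes; an odd node at the end is hashed on its own.
        level = [h(b"".join(level[i:i + 2])) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def changed_chunks(a, b):
    """Descend from the root, following only subtrees whose hashes differ."""
    assert len(a[0]) == len(b[0]), "sketch assumes the same number of chunks"
    suspects = [0]  # node indices at the current level, starting at the root
    for depth in range(len(a) - 1, 0, -1):
        next_suspects = []
        for i in suspects:
            if a[depth][i] != b[depth][i]:
                for child in (2 * i, 2 * i + 1):
                    if child < len(a[depth - 1]):
                        next_suspects.append(child)
        suspects = next_suspects
    return [i for i in suspects if a[0][i] != b[0][i]]


old = [b"aaaa", b"bbbb", b"cccc", b"dddd", b"eeee"]
new = [b"aaaa", b"bbXb", b"cccc", b"dddd", b"eeEe"]
diff = changed_chunks(build_tree(old), build_tree(new))
```

Subtrees with matching hashes are never descended into, which is where the bandwidth saving comes from.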

Merkle Tree

Hash tree, http://en.wikipedia.org/wiki/Hash_tree

Hash trees were invented in 1979 by Ralph Merkle.
A hash tree, or Merkle tree, is a tree in which every non-leaf node is labelled with the hash of the labels of its child nodes.

For how to use Merkle trees for efficient replica synchronization, see Amazon's Dynamo, section 4.7, Handling permanent failures: replica synchronization.

Modern Algorithms and Data Structures - 1. Bloom Filters, Merkle Trees

http://www.slideshare.net/quipo/modern-algorithms-and-data-structures-1-bloom-filters-merkle-trees


 

Operation Transfer Model

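The advantage of operation transfer is that little data needs to be transferred, but it needs a reliable messaging system to guarantee that nothing is lost.
In a distributed environment this adds a great deal of complexity, and this approach seems to be rarely used.
In operation transfer mode, usually much less data needs to be sent over the network. However, it requires a reliable messaging mechanism with a delivery order guarantee.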

 

Synchronization Based on Gossip and Vector Clocks

State Transfer Model

In a state transfer model, each replica maintains a vector clock as well as a state version tree in which no state is > or < any other (based on vector clock comparison). In other words, the state version tree contains all the conflicting updates.
Each replica maintains a vector clock and a state version tree (Merkle tree).
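
The vector clock comparison, and the rule that only mutually incomparable versions are kept, might look like the sketch below. It assumes vector clocks are dictionaries from node id to counter and stores the versions as a flat list rather than a tree; `compare` and `add_version` are hypothetical names.

```python
def compare(a, b):
    """Return '<', '>', '=' or '||' (concurrent) for two vector clocks."""
    keys = set(a) | set(b)
    a_le_b = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
    b_le_a = all(b.get(k, 0) <= a.get(k, 0) for k in keys)
    if a_le_b and b_le_a:
        return "="
    if a_le_b:
        return "<"
    if b_le_a:
        return ">"
    return "||"


def add_version(versions, clock, value):
    """Keep only versions that are not dominated by another version."""
    for existing_clock, _ in versions:
        if compare(clock, existing_clock) in ("<", "="):
            return versions  # the new version is already covered
    kept = [(c, v) for c, v in versions if compare(c, clock) != "<"]
    kept.append((clock, value))
    return kept


versions = []
versions = add_version(versions, {"A": 1}, "v1")
versions = add_version(versions, {"A": 1, "B": 1}, "v2")  # supersedes v1
versions = add_version(versions, {"A": 2}, "v3")          # concurrent with v2
```

After these three calls the replica holds v2 and v3, which are in conflict because neither clock dominates the other.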

Query Processing

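The client sends a query request and attaches V-client, the vector clock it holds for that replica.
On receiving it, the server returns the part of the data whose vector clock is newer than the one the client attached (V-state > V-client), together with the corresponding vector clocks.
After receiving the query result, the client merges its own vector clock with the vector clock returned by the server.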
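
These query steps can be sketched roughly as follows, assuming vector clocks are dictionaries and the server keeps a list of (V-state, value) pairs. The function names are hypothetical, and the network hop between client and server is replaced by a direct call.

```python
def dominates(a, b):
    """True if vector clock a > b: a >= b in every entry and a != b."""
    keys = set(a) | set(b)
    return (all(a.get(k, 0) >= b.get(k, 0) for k in keys)
            and any(a.get(k, 0) > b.get(k, 0) for k in keys))


def merge(a, b):
    """Entry-wise maximum of two vector clocks."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in set(a) | set(b)}


def handle_query(server_versions, v_client):
    # Server side: return only the states newer than the client's view.
    return [(clock, value) for clock, value in server_versions
            if dominates(clock, v_client)]


def client_query(server_versions, v_client):
    # Client side: issue the query, then fold the returned clocks into V-client.
    results = handle_query(server_versions, v_client)
    for clock, _ in results:
        v_client = merge(v_client, clock)
    return results, v_client


server_versions = [({"A": 2, "B": 1}, "new"), ({"A": 1}, "old")]
results, v_client = client_query(server_versions, {"A": 1})
```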


Update Processing

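The client sends an update command and attaches its current V-client.
On receiving it, the server checks whether V-client is older than every V-state on the server (V-client < all V-state, meaning the server's state already includes this update); if so, it discards the update command.
Otherwise the update is a new change: the server updates the state value, V-state, and the Merkle tree, and returns the new V-state.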
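
A rough sketch of this update rule is shown below. Two details are assumptions, since the text does not spell them out: the new V-state is built by copying V-client and bumping the server's own entry from a local counter, and the versions dropped are those the client had already seen. The `Server` class is hypothetical, and Merkle tree maintenance is left as a comment.

```python
def leq(a, b):
    return all(a.get(k, 0) <= b.get(k, 0) for k in set(a) | set(b))


def less_than(a, b):
    return leq(a, b) and not leq(b, a)


class Server:
    def __init__(self, server_id):
        self.server_id = server_id
        self.counter = 0
        self.versions = []  # list of (V-state, value)

    def handle_update(self, v_client, value):
        # Stale update: V-client is older than every V-state held here.
        if self.versions and all(less_than(v_client, c) for c, _ in self.versions):
            return None
        # Assumption: derive the new V-state from V-client plus a local counter.
        self.counter += 1
        v_state = dict(v_client)
        v_state[self.server_id] = self.counter
        # Drop the versions the client had already seen; keep concurrent ones.
        self.versions = [(c, v) for c, v in self.versions if not leq(c, v_client)]
        self.versions.append((v_state, value))
        # A real implementation would also refresh the Merkle tree here.
        return v_state


s = Server("S")
v1 = s.handle_update({}, "a")
v2 = s.handle_update(v1, "b")
stale = s.handle_update({"S": 1}, "c")  # older than the current V-state
```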


Internode Gossiping

Synchronizing versions between nodes requires using the Merkle tree to locate the delta change and then completing the synchronization.


 

Operation Transfer Model

In an operation transfer approach, the order in which the operations are applied is very important. At a minimum, causal order needs to be maintained.
Because of the ordering issue, each replica has to defer executing an operation until all the preceding operations have been executed.
Therefore replicas save the operation requests to a log file, exchange the logs with each other, and consolidate these operation logs to figure out the right order in which to apply the operations to their local store.

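This model must guarantee that:
1. Messages are not lost.
2. All update commands are executed in the same order (the total order problem).
3. An update command is deferred until all the update commands before it have been executed.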

 

Query Processing (no real difference from the state transfer model)
When a query is submitted by the client, the client also sends along its vector clock, which reflects the client's view of the world. The replica checks whether it has a view of the state that is later than the client's view.


 

Update Processing

When an update operation is received, the replica buffers it until it can be applied to the local state. Every submitted operation is tagged with two timestamps: V-client is the client's view at the time of the update request, and V-receive is the replica's view when it receives the submission.
The update operation request sits in the queue until the replica has received all the other updates that it depends on. This condition is reflected in the replica's vector clock Vi when it is larger than V-client.
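The key question is how to know whether the operations preceding this update have already been executed.
Comparing vector clocks is one way, as in the algorithm in the figure: when V-client < Vi (the replica's current vector clock), the replica is considered to have executed all the preceding update commands.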
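
The buffering rule can be sketched as below. The sketch treats an operation as ready once V-client is not newer than Vi in any entry (V-client <= Vi), which is an assumption made so that the very first operation, with an empty clock, can be applied. The `Replica` class and its fields are hypothetical.

```python
def leq(a, b):
    return all(a.get(k, 0) <= b.get(k, 0) for k in set(a) | set(b))


class Replica:
    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.clock = {}    # Vi, the replica's current vector clock
        self.state = []    # applied operations, in order
        self.pending = []  # buffered (V-client, V-receive, operation)

    def submit(self, v_client, op):
        v_receive = dict(self.clock)  # the replica's view on receipt
        self.pending.append((v_client, v_receive, op))
        self.apply_ready()

    def apply_ready(self):
        # Keep sweeping the queue until no buffered operation becomes ready.
        progressed = True
        while progressed:
            progressed = False
            for entry in list(self.pending):
                v_client, _, op = entry
                if leq(v_client, self.clock):
                    self.pending.remove(entry)
                    self.state.append(op)
                    mine = self.clock.get(self.replica_id, 0)
                    self.clock[self.replica_id] = mine + 1
                    progressed = True


r = Replica("A")
r.submit({}, "op1")        # no dependencies, applied at once
r.submit({"A": 2}, "op3")  # depends on two earlier updates, so it waits
waiting = len(r.pending)
r.submit({"A": 1}, "op2")  # applying op2 unblocks op3
```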

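However, vector clocks guarantee only a partial order, not a total order, so this method cannot guarantee that updates are applied in exactly the same order everywhere.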

Concurrent updates at different replicas can also happen, which means there can be multiple valid sequences of operations. For different replicas to apply concurrent updates in the same order, we need a total ordering mechanism.
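
The text does not say which total ordering mechanism to use. One possible deterministic rule, shown here purely as an assumption, is to sort operations by the sum of their vector clock entries and break ties by the originating replica's id. If one clock is strictly smaller than another then its sum is smaller too, so this order never contradicts causal order. The tuple layout and the name `total_order_key` are hypothetical.

```python
def total_order_key(op):
    clock, origin, _ = op
    # Sum of entries respects causality; the origin id breaks ties.
    return (sum(clock.values()), origin)


# The same three operations, received in different orders by two replicas.
op_a = ({"A": 1}, "A", "set x=1")
op_b = ({"B": 1}, "B", "set x=2")          # concurrent with op_a
op_c = ({"A": 1, "B": 2}, "B", "set x=3")  # causally after both

seen_by_r1 = [op_a, op_b, op_c]
seen_by_r2 = [op_c, op_b, op_a]

order_r1 = sorted(seen_by_r1, key=total_order_key)
order_r2 = sorted(seen_by_r2, key=total_order_key)
```

Sorting alone is not sufficient in practice: a replica still has to know it has received every operation up to some point before it can safely apply that prefix.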


 

Internode Gossiping (essentially no different from update processing)

In the background, different replicas exchange their logs of queued updates and update each other's vector clocks. After the log exchange, each replica checks whether certain operations can be applied (when all the operations they depend on have been received) and applies them accordingly.
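
A log exchange of this kind might be sketched as follows. The representation is an assumption: each operation is identified by (origin, sequence number), carries the origin's vector clock at creation time as its dependencies, and the replica's clock records the highest applied sequence number per origin. `Replica` and `gossip` are hypothetical names, and the exchange is a direct in-process merge.

```python
def leq(a, b):
    return all(a.get(k, 0) <= b.get(k, 0) for k in set(a) | set(b))


class Replica:
    def __init__(self, rid):
        self.rid = rid
        self.clock = {}    # highest applied sequence number per origin
        self.log = {}      # (origin, seq) -> (dependency clock, operation)
        self.applied = []  # operation ids in the order they were applied

    def local_update(self, operation):
        op_id = (self.rid, self.clock.get(self.rid, 0) + 1)
        self.log[op_id] = (dict(self.clock), operation)
        self.apply_ready()
        return op_id

    def apply_ready(self):
        # Apply every logged operation whose dependencies are satisfied.
        progressed = True
        while progressed:
            progressed = False
            for op_id in sorted(self.log):
                deps, _ = self.log[op_id]
                if op_id not in self.applied and leq(deps, self.clock):
                    origin, seq = op_id
                    self.applied.append(op_id)
                    self.clock[origin] = max(self.clock.get(origin, 0), seq)
                    progressed = True


def gossip(a, b):
    # Exchange logs in both directions, then let each side apply what it can.
    merged = {**a.log, **b.log}
    a.log, b.log = dict(merged), dict(merged)
    a.apply_ready()
    b.apply_ready()


a, b, c = Replica("A"), Replica("B"), Replica("C")
a.local_update("x=1")
b.local_update("y=1")
gossip(a, b)
late = a.local_update("x=2")  # depends on both earlier updates
c.log[late] = a.log[late]     # c hears about the late update first
c.apply_ready()
deferred = list(c.applied)    # still empty: dependencies are missing
gossip(c, a)
```

Note that `a` and `b` apply the two concurrent updates in different orders, which is exactly the case that calls for a total ordering mechanism.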
