CAP Confusion: Problems with ‘partition tolerance’

The ‘CAP’ theorem is a hot topic in the design of distributed data storage systems. However, it’s often misused. In this post I hope to highlight why the common ‘consistency, availability and partition tolerance: pick two’ formulation is inadequate for distributed systems. In fact, the lesson of the theorem is that the choice is almost always between sequential consistency and high availability.

It’s very common to invoke the ‘CAP theorem’ when designing, or talking about designing, distributed data storage systems. The theorem, as commonly stated, gives system designers a choice between three competing guarantees:

Consistency – roughly meaning that all clients of a data store get responses to requests that ‘make sense’. For example, if Client A writes 1 then 2 to location X, Client B cannot read 2 followed by 1.

Availability – all operations on a data store eventually return successfully. We say that a data store is ‘available’ for, e.g. write operations.

Partition tolerance – if the network stops delivering messages between two sets of servers, will the system continue to work correctly?
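
To make the consistency requirement concrete, here is a minimal, hypothetical Python sketch (the `Replica` class is invented for illustration and is not any particular system’s API). Two replicas of location X apply Client A’s writes in different orders, say because replication messages were reordered, and a client reading across them observes 2 followed by 1 – exactly the history the definition above rules out.

```python
# Hypothetical sketch: a replicated register whose replicas apply writes in
# different orders, producing the 2-then-1 read history that consistency forbids.

class Replica:
    def __init__(self):
        self.x = None            # the value stored at location X on this replica

    def apply(self, value):
        self.x = value


r1, r2 = Replica(), Replica()

# Client A writes 1 then 2; replica 2 receives the messages out of order.
r1.apply(1); r1.apply(2)         # replica 1: x == 2
r2.apply(2); r2.apply(1)         # replica 2: x == 1

# Client B reads replica 1, then replica 2, and sees 2 followed by 1.
print([r1.x, r2.x])              # [2, 1] -- a history a consistent store must prevent
```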

This is often summarised as a single sentence: “consistency, availability, partition tolerance. Pick two.” Short, snappy and useful.

At least, that’s the conventional wisdom. Many modern distributed data stores, including those often caught under the ‘NoSQL’ net, pride themselves on offering availability and partition tolerance over strong consistency; the reasoning being that short periods of application misbehavior are less problematic than short periods of unavailability. Indeed, Dr. Michael Stonebraker posted an article on the ACM’s blog bemoaning the preponderance of systems that are choosing the ‘AP’ data point, and arguing that consistency and availability are the two to choose. However, for the vast majority of systems, I contend that the choice is almost always between consistency and availability, and unavoidably so.

Dr. Stonebraker’s central thesis is that, since partitions are rare, we might simply sacrifice ‘partition-tolerance’ in favour of sequential consistency and availability – a model that is well suited to traditional transactional data processing and the maintenance of the good old ACID invariants of most relational databases. I want to illustrate why this is a misinterpretation of the CAP theorem.

We first need to get exactly what is meant by ‘partition tolerance’ straight. Dr. Stonebraker asserts that a system is partition tolerant if processing can continue in both partitions in the case of a network failure.

“If there is a network failure that splits the processing nodes into two groups that cannot talk to each other, then the goal would be to allow processing to continue in both subgroups.”

This is actually a very strong partition tolerance requirement. Digging into the history of the CAP theorem reveals some divergence from this definition.

Seth Gilbert and Professor Nancy Lynch provided both a formalisation and a proof of the CAP theorem in their 2002 SIGACT paper. We should defer to their definition of partition tolerance – if we are going to invoke CAP as a mathematical truth, we should formalise our foundations, otherwise we are building on very shaky ground. Gilbert and Lynch define partition tolerance as follows:

“The network will be allowed to lose arbitrarily many messages sent from one node to another”

Note that Gilbert and Lynch’s definition isn’t a property of a distributed application, but a property of the network in which it executes. This is often misunderstood: partition tolerance is not something we have a choice about designing into our systems. If you have a partition in your network, you lose either consistency (because you allow updates to both sides of the partition) or you lose availability (because you detect the error and shut down the system until the error condition is resolved). Partition tolerance means simply developing a coping strategy by choosing which of the other system properties to drop. This is the real lesson of the CAP theorem – if you have a network that may drop messages, then you cannot have both availability and consistency; you must choose one. We should really be writing Possibility of Network Partitions => not(availability and consistency), but that’s not nearly so snappy.
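
As a rough illustration of that coping strategy, here is a hedged sketch (the `PartitionedStore` class and its parameters are invented for this example, not any real system’s interface) of the two options during a partition: a store that prefers consistency refuses writes when it cannot reach a quorum of replicas, while a store that prefers availability accepts them and risks conflicting updates on the other side.

```python
# Illustrative only: the choice a data store faces when a partition means it
# cannot reach a majority (quorum) of its replicas.

class PartitionedStore:
    def __init__(self, reachable_replicas, total_replicas, prefer="consistency"):
        self.reachable = reachable_replicas   # replicas on our side of the partition
        self.total = total_replicas
        self.prefer = prefer
        self.data = {}                        # writes applied on this side only

    def write(self, key, value):
        quorum = self.total // 2 + 1
        if self.reachable >= quorum:
            self.data[key] = value            # a majority saw it: safe either way
            return "ok"
        if self.prefer == "consistency":
            # Drop availability: refuse the write rather than risk divergence.
            raise RuntimeError("unavailable: cannot reach a quorum")
        # Drop consistency: accept the write; the other partition may accept a
        # conflicting one, and the two histories must be reconciled later.
        self.data[key] = value
        return "accepted (may conflict)"


ap = PartitionedStore(reachable_replicas=1, total_replicas=3, prefer="availability")
print(ap.write("x", 42))                      # accepted (may conflict)

cp = PartitionedStore(reachable_replicas=1, total_replicas=3, prefer="consistency")
# cp.write("x", 42) would raise RuntimeError: the consistent store gives up availability.
```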

Dr. Stonebraker’s definition of partition tolerance is actually a measure of availability – if a write may go to either partition, will it eventually be responded to? This is a very meaningful question for systems distributed across many geographic locations, but for the LAN case it is less common to have two partitions available for writes. However, it is encompassed by the requirement for availability that we already gave – if your system is available for writes at all times, then it is certainly available for writes during a network partition.

So what causes partitions? Two things, really. The first is obvious – a network failure, for example due to a faulty switch, can cause the network to partition. The other is less obvious, but fits with the definition from Gilbert and Lynch: machine failures, either hard or soft. In an asynchronous network, i.e. one where processing a message could take unbounded time, it is impossible to distinguish between machine failures and lost messages. Therefore a single machine failure partitions that machine from the rest of the network. A correlated failure of several machines partitions them all from the network. Not being able to receive a message is the same as the network not delivering it. In the face of sufficiently many machine failures, it is still impossible to maintain availability and consistency, not because two writes may go to separate partitions, but because the failure of an entire ‘quorum’ of servers may render some recent writes unreadable.
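
Here is a hypothetical sketch (the address, port and timeout are placeholders) of why the two failure modes collapse into one in an asynchronous network: a timeout-based probe cannot tell a crashed machine from one whose messages are merely delayed or dropped, so the rest of the system has to treat both the same way – as a partition.

```python
# Illustrative failure probe: a timeout cannot distinguish a crashed node from a
# slow or partitioned one, so both are treated as partitioned away.

import socket

def probe(host: str, port: int, timeout_s: float = 1.0) -> str:
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return "reachable"
    except OSError:
        # Crashed? Partitioned? Just slow? From here they look identical.
        return "suspected failed (crash or lost messages -- we cannot tell which)"

# print(probe("10.0.0.5", 9000))   # placeholder address and port
```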

This is why defining P as ‘allowing partitioned groups to remain available’ is misleading – machine failures are partitions, almost tautologously, and failed machines by definition cannot be available. Yet, Dr. Stonebraker says that he would suggest choosing CA rather than P. This feels rather like we are invited to both have our cake and eat it. Not ‘choosing’ P is analogous to building a network that will never experience multiple correlated failures. This is unreasonable for a distributed system – precisely for all the valid reasons that are laid out in the CACM post about correlated failures, OS bugs and cluster disasters – so what a designer has to do is to decide between maintaining consistency and availability. Dr. Stonebraker tells us to choose consistency, in fact, because availability will unavoidably be impacted by large failure incidents. This is a legitimate design choice, and one that the traditional RDBMS lineage of systems has explored to its fullest, but it protects us neither from availability problems stemming from smaller failure incidents, nor from the high cost of maintaining sequential consistency.

When the scale of a system increases to many hundreds or thousands of machines, writing in such a way to allow consistency in the face of potential failures can become very expensive (you have to write to one more machine than failures you are prepared to tolerate at once). This kind of nuance is not captured by the CAP theorem: consistency is often much more expensive in terms of throughput or latency to maintain than availability. Systems such as ZooKeeper are explicitly sequentially consistent because there are few enough nodes in a cluster that the cost of writing to quorum is relatively small. The Hadoop Distributed File System (HDFS) also chooses consistency – three failed datanodes can render a file’s blocks unavailable if you are unlucky. Both systems are designed to work in real networks, however, where partitions and failures will occur*, and when they do both systems will become unavailable, having made their choice between consistency and availability. That choice remains the unavoidable reality for distributed data stores.
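
As a back-of-the-envelope illustration of that cost (a sketch under the usual quorum-replication assumptions, not tied to ZooKeeper or HDFS specifically): to keep a write readable despite f simultaneous failures it must be acknowledged by at least f + 1 replicas, and read and write quorums must overlap so that every read meets at least one replica that saw the latest write.

```python
# Quorum arithmetic sketch: the price of consistency grows with the number of
# simultaneous failures you want to tolerate.

def min_write_acks(failures_tolerated: int) -> int:
    # One more machine than the failures you are prepared to tolerate at once.
    return failures_tolerated + 1

def quorums_overlap(n: int, write_quorum: int, read_quorum: int) -> bool:
    # Classic intersection condition: R + W > N guarantees a reader always
    # contacts at least one replica that acknowledged the latest write.
    return read_quorum + write_quorum > n

print(min_write_acks(2))                                     # 3 acks to survive 2 failures
print(quorums_overlap(n=5, write_quorum=3, read_quorum=3))   # True  -- reads see latest write
print(quorums_overlap(n=5, write_quorum=2, read_quorum=2))   # False -- stale reads possible
```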

Further Reading
*For more on the inevitability of failure modes in large distributed systems, the interested reader is referred to James Hamilton’s LISA ’07 paper On Designing and Deploying Internet-Scale Services.

Daniel Abadi has written an excellent critique of the CAP theorem.

James Hamilton also responds to Dr. Stonebraker’s blog entry, agreeing (as I do) with the problems of eventual consistency but taking issue with the notion of infrequent network partitions.
