The Google File System, Part 3: System Interactions

3. SYSTEM INTERACTIONS
We designed the system to minimize the master’s involvement in all operations. 
With that background, we now describe how the client, master, and chunkservers interact to implement data mutations, atomic record append, and snapshot.

3.1 Leases and Mutation Order
A mutation is an operation that changes the contents or metadata of a chunk such as a write or an append operation. 
Each mutation is performed at all the chunk’s replicas.
We use leases to maintain a consistent mutation order across replicas. 
The master grants a chunk lease to one of the replicas, which we call the primary. 
The primary picks a serial order for all mutations to the chunk. 
All replicas follow this order when applying mutations. 

Thus, the global mutation order is defined first by the lease grant order chosen by the master, and within a lease by the serial numbers assigned by the primary.
The lease mechanism is designed to minimize management overhead at the master. 
A lease has an initial timeout of 60 seconds. 
However, as long as the chunk is being mutated, the primary can request and typically receive extensions from the master indefinitely. 
These extension requests and grants are piggybacked on the HeartBeat messages regularly exchanged between the master and all chunkservers.
The master may sometimes try to revoke a lease before it expires 
(e.g., when the master wants to disable mutations on a file that is being renamed). 
Even if the master loses communication with a primary, it can safely grant a new lease to another replica after the old lease expires.
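The sketch below (Python) shows one way the master-side bookkeeping implied by this paragraph could look: a per-chunk lease record with the 60-second timeout, an extension path meant to be driven by the HeartBeat exchange, and an early-revocation hook. The class and function names are invented for illustration and are not GFS interfaces.

```python
import time

LEASE_TIMEOUT = 60.0  # initial lease timeout from the text: 60 seconds

class ChunkLease:
    """Master-side bookkeeping for one chunk lease (illustrative only)."""

    def __init__(self, chunk_handle, primary):
        self.chunk_handle = chunk_handle
        self.primary = primary                        # replica currently holding the lease
        self.expires_at = time.time() + LEASE_TIMEOUT

    def is_valid(self):
        return time.time() < self.expires_at

    def extend(self):
        # Extension requests and grants ride on the regular HeartBeat
        # exchange; while the chunk is being mutated, the primary keeps
        # asking and the master keeps granting.
        self.expires_at = time.time() + LEASE_TIMEOUT

    def revoke(self):
        # The master may cut a lease short, e.g. before renaming a file.
        self.expires_at = 0.0


def grant_or_reuse_lease(leases, chunk_handle, pick_replica):
    """Grant a new lease only if no valid one exists for this chunk."""
    lease = leases.get(chunk_handle)
    if lease is None or not lease.is_valid():
        lease = ChunkLease(chunk_handle, primary=pick_replica(chunk_handle))
        leases[chunk_handle] = lease
    return lease
```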
In Figure 2, we illustrate this process by following the control flow of a write through these numbered steps.


1. The client asks the master which chunkserver holds the current lease for the chunk and the locations of the other replicas. 
If no one has a lease, the master grants one to a replica it chooses (not shown).

2. The master replies with the identity of the primary and the locations of the other (secondary) replicas. 
The client caches this data for future mutations. 
It needs to contact the master again only when the primary becomes unreachable or replies that it no longer holds a lease.
3. The client pushes the data to all the replicas. 
A client can do so in any order. 
Each chunkserver will store the data in an internal LRU buffer cache until the data is used or aged out. 
By decoupling the data flow from the control flow, we can improve performance by scheduling the expensive data flow based on the network topology regardless of which chunkserver is the primary. 
Section 3.2 discusses this further.
4. Once all the replicas have acknowledged receiving the data, the client sends a write request to the primary.
The request identifies the data pushed earlier to all of the replicas. 
The primary assigns consecutive serial numbers to all the mutations it receives, possibly from multiple clients, which provides the necessary serialization. 
It applies the mutation to its own local state in serial number order.
5. The primary forwards the write request to all secondary replicas. 
Each secondary replica applies mutations in the same serial number order assigned by the primary.

6. The secondaries all reply to the primary indicating that they have completed the operation.
7. The primary replies to the client. 
Any errors encountered at any of the replicas are reported to the client.
In case of errors, the write may have succeeded at the primary and an arbitrary subset of the secondary replicas. 
(If it had failed at the primary, it would not have been assigned a serial number and forwarded.)
The client request is considered to have failed, and the modified region is left in an inconsistent state. 
Our client code handles such errors by retrying the failed mutation. 
It will make a few attempts at steps (3) through (7) before falling back to a retry from the beginning of the write.
If a write by the application is large or straddles a chunk boundary, GFS client code breaks it down into multiple write operations. 
They all follow the control flow described above but may be interleaved with and overwritten by concurrent operations from other clients. Therefore, the shared file region may end up containing fragments from different clients, although the replicas will be identical because the individual operations are completed successfully in the same order on all replicas. 
This leaves the file region in a consistent but undefined state, as noted in Section 2.7.
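To tie the seven steps together, here is a hedged sketch of the client-side write path. The RPC-style calls (find_lease_holder, push_data, write, retry_from_beginning) and the MAX_ATTEMPTS constant are hypothetical stand-ins for whatever the real client library does.

```python
MAX_ATTEMPTS = 3   # "a few attempts" at steps (3)-(7); the exact count is assumed

def gfs_write(client, master, chunk_handle, data):
    """Illustrative client-side write path following steps (1)-(7) above.
    client, master, and the replica objects are duck-typed stand-ins."""
    # Steps (1)-(2): ask the master for the primary and the secondaries,
    # unless a previous mutation already cached them.
    cached = client.cache.get(chunk_handle)
    primary, secondaries = cached if cached else master.find_lease_holder(chunk_handle)
    client.cache[chunk_handle] = (primary, secondaries)

    for _ in range(MAX_ATTEMPTS):
        # Step (3): push the data to all replicas, in any order; each
        # chunkserver parks it in an LRU buffer cache until used or aged out.
        data_id = client.push_data([primary] + list(secondaries), data)

        # Steps (4)-(7): the primary assigns a serial number, applies the
        # mutation locally, forwards the request to the secondaries, waits
        # for their acknowledgements, and replies to us.
        reply = primary.write(chunk_handle, data_id)
        if reply.ok:
            return reply
        # An error at any replica leaves the region inconsistent; retry
        # steps (3)-(7) a few times before restarting the whole write.
    return client.retry_from_beginning(chunk_handle, data)
```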


3.2 Data Flow


We decouple the flow of data from the flow of control to use the network efficiently. 
While control flows from the client to the primary and then to all secondaries, data is pushed linearly along a carefully picked chain of chunkservers in a pipelined fashion. 
Our goals are to fully utilize each machine’s network bandwidth, avoid network bottlenecks and high-latency links, and minimize the latency to push through all the data.
To fully utilize each machine’s network bandwidth, the data is pushed linearly along a chain of chunkservers rather than distributed in some other topology (e.g., tree). 
Thus, each machine’s full outbound bandwidth is used to transfer the data as fast as possible rather than divided among multiple recipients.
To avoid network bottlenecks and high-latency links (e.g., inter-switch links are often both) as much as possible, each machine forwards the data to the “closest” machine in the network topology that has not received it. 
Suppose the client is pushing data to chunkservers S1 through S4. 
It sends the data to the closest chunkserver, say S1. 
S1 forwards it to whichever of S2 through S4 is closest to S1, say S2. 
Similarly, S2 forwards it to S3 or S4, whichever is closer to S2, and so on. 
Our network topology is simple enough that “distances” can be accurately estimated from IP addresses.
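As a rough illustration of "closest machine first", the sketch below greedily builds a forwarding chain using a distance estimated from IP addresses. The paper says only that distances can be estimated from IPs; the shared-prefix heuristic used here is an assumption for illustration.

```python
import ipaddress

def ip_distance(a, b):
    """Crude 'distance': fewer shared leading bits = farther apart.
    This prefix heuristic is an assumption, not the paper's method."""
    xa = int(ipaddress.ip_address(a))
    xb = int(ipaddress.ip_address(b))
    return (xa ^ xb).bit_length()   # 0 when identical, up to 32 for IPv4

def forwarding_chain(client_ip, replica_ips):
    """Greedily order replicas so each hop goes to the closest machine
    that has not yet received the data (S1, then S2, and so on)."""
    chain, current, remaining = [], client_ip, set(replica_ips)
    while remaining:
        nxt = min(remaining, key=lambda ip: ip_distance(current, ip))
        chain.append(nxt)
        remaining.remove(nxt)
        current = nxt
    return chain

# Example: the client pushes to the nearest chunkserver first, which then
# forwards to the nearest of the rest, and so on down the chain.
print(forwarding_chain("10.0.1.5", ["10.0.1.9", "10.0.3.7", "10.0.1.20", "10.0.2.2"]))
```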

Finally, we minimize latency by pipelining the data transfer over TCP connections. 
Once a chunkserver receives some data, it starts forwarding immediately. 
Pipelining is especially helpful to us because we use a switched network with full-duplex links. 
Sending the data immediately does not reduce the receive rate. Without network congestion, the ideal elapsed time for transferring B bytes to R replicas is B/T + RL, where T is the network throughput and L is the latency to transfer bytes between two machines. 
Our network links are typically 100 Mbps (T), and L is far below 1 ms.
Therefore, 1 MB can ideally be distributed in about 80 ms.
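Plugging the numbers from this paragraph into B/T + RL reproduces the 80 ms estimate. The latency term assumes the default of three replicas and a generous 1 ms per hop, so it adds only a few milliseconds:

```python
def ideal_transfer_time(bytes_total, replicas, throughput_bps, latency_s):
    """Ideal elapsed time B/T + R*L for pushing B bytes through a
    pipelined chain of R replicas (no network congestion)."""
    return bytes_total / throughput_bps + replicas * latency_s

B = 1 * 1000 * 1000          # 1 MB
T = 100 * 1000 * 1000 / 8    # 100 Mbps expressed in bytes per second
R = 3                        # default replication level (assumed here)
L = 0.001                    # 1 ms upper bound on per-hop latency

# Prints roughly 83 ms: 80 ms for B/T plus 3 ms for the R*L term.
print(f"{ideal_transfer_time(B, R, T, L) * 1000:.1f} ms")
```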



3.3 Atomic Record Appends
GFS provides an atomic append operation called record append. 
In a traditional write, the client specifies the offset at which data is to be written. 
Concurrent writes to the same region are not serializable: 
the region may end up containing data fragments from multiple clients. 
In a record append, however, the client specifies only the data. 
GFS appends it to the file at least once atomically (i.e., as one continuous sequence of bytes) at an offset of GFS’s choosing and returns that offset to the client. 
This is similar to writing to a file opened in O_APPEND mode in Unix without the race conditions when multiple writers do so concurrently.
Record append is heavily used by our distributed applications in which many clients on different machines append to the same file concurrently. 
Clients would need additional complicated and expensive synchronization, for example through a distributed lock manager, if they do so with traditional writes. 
In our workloads, such files often serve as multiple-producer/single-consumer queues or contain merged results from many different clients.
Record append is a kind of mutation and follows the control flow in Section 3.1 with only a little extra logic at the primary. 
The client pushes the data to all replicas of the last chunk of the file. Then, it sends its request to the primary. 
The primary checks to see if appending the record to the current chunk would cause the chunk to exceed the maximum size (64 MB). 
If so, it pads the chunk to the maximum size, tells secondaries to do the same, and replies to the client indicating that the operation should be retried on the next chunk. 
(Record append is restricted to be at most one-fourth of the maximum chunk size to keep worst-case fragmentation at an acceptable level.) 
If the record fits within the maximum size, which is the common case, the primary appends the data to its replica, tells the secondaries to write the data at the exact offset where it has, and finally replies success to the client.
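The primary's extra logic for record append can be summarized in a short sketch. The chunk and replica objects and the zero-byte padding are illustrative assumptions; the text above specifies only the size checks, the padding step, and the fixed-offset write.

```python
CHUNK_MAX = 64 * 1024 * 1024          # maximum chunk size: 64 MB
APPEND_LIMIT = CHUNK_MAX // 4         # record append capped at 1/4 chunk size

def primary_handle_record_append(chunk, record, secondaries):
    """Sketch of the extra record-append logic at the primary; chunk and
    the replica objects are hypothetical stand-ins."""
    assert len(record) <= APPEND_LIMIT, "oversized appends are rejected up front"

    if chunk.size + len(record) > CHUNK_MAX:
        # Record would not fit: pad this chunk to 64 MB on every replica
        # and tell the client to retry on the next chunk.
        pad = b"\0" * (CHUNK_MAX - chunk.size)   # padding content is an assumption
        chunk.append(pad)
        for s in secondaries:
            s.append(chunk.handle, pad)
        return {"status": "retry_next_chunk"}

    # Common case: append locally, then have every secondary write the
    # record at exactly the same offset, and report success to the client.
    offset = chunk.size
    chunk.append(record)
    for s in secondaries:
        s.write_at(chunk.handle, offset, record)
    return {"status": "ok", "offset": offset}
```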


If a record append fails at any replica, the client retries the operation. 
As a result, replicas of the same chunk may contain different data possibly including duplicates of the same record in whole or in part. 
GFS does not guarantee that all replicas are bytewise identical. 
It only guarantees that the data is written at least once as an atomic unit. 
This property follows readily from the simple observation that for the operation to report success, the data must have been written at the same offset on all replicas of some chunk. 
Furthermore, after this, all replicas are at least as long as the end of record and therefore any future record will be assigned a higher offset or a different chunk even if a different replica later becomes the primary. 
In terms of our consistency guarantees, the regions in which successful record append operations have written their data are defined (hence consistent), whereas intervening regions are inconsistent (hence undefined). 
Our applications can deal with inconsistent regions as we discussed in Section 2.7.2.
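Section 2.7.2 describes readers discarding padding and duplicate records using application-level identifiers. A minimal sketch of that reader-side pattern, assuming a hypothetical (record_id, payload) framing, might look like this:

```python
def dedup_records(records):
    """Drop duplicate records produced by append retries. Assumes each
    record already carries an application-level unique id, as suggested
    for readers in Section 2.7.2; the framing here is purely illustrative."""
    seen = set()
    for record_id, payload in records:
        if record_id in seen:
            continue          # a retried append wrote this record more than once
        seen.add(record_id)
        yield payload

# Example: the second copy of record "a1" (from a retried append) is skipped.
stream = [("a1", b"result-1"), ("b2", b"result-2"), ("a1", b"result-1")]
print(list(dedup_records(stream)))
```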


3.4 Snapshot
The snapshot operation makes a copy of a file or a directory tree (the “source”) almost instantaneously, while minimizing any interruptions of ongoing mutations. 
Our users use it to quickly create branch copies of huge data sets (and often copies of those copies, recursively), or to checkpoint the current state before experimenting with changes that can later be committed or rolled back easily.
Like AFS [5], we use standard copy-on-write techniques to implement snapshots. When the master receives a snapshot request, it first revokes any outstanding leases on the chunks in the files it is about to snapshot. 
This ensures that any subsequent writes to these chunks will require an interaction with the master to find the lease holder. 
This will give the master an opportunity to create a new copy of the chunk first.
After the leases have been revoked or have expired, the master logs the operation to disk. 
It then applies this log record to its in-memory state by duplicating the metadata for the source file or directory tree. 
The newly created snapshot files point to the same chunks as the source files.
The first time a client wants to write to a chunk C after the snapshot operation, it sends a request to the master to find the current lease holder. 
The master notices that the reference count for chunk C is greater than one. 
It defers replying to the client request and instead picks a new chunk handle C’. 
It then asks each chunkserver that has a current replica of C to create a new chunk called C’. 
By creating the new chunk on the same chunkservers as the original, we ensure that the data can be copied locally, not over the network (our disks are about three times as fast as our 100 Mb Ethernet links). 
From this point, request handling is no different from that for any chunk: 
the master grants one of the replicas a lease on the new chunk C’ and replies to the client, which can write the chunk normally, not knowing that it has just been created from an existing chunk.
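A compact sketch of the two master-side paths described above, with hypothetical field and helper names (refcount, clone_chunk, and so on):

```python
def master_handle_snapshot(master, source_path, snapshot_path):
    """Sketch of the snapshot steps described above; the master fields and
    helpers are illustrative stand-ins."""
    # 1. Revoke (or wait out) outstanding leases so any later write must
    #    come back to the master to find a lease holder.
    for handle in master.chunks_under(source_path):
        master.revoke_lease(handle)

    # 2. Log the operation durably, then duplicate the metadata in memory.
    #    The new snapshot files point at the same chunks as the source.
    master.log_to_disk(("snapshot", source_path, snapshot_path))
    master.copy_metadata(source_path, snapshot_path)
    for handle in master.chunks_under(source_path):
        master.refcount[handle] += 1


def master_handle_write_request(master, handle):
    """First write to a shared chunk C after a snapshot triggers the copy."""
    if master.refcount[handle] > 1:
        new_handle = master.new_chunk_handle()          # this is C'
        for server in master.replica_locations(handle):
            # Copy on the same chunkservers so the data moves locally,
            # not over the network.
            server.clone_chunk(handle, new_handle)
        master.refcount[handle] -= 1
        handle = new_handle
    return master.grant_lease(handle)                   # then the write proceeds normally
```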
