The Google File System, Part 2: Design Overview

2. DESIGN OVERVIEW
2.1 Assumptions
In designing a file system for our needs, we have been guided by assumptions that offer both challenges and opportunities. 
We alluded to some key observations earlier and now lay out our assumptions in more detail.

(1)
The system is built from many inexpensive commodity components that often fail. 
It must constantly monitor itself and detect, tolerate, and recover promptly from component failures on a routine basis.
(2)
The system stores a modest number of large files. 
We expect a few million files, each typically 100 MB or larger in size. 
Multi-GB files are the common case and should be managed efficiently.
Small files must be supported, but we need not optimize for them.
(3)
The workloads primarily consist of two kinds of reads:
large streaming reads and small random reads. 
In large streaming reads, individual operations typically read hundreds of KBs, more commonly 1 MB or more.
Successive operations from the same client often read through a contiguous region of a file. 
A small random read typically reads a few KBs at some arbitrary offset. 
Performance-conscious applications often batch and sort their small reads to advance steadily through the file rather than go back and forth (a short sketch of this pattern follows the list).
(4)
The workloads also have many large, sequential writes that append data to files. 
Typical operation sizes are similar to those for reads. 
Once written, files are seldom modified again. 
Small writes at arbitrary positions in a file are supported but do not have to be efficient.
(5)
The system must efficiently implement well-defined semantics for multiple clients that concurrently append to the same file. 
Our files are often used as producer-consumer queues or for many-way merging. 
Hundreds of producers, running one per machine, will concurrently append to a file. 
Atomicity with minimal synchronization overhead is essential. 
The file may be read later, or a consumer may be reading through the file simultaneously.
(6)
High sustained bandwidth is more important than low latency. 
Most of our target applications place a premium on processing data in bulk at a high rate, while few have stringent response time requirements for an individual read or write.
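
As a concrete illustration of the small-read pattern in assumption (3), the sketch below batches a set of random reads and sorts them by offset so the file is traversed front to back in a single pass. It is a minimal, hypothetical example against an ordinary local file object, not GFS client code.

```python
# Minimal sketch (hypothetical, not GFS client code): batch and sort small
# random reads so the file is read front to back instead of seeking back and forth.
def batched_small_reads(f, requests):
    """requests: iterable of (offset, length) pairs; returns {(offset, length): bytes}."""
    results = {}
    for offset, length in sorted(requests):   # advance steadily through the file
        f.seek(offset)
        results[(offset, length)] = f.read(length)
    return results

# Usage:
#   with open("big.dat", "rb") as f:
#       data = batched_small_reads(f, [(4096, 512), (0, 128)])
```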

2.2 Interface
GFS provides a familiar file system interface, though it does not implement a standard API such as POSIX. 
Files are organized hierarchically in directories and identified by path names. 
We support the usual operations to create, delete, open, close, read, and write files.
Moreover, GFS has snapshot and record append operations. 
Snapshot creates a copy of a file or a directory tree at low cost. 
Record append allows multiple clients to append data to the same file concurrently while guaranteeing the atomicity of each individual client’s append. 
It is useful for implementing multi-way merge results and producer-consumer queues that many clients can simultaneously append to without additional locking. 
We have found these types of files to be invaluable in building large distributed applications. 
Snapshot and record append are discussed further in Sections 3.4 and 3.3 respectively.
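
The operations above can be pictured as a narrow client-side surface. The sketch below is a hypothetical Python rendering of that surface (names and signatures are illustrative assumptions; the actual GFS client library is C++ and its API is not spelled out in the paper), with record append returning the offset that GFS chose.

```python
# Hypothetical sketch of the client-facing operations described in Section 2.2.
from abc import ABC, abstractmethod

class GFSClient(ABC):
    @abstractmethod
    def create(self, path: str) -> None: ...
    @abstractmethod
    def delete(self, path: str) -> None: ...
    @abstractmethod
    def open(self, path: str) -> "FileHandle": ...
    @abstractmethod
    def snapshot(self, src_path: str, dst_path: str) -> None: ...  # low-cost copy of a file or directory tree

class FileHandle(ABC):
    @abstractmethod
    def read(self, offset: int, length: int) -> bytes: ...
    @abstractmethod
    def write(self, offset: int, data: bytes) -> None: ...   # write at an application-specified offset
    @abstractmethod
    def record_append(self, data: bytes) -> int: ...          # atomic append; returns the offset GFS chose
    @abstractmethod
    def close(self) -> None: ...
```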

2.3 Architecture

A GFS cluster consists of a single master and multiple chunkservers and is accessed by multiple clients, as shown in Figure 1. 


Each of these is typically a commodity Linux machine running a user-level server process. 
It is easy to run both a chunkserver and a client on the same machine, as long as machine resources permit and the lower reliability caused by running possibly flaky application code is acceptable.
Files are divided into fixed-size chunks. 
Each chunk is identified by an immutable and globally unique 64 bit chunk handle assigned by the master at the time of chunk creation.
Chunkservers store chunks on local disks as Linux files and read or write chunk data specified by a chunk handle and byte range. 
For reliability, each chunk is replicated on multiple chunkservers. 
By default, we store three replicas, though users can designate different replication levels for different regions of the file namespace.

The master maintains all file system metadata. 
This includes the namespace, access control information, the mapping from files to chunks, and the current locations of chunks.
It also controls system-wide activities such as chunk lease management, garbage collection of orphaned chunks, and chunk migration between chunkservers. 
The master periodically communicates with each chunkserver in HeartBeat messages to give it instructions and collect its state.
GFS client code linked into each application implements the file system API and communicates with the master and chunkservers to read or write data on behalf of the application. 
Clients interact with the master for metadata operations, but all data-bearing communication goes directly to the chunkservers. 
We do not provide the POSIX API and therefore need not hook into the Linux vnode layer.
Neither the client nor the chunkserver caches file data.
Client caches offer little benefit because most applications stream through huge files or have working sets too large to be cached. 
Not having them simplifies the client and the overall system by eliminating cache coherence issues. (Clients do cache metadata, however.)
Chunkservers need not cache file data because chunks are stored as local files and so Linux’s buffer cache already keeps frequently accessed data in memory.
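
To make the division of labor concrete, here is a hypothetical sketch of a chunkserver storing chunk replicas as ordinary local files keyed by the 64-bit chunk handle, and of the state it reports to the master in HeartBeat messages. File naming, field names, and the report shape are assumptions for illustration, not the actual implementation.

```python
# Hypothetical sketch: chunk replicas as plain local files named by chunk handle,
# plus the rough shape of the state a chunkserver reports to the master.
import os
from dataclasses import dataclass, field

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB (Section 2.5)

@dataclass
class ChunkServer:
    root: str                                   # local directory holding chunk files
    chunks: set = field(default_factory=set)    # 64-bit chunk handles stored here

    def _path(self, handle: int) -> str:
        return os.path.join(self.root, f"{handle:016x}.chunk")

    def write_chunk(self, handle: int, offset: int, data: bytes) -> None:
        assert offset + len(data) <= CHUNK_SIZE
        mode = "r+b" if handle in self.chunks else "wb"
        with open(self._path(handle), mode) as f:
            f.seek(offset)
            f.write(data)
        self.chunks.add(handle)

    def read_chunk(self, handle: int, offset: int, length: int) -> bytes:
        with open(self._path(handle), "rb") as f:
            f.seek(offset)
            return f.read(length)

    def heartbeat_state(self) -> dict:
        # The master collects this kind of state via HeartBeat messages.
        return {"chunk_handles": sorted(self.chunks)}
```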

2.4 Single Master
Having a single master vastly simplifies our design and enables the master to make sophisticated chunk placement and replication decisions using global knowledge. 
However, we must minimize its involvement in reads and writes so that it does not become a bottleneck. 
Clients never read and write file data through the master. 

Instead, a client asks the master which chunkservers it should contact. 
It caches this information for a limited time and interacts with the chunkservers directly for many subsequent operations.
Let us explain the interactions for a simple read with reference to Figure 1. 
First, using the fixed chunk size, the client translates the file name and byte offset specified by the application into a chunk index within the file.
Then, it sends the master a request containing the file name and chunk index. 
The master replies with the corresponding chunk handle and locations of the replicas. 
The client caches this information using the file name and chunk index as the key.
The client then sends a request to one of the replicas, most likely the closest one. 
The request specifies the chunk handle and a byte range within that chunk. 
Further reads of the same chunk require no more client-master interaction until the cached information expires or the file is reopened.
In fact, the client typically asks for multiple chunks in the same request and the master can also include the information for chunks immediately following those requested. 
This extra information sidesteps several future client-master interactions at practically no extra cost.
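
The read path just described reduces to a few small steps: translate (file name, byte offset) into a chunk index using the fixed chunk size, ask the master for the chunk handle and replica locations unless they are already cached, then read the byte range from one replica. The sketch below is a hypothetical Python rendering of those steps; master_lookup and read_from_replica stand in for RPCs whose wire format the paper does not spell out, and the read is assumed not to cross a chunk boundary.

```python
# Hypothetical sketch of the client read path in Figure 1.
CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB (Section 2.5)

location_cache = {}  # (file_name, chunk_index) -> (chunk_handle, replica_locations)

def gfs_read(file_name, offset, length, master_lookup, read_from_replica):
    # Assumes offset..offset+length stays within a single chunk, for brevity.
    chunk_index = offset // CHUNK_SIZE               # 1. fixed chunk size -> chunk index
    key = (file_name, chunk_index)
    if key not in location_cache:                    # 2. contact the master only on a cache miss
        location_cache[key] = master_lookup(file_name, chunk_index)
    chunk_handle, replicas = location_cache[key]
    replica = replicas[0]                            # 3. pick a replica, ideally the closest one
    return read_from_replica(replica, chunk_handle, offset % CHUNK_SIZE, length)
```

In practice the client would also batch lookups for several chunk indices into one master request, as the text notes.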

2.5 Chunk Size
Chunk size is one of the key design parameters. We have chosen 64 MB, which is much larger than typical file system block sizes. 
Each chunk replica is stored as a plain Linux file on a chunkserver and is extended only as needed.
Lazy space allocation avoids wasting space due to internal fragmentation, perhaps the greatest objection against such a large chunk size.
A large chunk size offers several important advantages.
First, it reduces clients’ need to interact with the master because reads and writes on the same chunk require only one initial request to the master for chunk location information. 
The reduction is especially significant for our workloads because applications mostly read and write large files sequentially. 
Even for small random reads, the client can comfortably cache all the chunk location information for a multi-TB working set. 
Second, since on a large chunk, a client is more likely to perform many operations on a given chunk, it can reduce network overhead by keeping a persistent TCP connection to the chunkserver over an extended period of time. 
Third, it reduces the size of the metadata stored on the master. 
This allows us to keep the metadata in memory, which in turn brings other advantages that we will discuss in Section 2.6.1.
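
A quick back-of-envelope check on the claim about caching chunk locations for a multi-TB working set, assuming 64 MB chunks and an illustrative 4 TB working set:

```python
# 64 MB chunks: a 4 TB working set is only 65,536 chunks, so its chunk
# location entries fit comfortably in a client's memory.
CHUNK_SIZE = 64 * 2**20            # 64 MB
working_set = 4 * 2**40            # 4 TB (illustrative)
print(working_set // CHUNK_SIZE)   # 65536
```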
On the other hand, a large chunk size, even with lazy space allocation, has its disadvantages. 
A small file consists of a small number of chunks, perhaps just one. 
The chunkservers storing those chunks may become hot spots if many clients are accessing the same file. 
In practice, hot spots have not been a major issue because our applications mostly read large multi-chunk files sequentially.
However, hot spots did develop when GFS was first used by a batch-queue system: 
an executable was written to GFS as a single-chunk file and then started on hundreds of machines at the same time. 
The few chunkservers storing this executable were overloaded by hundreds of simultaneous requests. 
We fixed this problem by storing such executables with a higher replication factor and by making the batch-queue system stagger application start times. 
A potential long-term solution is to allow clients to read data from other clients in such situations.

2.6 Metadata
The master stores three major types of metadata: 
the file and chunk namespaces, the mapping from files to chunks, and the locations of each chunk’s replicas. All metadata is kept in the master’s memory. 
The first two types (namespaces and file-to-chunk mapping) are also kept persistent by logging mutations to an operation log stored on the master’s local disk and replicated on remote machines. 
Using a log allows us to update the master state simply, reliably, and without risking inconsistencies in the event of a master crash. 
The master does not store chunk location information persistently. 
Instead, it asks each chunkserver about its chunks at master startup and whenever a chunkserver joins the cluster.
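
A hypothetical sketch of the master's metadata, grouped by the three types above and by how each is made durable: the namespaces and the file-to-chunk mapping are also persisted through the operation log, while chunk locations are volatile and rebuilt by asking chunkservers. Field names are illustrative assumptions.

```python
# Hypothetical sketch of the master's in-memory metadata (Section 2.6).
from dataclasses import dataclass, field

@dataclass
class MasterMetadata:
    # Persistent: every mutation is recorded in the operation log (Section 2.6.3).
    namespace: dict = field(default_factory=dict)        # path -> file/directory attributes, access control info
    file_to_chunks: dict = field(default_factory=dict)   # path -> ordered list of 64-bit chunk handles

    # Volatile: never logged; rebuilt by polling chunkservers at startup and
    # kept current through HeartBeat messages (Section 2.6.2).
    chunk_locations: dict = field(default_factory=dict)  # chunk handle -> set of chunkserver addresses
```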

2.6.1 In-Memory Data Structures
Since metadata is stored in memory, master operations are fast. 
Furthermore, it is easy and efficient for the master to periodically scan through its entire state in the background.
This periodic scanning is used to implement chunk garbage collection, re-replication in the presence of chunkserver failures, and chunk migration to balance load and disk space usage across chunkservers. 
Sections 4.3 and 4.4 will discuss these activities further.

One potential concern for this memory-only approach is that the number of chunks and hence the capacity of the whole system is limited by how much memory the master has. 
This is not a serious limitation in practice. 
The master maintains less than 64 bytes of metadata for each 64 MB chunk. 
Most chunks are full because most files contain many chunks, only the last of which may be partially filled. 
Similarly, the file namespace data typically requires less than 64 bytes per file because it stores file names compactly using prefix compression.
If necessary to support even larger file systems, the cost of adding extra memory to the master is a small price to pay for the simplicity, reliability, performance, and flexibility we gain by storing the metadata in memory.
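
The memory argument is easy to quantify. At under 64 bytes of metadata per 64 MB chunk, and assuming mostly full chunks, a petabyte of file data costs the master roughly a gigabyte of chunk metadata:

```python
# Rough scaling of master memory for chunk metadata (assumes full 64 MB chunks).
PB = 2**50
chunks_per_pb = PB // (64 * 2**20)    # 16,777,216 chunks per PB of data
print(chunks_per_pb * 64 / 2**30)     # ~1.0 GiB of chunk metadata per PB
```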

2.6.2 Chunk Locations
The master does not keep a persistent record of which chunkservers have a replica of a given chunk. 
It simply polls chunkservers for that information at startup. 
The master can keep itself up-to-date thereafter because it controls all chunk placement and monitors chunkserver status with regular HeartBeat messages.
We initially attempted to keep chunk location information persistently at the master, but we decided that it was much simpler to request the data from chunkservers at startup, and periodically thereafter. 
This eliminated the problem of keeping the master and chunkservers in sync as chunkservers join and leave the cluster, change names, fail, restart, and so on. 
In a cluster with hundreds of servers, these events happen all too often.
Another way to understand this design decision is to realize that a chunkserver has the final word over what chunks it does or does not have on its own disks. 
There is no point in trying to maintain a consistent view of this information on the master because errors on a chunkserver may cause chunks to vanish spontaneously (e.g., a disk may go bad and be disabled) or an operator may rename a chunkserver.
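
The point that a chunkserver has the final word can be pictured as the master simply replacing its view of that server whenever a new report arrives; there is nothing to reconcile. The sketch below is a hypothetical illustration and omits heartbeat scheduling and failure detection.

```python
# Hypothetical sketch: the master's chunk-location map is whatever the
# chunkservers last reported; a fresh report replaces a server's old entry.
from collections import defaultdict

chunk_locations = defaultdict(set)   # chunk handle -> set of chunkserver addresses

def apply_chunk_report(server: str, reported_handles: set) -> None:
    # Forget anything this server used to claim (a disk may have gone bad, etc.).
    for servers in chunk_locations.values():
        servers.discard(server)
    # Record exactly what the server says it has now: it has the final word.
    for handle in reported_handles:
        chunk_locations[handle].add(server)
```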

2.6.3 Operation Log
The operation log contains a historical record of critical metadata changes. 
It is central to GFS. Not only is it the only persistent record of metadata, but it also serves as a logical time line that defines the order of concurrent operations. 
Files and chunks, as well as their versions (see Section 4.5), are all uniquely and eternally identified by the logical times at which they were created.
Since the operation log is critical, we must store it reliably and not make changes visible to clients until metadata changes are made persistent. 
Otherwise, we effectively lose the whole file system or recent client operations even if the chunks themselves survive. 
Therefore, we replicate it on multiple remote machines and respond to a client operation only after flushing the corresponding log record to disk both locally and remotely. 
The master batches several log records together before flushing thereby reducing the impact of flushing and replication on overall system throughput.
The master recovers its file system state by replaying the operation log. 
To minimize startup time, we must keep the log small. 
The master checkpoints its state whenever the log grows beyond a certain size so that it can recover by loading the latest checkpoint from local disk and replaying only the limited number of log records after that. 
The checkpoint is in a compact B-tree like form that can be directly mapped into memory and used for namespace lookup without extra parsing. 
This further speeds up recovery and improves availability.
Because building a checkpoint can take a while, the master’s internal state is structured in such a way that a new checkpoint can be created without delaying incoming mutations. 
The master switches to a new log file and creates the new checkpoint in a separate thread. 
The new checkpoint includes all mutations before the switch. 
It can be created in a minute or so for a cluster with a few million files. 
When completed, it is written to disk both locally and remotely.
Recovery needs only the latest complete checkpoint and subsequent log files. 
Older checkpoints and log files can be freely deleted, though we keep a few around to guard against catastrophes. 
A failure during checkpointing does not affect correctness because the recovery code detects and skips incomplete checkpoints.
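
The recovery discipline described above (make a mutation durable in the log before it becomes visible, checkpoint so replay stays short, recover by loading the latest checkpoint and replaying the log tail) can be summarized in a small sketch. This is a hypothetical single-machine rendering: it ignores remote replication of the log, the separate checkpointing thread, and log-file switching, and it uses JSON lines in place of the compact B-tree checkpoint format.

```python
# Hypothetical sketch of the operation-log discipline (Section 2.6.3).
import json, os

LOG, CKPT = "oplog.jsonl", "checkpoint.json"

def apply_mutation(state: dict, m: dict) -> None:
    if m["op"] == "create":
        state[m["path"]] = []                 # path -> ordered chunk handles
    elif m["op"] == "add_chunk":
        state[m["path"]].append(m["handle"])

def log_and_apply(state: dict, m: dict) -> None:
    with open(LOG, "a") as f:
        f.write(json.dumps(m) + "\n")
        f.flush()
        os.fsync(f.fileno())                  # durable before the change is visible
    apply_mutation(state, m)

def checkpoint(state: dict) -> None:
    with open(CKPT, "w") as f:
        json.dump(state, f)                   # real GFS uses a compact, mappable B-tree form
    open(LOG, "w").close()                    # sketch only; real GFS switches to a new log file

def recover() -> dict:
    state = {}
    if os.path.exists(CKPT):
        with open(CKPT) as f:
            state = json.load(f)
    if os.path.exists(LOG):
        with open(LOG) as f:
            for line in f:                    # replay the limited log tail after the checkpoint
                apply_mutation(state, json.loads(line))
    return state
```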

2.7 Consistency Model
GFS has a relaxed consistency model that supports our highly distributed applications well but remains relatively simple and efficient to implement. 
We now discuss GFS’s guarantees and what they mean to applications. 
We also highlight how GFS maintains these guarantees but leave the details to other parts of the paper.

2.7.1 Guarantees by GFS
File namespace mutations (e.g., file creation) are atomic.
They are handled exclusively by the master: namespace locking guarantees atomicity and correctness (Section 4.1);
the master’s operation log defines a global total order of these operations (Section 2.6.3).
The state of a file region after a data mutation depends on the type of mutation, whether it succeeds or fails, and whether there are concurrent mutations. 

Table 1 summarizes the result. 
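
Table 1: File region state after mutation.

                           Write                       Record Append
    Serial success         defined                     defined interspersed with inconsistent
    Concurrent successes   consistent but undefined    defined interspersed with inconsistent
    Failure                inconsistent                inconsistent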


A file region is consistent if all clients will always see the same data, regardless of which replicas they read from. 
A region is defined after a file data mutation if it is consistent and clients will see what the mutation writes in its entirety. 
When a mutation succeeds without interference from concurrent writers, the affected region is defined (and by implication consistent): 
all clients will always see what the mutation has written. 
Concurrent successful mutations leave the region undefined but consistent: 
all clients see the same data, but it may not reflect what any one mutation has written. 
Typically, it consists of mingled fragments from multiple mutations. 
A failed mutation makes the region inconsistent (hence also undefined): 
different clients may see different data at different times. 
We describe below how our applications can distinguish defined regions from undefined regions. 
The applications do not need to further distinguish between different kinds of undefined regions.
Data mutations may be writes or record appends. 
A write causes data to be written at an application-specified file offset. 
A record append causes data (the “record”) to be appended atomically at least once even in the presence of concurrent mutations, but at an offset of GFS’s choosing (Section 3.3). 
(In contrast, a “regular” append is merely a write at an offset that the client believes to be the current end of file.) 
The offset is returned to the client and marks the beginning of a defined region that contains the record.
In addition, GFS may insert padding or record duplicates in between. 
They occupy regions considered to be inconsistent and are typically dwarfed by the amount of user data.
After a sequence of successful mutations, the mutated file region is guaranteed to be defined and contain the data written by the last mutation. 
GFS achieves this by (a) applying mutations to a chunk in the same order on all its replicas (Section 3.1), and (b) using chunk version numbers to detect any replica that has become stale because it has missed mutations while its chunkserver was down (Section 4.5).
Stale replicas will never be involved in a mutation or given to clients asking the master for chunk locations. 
They are garbage collected at the earliest opportunity.
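
The version-number check in (b) amounts to a simple comparison: the master remembers the latest version it has granted for each chunk, and a replica reporting a smaller number missed mutations while its chunkserver was down and is treated as stale. A hypothetical sketch:

```python
# Hypothetical sketch: detecting stale replicas by chunk version number (Section 4.5).
current_version = {}   # chunk handle -> latest version granted by the master
replica_version = {}   # (chunk handle, chunkserver) -> version that replica holds

def is_stale(handle: int, server: str) -> bool:
    # Stale replicas are never handed to clients and are garbage collected.
    return replica_version[(handle, server)] < current_version[handle]
```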
Since clients cache chunk locations, they may read from a stale replica before that information is refreshed. 
This window is limited by the cache entry’s timeout and the next open of the file, which purges from the cache all chunk information for that file. 
Moreover, as most of our files are append-only, a stale replica usually returns a premature end of chunk rather than outdated data. 
When a reader retries and contacts the master, it will immediately get current chunk locations.
Long after a successful mutation, component failures can of course still corrupt or destroy data. 
GFS identifies failed chunkservers by regular handshakes between master and all chunkservers and detects data corruption by checksumming (Section 5.2).
Once a problem surfaces, the data is restored from valid replicas as soon as possible (Section 4.3). 
A chunk is lost irreversibly only if all its replicas are lost before GFS can react, typically within minutes. 
Even in this case, it becomes unavailable, not corrupted: 
applications receive clear errors rather than corrupt data.

2.7.2 Implications for Applications
GFS applications can accommodate the relaxed consistency model with a few simple techniques already needed for other purposes: 
relying on appends rather than overwrites, checkpointing, and writing self-validating, self-identifying records.
Practically all our applications mutate files by appending rather than overwriting. 
In one typical use, a writer generates a file from beginning to end. 
It atomically renames the file to a permanent name after writing all the data, or periodically checkpoints how much has been successfully written. 
Checkpoints may also include application-level checksums. 
Readers verify and process only the file region up to the last checkpoint, which is known to be in the defined state. 
Regardless of consistency and concurrency issues, this approach has served us well. 
Appending is far more efficient and more resilient to application failures than random writes. 
Checkpointing allows writers to restart incrementally and keeps readers from processing successfully written file data that is still incomplete from the application’s perspective.
In the other typical use, many writers concurrently append to a file for merged results or as a producer-consumer queue. 

Record append’s append-at-least-once semantics preserves each writer’s output. 
Readers deal with the occasional padding and duplicates as follows. 
Each record prepared by the writer contains extra information like checksums so that its validity can be verified. 
A reader can identify and discard extra padding and record fragments using the checksums. 
If it cannot tolerate the occasional duplicates (e.g., if they would trigger non-idempotent operations), it can filter them out using unique identifiers in the records, which are often needed anyway to name corresponding application entities such as web documents. 
These functionalities for record I/O (except duplicate removal) are in library code shared by our applications and applicable to other file interface implementations at Google. 
With that, the same sequence of records, plus rare duplicates, is always delivered to the record reader.
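
A hypothetical sketch of the self-validating, self-identifying records described above: each record carries a length, a checksum, and an application-level unique identifier, so a reader can skip padding and corrupt fragments and drop the rare duplicate. The framing (length prefix plus CRC32 plus a 64-bit record id) is an assumption for illustration; the paper does not specify the library's on-disk format.

```python
# Hypothetical record framing: 4-byte length, 4-byte CRC32, then the body,
# whose first 8 bytes are an application-level unique record id (for deduplication).
import struct, zlib

def encode_record(record_id: int, payload: bytes) -> bytes:
    body = struct.pack(">Q", record_id) + payload
    return struct.pack(">II", len(body), zlib.crc32(body)) + body

def read_valid_records(data: bytes):
    """Yield (record_id, payload), skipping padding/fragments and duplicate ids."""
    seen, pos = set(), 0
    while pos + 8 <= len(data):
        length, crc = struct.unpack_from(">II", data, pos)
        body = data[pos + 8 : pos + 8 + length]
        if length >= 8 and len(body) == length and zlib.crc32(body) == crc:
            record_id = struct.unpack_from(">Q", body)[0]
            if record_id not in seen:        # filter the occasional duplicate
                seen.add(record_id)
                yield record_id, body[8:]
            pos += 8 + length
        else:
            pos += 1                         # padding or a torn fragment: resynchronize
```

Combined with the checkpointing described above, readers only process the file region up to the last application-level checkpoint.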
