The Google File System, Part 5: FAULT TOLERANCE AND DIAGNOSIS

5. FAULT TOLERANCE AND DIAGNOSIS
One of our greatest challenges in designing the system is dealing with frequent component failures. 
The quality and quantity of components together make these problems more the norm than the exception: we cannot completely trust the machines, nor can we completely trust the disks. 
Component failures can result in an unavailable system or, worse, corrupted data. 
We discuss how we meet these challenges and the tools we have built into the system to diagnose problems when they inevitably occur.

5.1 High Availability
Among hundreds of servers in a GFS cluster, some are bound to be unavailable at any given time. 
We keep the overall system highly available with two simple yet effective strategies: fast recovery and replication.

5.1.1 Fast Recovery
Both the master and the chunkserver are designed to restore their state and start in seconds no matter how they terminated. 
In fact, we do not distinguish between normal and abnormal termination; 
servers are routinely shut down just by killing the process. 
Clients and other servers experience a minor hiccup as they time out on their outstanding requests, reconnect to the restarted server, and retry. 
Section 6.2.2 reports observed startup times.
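
To make the "minor hiccup" concrete, here is a minimal client-side sketch (in Python, not GFS's actual client library) of how a caller might handle a server that was killed and restarted: time out on the outstanding request, back off briefly, reconnect, and retry. The function name, timeout, and backoff values are illustrative assumptions, not GFS parameters.

```python
import socket
import time

def call_with_retry(send_request, max_attempts=5, timeout_s=2.0, backoff_s=1.0):
    """Hypothetical helper (illustration only): treat a timed-out RPC as a
    transient hiccup, wait briefly for the server to restart, and retry."""
    for _ in range(max_attempts):
        try:
            return send_request(timeout_s)        # issue the RPC with a deadline
        except (socket.timeout, ConnectionError):
            time.sleep(backoff_s)                 # servers restart in seconds, so a short wait suffices
    raise RuntimeError("server did not recover within the retry budget")
```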

5.1.2 Chunk Replication
As discussed earlier, each chunk is replicated on multiple chunkservers on different racks. 
Users can specify different replication levels for different parts of the file namespace.
The default is three. 
The master clones existing replicas as needed to keep each chunk fully replicated as chunkservers go offline or detect corrupted replicas through checksum verification (see Section 5.2). 
Although replication has served us well, we are exploring other forms of cross-server redundancy such as parity or erasure codes for our increasing read-only storage requirements. 
We expect that it is challenging but manageable to implement these more complicated redundancy schemes in our very loosely coupled system because our traffic is dominated by appends and reads rather than small random writes.
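
As a rough illustration of the cloning policy above, the sketch below (hypothetical names, not the master's real data structures) scans for chunks whose count of live, uncorrupted replicas has fallen below their target level, which defaults to three.

```python
def chunks_needing_clone(target_level, live_replicas):
    """Illustrative sketch of the master's re-replication scan: queue for
    cloning any chunk whose live replica count is below its target level."""
    return [
        chunk_id
        for chunk_id, target in target_level.items()       # chunk_id -> desired replication level
        if len(live_replicas.get(chunk_id, ())) < target    # replicas lost to failures or checksum errors
    ]
```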

5.1.3 Master Replication
The master state is replicated for reliability. 
Its operation log and checkpoints are replicated on multiple machines. 
A mutation to the state is considered committed only after its log record has been flushed to disk locally and on all master replicas. 
For simplicity, one master process remains in charge of all mutations as well as background activities such as garbage collection that change the system internally.
When it fails, it can restart almost instantly. 
If its machine or disk fails, monitoring infrastructure outside GFS starts a new master process elsewhere with the replicated operation log. 
Clients use only the canonical name of the master (e.g. gfs-test), which is a DNS alias that can be changed if the master is relocated to another machine.
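
The commit rule above can be sketched as follows. The log objects and their methods are assumptions for illustration, and a real implementation would batch and pipeline the flushes rather than issue them one by one.

```python
def commit_mutation(record, local_log, replica_logs):
    """Illustrative sketch, not the GFS code: a metadata mutation counts as
    committed only after its log record is durably flushed on the local disk
    and on the disks of all master replicas."""
    local_log.append(record)
    local_log.flush()                      # durable on the primary's own disk
    for replica in replica_logs:
        replica.append_and_flush(record)   # assumed synchronous RPC; must succeed on every replica
    # Only now may the mutation be applied and acknowledged to the client.
    return True
```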

Moreover, “shadow” masters provide read-only access to the file system even when the primary master is down. 
They are shadows, not mirrors, in that they may lag the primary slightly, typically fractions of a second. 
They enhance read availability for files that are not being actively mutated or applications that do not mind getting slightly stale results.
In fact, since file content is read from chunkservers, applications do not observe stale file content. 
What could be stale within short windows is file metadata, like directory contents or access control information.

To keep itself informed, a shadow master reads a replica of the growing operation log and applies the same sequence of changes to its data structures exactly as the primary does.
Like the primary, it polls chunkservers at startup (and infrequently thereafter) to locate chunk replicas and exchanges frequent handshake messages with them to monitor their status. 
It depends on the primary master only for replica location updates resulting from the primary’s decisions to create and delete replicas.
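
A shadow master's catch-up loop might look roughly like the sketch below; the log-tailing API and the apply() method are assumptions for illustration, not GFS interfaces.

```python
import time

def shadow_replay_loop(log_replica, metadata):
    """Hedged sketch of a shadow master: tail a replica of the growing
    operation log and apply records in exactly the order the primary wrote them."""
    while True:
        for record in log_replica.read_new_records():  # records appended since the last poll
            metadata.apply(record)                      # same deterministic apply as the primary
        time.sleep(0.1)                                 # hence metadata lags by a small window
```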

5.2 Data Integrity
Each chunkserver uses checksumming to detect corruption of stored data. 
Given that a GFS cluster often has thousands of disks on hundreds of machines, it regularly experiences disk failures that cause data corruption or loss on both the read and write paths. (See Section 7 for one cause.) 
We can recover from corruption using other chunk replicas, but it would be impractical to detect corruption by comparing replicas across chunkservers. 
Moreover, divergent replicas may be legal: 
the semantics of GFS mutations, in particular atomic record append as discussed earlier, does not guarantee identical replicas. 
Therefore, each chunkserver must independently verify the integrity of its own copy by maintaining checksums.
A chunk is broken up into 64 KB blocks. 
Each has a corresponding 32 bit checksum. 
Like other metadata, checksums are kept in memory and stored persistently with logging, separate from user data.
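
The per-block checksum layout can be illustrated with a short sketch. The paper specifies 64 KB blocks and 32-bit checksums but not the checksum function; CRC32 is used here purely as a stand-in.

```python
import zlib

BLOCK_SIZE = 64 * 1024  # 64 KB checksum blocks, as described above

def block_checksums(chunk_data: bytes) -> list[int]:
    """Compute one 32-bit checksum per 64 KB block of a chunk.
    CRC32 is an assumed stand-in for the unspecified checksum function."""
    return [
        zlib.crc32(chunk_data[off:off + BLOCK_SIZE])
        for off in range(0, len(chunk_data), BLOCK_SIZE)
    ]
```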

For reads, the chunkserver verifies the checksum of data blocks that overlap the read range before returning any data to the requester, whether a client or another chunkserver.
Therefore chunkservers will not propagate corruptions to other machines. 
If a block does not match the recorded checksum, the chunkserver returns an error to the requestor and reports the mismatch to the master. 
In response, the requestor will read from other replicas, while the master will clone the chunk from another replica. 
After a valid new replica is in place, the master instructs the chunkserver that reported the mismatch to delete its replica.
Checksumming has little effect on read performance for several reasons. 
Since most of our reads span at least a few blocks, we need to read and checksum only a relatively small amount of extra data for verification. 
GFS client code further reduces this overhead by trying to align reads at checksum block boundaries. 
Moreover, checksum lookups and comparison on the chunkserver are done without any I/O, and checksum calculation can often be overlapped with I/Os.
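
Putting the read path together, a chunkserver-style verification of every checksum block that overlaps the requested range might look like the sketch below, again with CRC32 as an assumed stand-in for the unspecified 32-bit checksum.

```python
import zlib

BLOCK_SIZE = 64 * 1024  # same 64 KB checksum blocks as above

def verified_read(chunk_data, checksums, offset, length):
    """Sketch of the read path: verify every checksum block overlapping
    [offset, offset + length) before returning any bytes, so corruption is
    never propagated to clients or other chunkservers."""
    first = offset // BLOCK_SIZE
    last = (offset + length - 1) // BLOCK_SIZE
    for b in range(first, last + 1):
        block = chunk_data[b * BLOCK_SIZE:(b + 1) * BLOCK_SIZE]
        if zlib.crc32(block) != checksums[b]:
            # Return an error to the requestor and report the mismatch to the master.
            raise IOError(f"checksum mismatch in block {b}")
    return chunk_data[offset:offset + length]
```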

Checksum computation is heavily optimized for writes that append to the end of a chunk (as opposed to writes that overwrite existing data) because they are dominant in our workloads. 
We just incrementally update the checksum for the last partial checksum block, and compute new checksums for any brand new checksum blocks filled by the append. 
Even if the last partial checksum block is already corrupted and we fail to detect it now, the new checksum value will not match the stored data, and the corruption will be detected as usual when the block is next read.
In contrast, if a write overwrites an existing range of the chunk, we must read and verify the first and last blocks of the range being overwritten, then perform the write, and finally compute and record the new checksums. 
If we do not verify the first and last blocks before overwriting them partially, the new checksums may hide corruption that exists in the regions not being overwritten.
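
The append-time optimization can be sketched with CRC32, which happens to support exactly this kind of incremental extension of the last partial block's checksum; the function and its arguments are illustrative, not the GFS implementation.

```python
import zlib

BLOCK_SIZE = 64 * 1024

def append_update_checksums(checksums, old_len, data):
    """Sketch of the append-optimized path: extend the checksum of the last
    partial block incrementally, then compute fresh checksums for any
    brand-new blocks that the appended data fills."""
    pos = 0
    partial = old_len % BLOCK_SIZE
    if partial and data:
        take = min(BLOCK_SIZE - partial, len(data))
        # zlib.crc32 accepts a running value, so the partial block's checksum
        # is extended without re-reading (or trusting) the stored old bytes.
        checksums[-1] = zlib.crc32(data[:take], checksums[-1])
        pos = take
    while pos < len(data):
        checksums.append(zlib.crc32(data[pos:pos + BLOCK_SIZE]))  # brand-new blocks
        pos += BLOCK_SIZE
    return checksums
```
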
During idle periods, chunkservers can scan and verify the contents of inactive chunks. This allows us to detect corruption in chunks that are rarely read. 
Once the corruption is detected, the master can create a new uncorrupted replica and delete the corrupted replica. 
This prevents an inactive but corrupted chunk replica from fooling the master into thinking that it has enough valid replicas of a chunk.

5.3 Diagnostic Tools
Extensive and detailed diagnostic logging has helped immeasurably in problem isolation, debugging, and performance analysis, while incurring only a minimal cost. 
Without logs, it is hard to understand transient, non-repeatable interactions between machines. 
GFS servers generate diagnostic logs that record many significant events (such as chunkservers going up and down) and all RPC requests and replies. 
These diagnostic logs can be freely deleted without affecting the correctness of the system. 
However, we try to keep these logs around as far as space permits.
The RPC logs include the exact requests and responses sent on the wire, except for the file data being read or written. 
By matching requests with replies and collating RPC records on different machines, we can reconstruct the entire interaction history to diagnose a problem. 
The logs also serve as traces for load testing and performance analysis.
The performance impact of logging is minimal (and far outweighed by the benefits) because these logs are written sequentially and asynchronously. 
The most recent events are also kept in memory and available for continuous online monitoring.
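
As a sketch of how such logs might be collated offline, the snippet below pairs request and reply records via a hypothetical rpc_id field and orders the resulting interactions by time; the record format is an assumption for illustration.

```python
from collections import defaultdict

def reconstruct_interactions(records):
    """Illustrative diagnosis sketch: collate RPC log records gathered from
    many machines and match each request with its reply to rebuild a
    time-ordered history of complete interactions."""
    by_rpc = defaultdict(dict)
    for rec in records:                           # dicts merged from every server's log
        by_rpc[rec["rpc_id"]][rec["kind"]] = rec  # kind is "request" or "reply"
    pairs = [v for v in by_rpc.values() if "request" in v and "reply" in v]
    return sorted(pairs, key=lambda v: v["request"]["timestamp"])
```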
