The Google File System, Part 4: MASTER OPERATION

4. MASTER OPERATION
The master executes all namespace operations. 
In addition, it manages chunk replicas throughout the system: 
it makes placement decisions, creates new chunks and hence replicas, and coordinates various system-wide activities to keep chunks fully replicated, to balance load across all the chunkservers, and to reclaim unused storage. 
We now discuss each of these topics.

4.1 Namespace Management and Locking
Many master operations can take a long time: 
for example, a snapshot operation has to revoke chunkserver leases on all chunks covered by the snapshot. 
We do not want to delay other master operations while they are running. 
Therefore, we allow multiple operations to be active and use locks over regions of the namespace to ensure proper serialization.
Unlike many traditional file systems, GFS does not have a per-directory data structure that lists all the files in that directory. 
Nor does it support aliases for the same file or directory (i.e., hard or symbolic links in Unix terms). 
GFS logically represents its namespace as a lookup table mapping full pathnames to metadata. 
With prefix compression, this table can be efficiently represented in memory. 
Each node in the namespace tree (either an absolute file name or an absolute directory name) has an associated read-write lock.

Each master operation acquires a set of locks before it runs. 
Typically, if it involves /d1/d2/.../dn/leaf, it will acquire read-locks on the directory names /d1, /d1/d2, ..., /d1/d2/.../dn, and either a read lock or a write lock on the full pathname /d1/d2/.../dn/leaf. 
Note that leaf may be a file or directory depending on the operation.
We now illustrate how this locking mechanism can prevent a file /home/user/foo from being created while /home/user is being snapshotted to /save/user. 
The snapshot operation acquires read locks on /home and /save, and write locks on /home/user and /save/user. 
The file creation acquires read locks on /home and /home/user, and a write lock on /home/user/foo. 
The two operations will be serialized properly because they try to obtain conflicting locks on /home/user. 
File creation does not require a write lock on the parent directory because there is no “directory”, or inode-like, data structure to be protected from modification.
The read lock on the name is sufficient to protect the parent directory from deletion.
One nice property of this locking scheme is that it allows concurrent mutations in the same directory. 
For example, multiple file creations can be executed concurrently in the same directory: 
each acquires a read lock on the directory name and a write lock on the file name. 
The read lock on the directory name suffices to prevent the directory from being deleted, renamed, or snapshotted. 
The write locks on file names serialize attempts to create a file with the same name twice.
Since the namespace can have many nodes, read-write lock objects are allocated lazily and deleted once they are not in use. 
Also, locks are acquired in a consistent total order to prevent deadlock: they are first ordered by level in the namespace tree and lexicographically within the same level.
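
The paper gives no code for this scheme; the following is a minimal Go sketch of the idea, assuming a lazily populated map from absolute pathnames to sync.RWMutex. The names lockManager and acquire are hypothetical, and a real snapshot would also lock its destination paths (e.g. /save and /save/user), which this sketch omits for brevity.

    // Hypothetical sketch of GFS-style namespace locking; not the actual implementation.
    package main

    import (
        "fmt"
        "sort"
        "strings"
        "sync"
    )

    // lockManager lazily allocates one read-write lock per absolute pathname.
    type lockManager struct {
        mu    sync.Mutex
        locks map[string]*sync.RWMutex
    }

    func newLockManager() *lockManager {
        return &lockManager{locks: map[string]*sync.RWMutex{}}
    }

    func (lm *lockManager) lockFor(path string) *sync.RWMutex {
        lm.mu.Lock()
        defer lm.mu.Unlock()
        if l, ok := lm.locks[path]; ok {
            return l
        }
        l := &sync.RWMutex{}
        lm.locks[path] = l
        return l
    }

    // acquire takes read locks on every ancestor of path and either a read or a
    // write lock on path itself. Paths are locked in a consistent total order
    // (shallower first, lexicographic within a level) to prevent deadlock.
    func (lm *lockManager) acquire(path string, writeLeaf bool) (release func()) {
        parts := strings.Split(strings.TrimPrefix(path, "/"), "/")
        var ancestors []string
        for i := 1; i < len(parts); i++ {
            ancestors = append(ancestors, "/"+strings.Join(parts[:i], "/"))
        }
        sort.Slice(ancestors, func(i, j int) bool {
            di, dj := strings.Count(ancestors[i], "/"), strings.Count(ancestors[j], "/")
            if di != dj {
                return di < dj
            }
            return ancestors[i] < ancestors[j]
        })
        for _, a := range ancestors {
            lm.lockFor(a).RLock()
        }
        leaf := lm.lockFor(path)
        if writeLeaf {
            leaf.Lock()
        } else {
            leaf.RLock()
        }
        return func() {
            if writeLeaf {
                leaf.Unlock()
            } else {
                leaf.RUnlock()
            }
            for i := len(ancestors) - 1; i >= 0; i-- {
                lm.lockFor(ancestors[i]).RUnlock()
            }
        }
    }

    func main() {
        lm := newLockManager()
        // Snapshot of /home/user: read lock on /home, write lock on /home/user.
        releaseSnap := lm.acquire("/home/user", true)
        fmt.Println("snapshot holds write lock on /home/user")
        releaseSnap()
        // Creating /home/user/foo: read locks on /home and /home/user, write lock
        // on /home/user/foo; this would block while the snapshot still held its
        // write lock on /home/user.
        releaseCreate := lm.acquire("/home/user/foo", true)
        fmt.Println("create holds write lock on /home/user/foo")
        releaseCreate()
    }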

4.2 Replica Placement
A GFS cluster is highly distributed at more levels than one. 
It typically has hundreds of chunkservers spread across many machine racks. 
These chunkservers in turn may be accessed from hundreds of clients from the same or different racks. 
Communication between two machines on different racks may cross one or more network switches. 
Additionally, bandwidth into or out of a rack may be less than the aggregate bandwidth of all the machines within the rack.
Multi-level distribution presents a unique challenge to distribute data for scalability, reliability, and availability.
The chunk replica placement policy serves two purposes:
maximize data reliability and availability, and maximize network bandwidth utilization. 
For both, it is not enough to spread replicas across machines, which only guards against disk or machine failures and fully utilizes each machine’s network bandwidth. 
We must also spread chunk replicas across racks. 
This ensures that some replicas of a chunk will survive and remain available even if an entire rack is damaged or offline (for example, due to failure of a shared resource like a network switch or power circuit). 
It also means that traffic, especially reads, for a chunk can exploit the aggregate bandwidth of multiple racks. 
On the other hand, write traffic has to flow through multiple racks, a tradeoff we make willingly.
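
As an illustration only (the paper does not spell the policy out as code), the sketch below picks replica locations rack-first and then by disk utilization; the chunkserver and pickReplicas names and the two-pass structure are assumptions, not GFS's actual algorithm.

    // Hypothetical sketch of rack-aware replica placement.
    package main

    import (
        "fmt"
        "sort"
    )

    type chunkserver struct {
        addr string
        rack string
        used float64 // fraction of disk space in use
    }

    // pickReplicas chooses up to n chunkservers, preferring racks not yet used
    // and, within that, servers with lower disk utilization.
    func pickReplicas(servers []chunkserver, n int) []chunkserver {
        sort.Slice(servers, func(i, j int) bool { return servers[i].used < servers[j].used })
        usedRacks := map[string]bool{}
        var picked []chunkserver
        // First pass: one replica per rack, for rack-level fault tolerance.
        for _, s := range servers {
            if len(picked) == n {
                return picked
            }
            if !usedRacks[s.rack] {
                usedRacks[s.rack] = true
                picked = append(picked, s)
            }
        }
        // Second pass: if there are fewer racks than replicas, fill the rest anyway.
        for _, s := range servers {
            if len(picked) == n {
                break
            }
            if !contains(picked, s.addr) {
                picked = append(picked, s)
            }
        }
        return picked
    }

    func contains(list []chunkserver, addr string) bool {
        for _, c := range list {
            if c.addr == addr {
                return true
            }
        }
        return false
    }

    func main() {
        servers := []chunkserver{
            {"cs1", "rackA", 0.40}, {"cs2", "rackA", 0.20},
            {"cs3", "rackB", 0.55}, {"cs4", "rackC", 0.30},
        }
        for _, s := range pickReplicas(servers, 3) {
            fmt.Printf("replica on %s (%s)\n", s.addr, s.rack)
        }
    }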

4.3 Creation, Re-replication, Rebalancing
Chunk replicas are created for three reasons: 
chunk creation, re-replication, and rebalancing.
When the master creates a chunk, it chooses where to place the initially empty replicas. 
It considers several factors. 
(1) We want to place new replicas on chunkservers with below-average disk space utilization. 
    Over time this will equalize disk utilization across chunkservers. 
(2) We want to limit the number of “recent” creations on each chunkserver.
Although creation itself is cheap, it reliably predicts imminent heavy write traffic because chunks are created when demanded by writes, and in our append-once-read-many workload they typically become practically read-only once they have been completely written. 
(3) As discussed above, we want to spread replicas of a chunk across racks.
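
A hedged sketch of how the first two factors might be combined into a score is shown below; the weights, the creationScore name, and the recent-creation cutoff are invented for illustration and are not taken from the paper.

    // Illustrative scoring for choosing where to place a newly created replica.
    package main

    import (
        "fmt"
        "sort"
    )

    type serverState struct {
        addr            string
        diskUsed        float64 // fraction of disk space in use
        recentCreations int     // chunks created on this server recently
        rack            string
    }

    // creationScore prefers servers with below-average disk utilization and few
    // recent creations; lower scores are better. Rack spreading (Section 4.2) is
    // applied separately in this sketch when the final set of servers is chosen.
    func creationScore(s serverState, avgDiskUsed float64, maxRecent int) (float64, bool) {
        if s.recentCreations >= maxRecent {
            return 0, false // too many recent creations: heavy writes are likely imminent
        }
        // Deviation from average utilization plus a small penalty per recent creation.
        return (s.diskUsed - avgDiskUsed) + 0.05*float64(s.recentCreations), true
    }

    func main() {
        servers := []serverState{
            {"cs1", 0.60, 1, "rackA"},
            {"cs2", 0.35, 4, "rackB"},
            {"cs3", 0.40, 0, "rackC"},
        }
        avg := (0.60 + 0.35 + 0.40) / 3
        type scored struct {
            serverState
            score float64
        }
        var eligible []scored
        for _, s := range servers {
            if sc, ok := creationScore(s, avg, 3); ok {
                eligible = append(eligible, scored{s, sc})
            }
        }
        sort.Slice(eligible, func(i, j int) bool { return eligible[i].score < eligible[j].score })
        for _, e := range eligible {
            fmt.Printf("%s score=%.2f\n", e.addr, e.score)
        }
    }
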
The master re-replicates a chunk as soon as the number of available replicas falls below a user-specified goal. 
This could happen for various reasons: 
a chunkserver becomes unavailable, it reports that its replica may be corrupted, one of its disks is disabled because of errors, or the replication goal is increased. 
Each chunk that needs to be re-replicated is prioritized based on several factors. 
One is how far it is from its replication goal. 
For example, we give higher priority to a chunk that has lost two replicas than to a chunk that has lost only one. 
In addition, we prefer to first re-replicate chunks for live files as opposed to chunks that belong to recently deleted files (see Section 4.4). 
Finally, to minimize the impact of failures on running applications, we boost the priority of any chunk that is blocking client progress.
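
The paper describes these priority factors but gives no formula; the following sketch combines them with arbitrary illustrative weights (the chunkState fields and rereplicationPriority are assumed names).

    // Illustrative re-replication priority; higher values are cloned first.
    package main

    import "fmt"

    type chunkState struct {
        id              uint64
        goalReplicas    int
        liveReplicas    int
        fileDeleted     bool // chunk belongs to a recently deleted (hidden) file
        blockingClients bool // a client is currently blocked on this chunk
    }

    func rereplicationPriority(c chunkState) int {
        missing := c.goalReplicas - c.liveReplicas
        if missing <= 0 {
            return 0 // already at or above its replication goal
        }
        p := missing * 10 // losing two replicas outranks losing only one
        if c.fileDeleted {
            p -= 5 // chunks of deleted files can wait
        }
        if c.blockingClients {
            p += 100 // unblock client progress first
        }
        return p
    }

    func main() {
        a := chunkState{id: 1, goalReplicas: 3, liveReplicas: 1}
        b := chunkState{id: 2, goalReplicas: 3, liveReplicas: 2, blockingClients: true}
        fmt.Println(rereplicationPriority(a), rereplicationPriority(b))
    }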

The master picks the highest priority chunk and “clones” it by instructing some chunkserver to copy the chunk data directly from an existing valid replica. 
The new replica is placed with goals similar to those for creation: 
equalizing disk space utilization, limiting active clone operations on any single chunkserver, and spreading replicas across racks.
To keep cloning traffic from overwhelming client traffic, the master limits the numbers of active clone operations both for the cluster and for each chunkserver. 
Additionally, each chunkserver limits the amount of bandwidth it spends on each clone operation by throttling its read requests to the source chunkserver.
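
One plausible way to keep such bookkeeping on the master, shown purely as a sketch with invented names and limits (cloneLimiter is not from the paper), is a pair of counters guarded by a mutex:

    // Assumed sketch of per-cluster and per-chunkserver clone limits.
    package main

    import (
        "fmt"
        "sync"
    )

    type cloneLimiter struct {
        mu            sync.Mutex
        clusterActive int
        perServer     map[string]int
        clusterMax    int
        perServerMax  int
    }

    // tryStartClone reserves a clone slot on the destination chunkserver, or
    // reports that the clone should be deferred so client traffic is not overwhelmed.
    func (l *cloneLimiter) tryStartClone(dst string) bool {
        l.mu.Lock()
        defer l.mu.Unlock()
        if l.clusterActive >= l.clusterMax || l.perServer[dst] >= l.perServerMax {
            return false
        }
        l.clusterActive++
        l.perServer[dst]++
        return true
    }

    func (l *cloneLimiter) finishClone(dst string) {
        l.mu.Lock()
        defer l.mu.Unlock()
        l.clusterActive--
        l.perServer[dst]--
    }

    func main() {
        l := &cloneLimiter{perServer: map[string]int{}, clusterMax: 10, perServerMax: 2}
        fmt.Println(l.tryStartClone("cs1")) // true
        fmt.Println(l.tryStartClone("cs1")) // true
        fmt.Println(l.tryStartClone("cs1")) // false: per-server limit reached
        l.finishClone("cs1")
        fmt.Println(l.tryStartClone("cs1")) // true again
    }
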
Finally, the master rebalances replicas periodically: 
it examines the current replica distribution and moves replicas for better disk space and load balancing. 
Also through this process, the master gradually fills up a new chunkserver rather than instantly swamps it with new chunks and the heavy write traffic that comes with them. 
The placement criteria for the new replica are similar to those discussed above. 
In addition, the master must also choose which existing replica to remove. 
In general, it prefers to remove those on chunkservers with below-average free space so as to equalize disk space usage.
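
A minimal sketch of that removal rule follows, with assumed types and no claim to match the real implementation.

    // Pick which existing replica to drop during rebalancing.
    package main

    import "fmt"

    type replicaLocation struct {
        addr      string
        freeSpace float64 // fraction of disk space free on that chunkserver
    }

    // replicaToRemove prefers the replica on the chunkserver with the least free
    // space, so that removing it equalizes disk usage across servers.
    func replicaToRemove(replicas []replicaLocation) replicaLocation {
        worst := replicas[0]
        for _, r := range replicas[1:] {
            if r.freeSpace < worst.freeSpace {
                worst = r
            }
        }
        return worst
    }

    func main() {
        rs := []replicaLocation{{"cs1", 0.10}, {"cs2", 0.45}, {"cs3", 0.30}, {"cs4", 0.25}}
        fmt.Println("remove replica on", replicaToRemove(rs).addr)
    }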

4.4 Garbage Collection
After a file is deleted, GFS does not immediately reclaim the available physical storage. 
It does so only lazily during regular garbage collection at both the file and chunk levels.
We find that this approach makes the system much simpler and more reliable.

4.4.1 Mechanism
When a file is deleted by the application, the master logs the deletion immediately just like other changes. 
However, instead of reclaiming resources immediately, the file is just renamed to a hidden name that includes the deletion timestamp. 
During the master’s regular scan of the file system namespace, it removes any such hidden files if they have existed for more than three days (the interval is configurable).
Until then, the file can still be read under the new, special name and can be undeleted by renaming it back to normal.
When the hidden file is removed from the namespace, its in-memory metadata is erased. 
This effectively severs its links to all its chunks.
In a similar regular scan of the chunk namespace, the master identifies orphaned chunks (i.e., those not reachable from any file) and erases the metadata for those chunks. 
In a HeartBeat message regularly exchanged with the master, each chunkserver reports a subset of the chunks it has, and the master replies with the identity of all chunks that are no longer present in the master’s metadata. 
The chunkserver is free to delete its replicas of such chunks.
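
The following Go sketch illustrates the rename-then-scan mechanism under stated assumptions: the hidden-name suffix, the namespace type, and the function names are invented, and the operation log and the chunk-namespace scan are omitted.

    // Sketch of lazy file deletion and the periodic namespace scan.
    package main

    import (
        "fmt"
        "strconv"
        "strings"
        "time"
    )

    const hiddenSuffix = ".__deleted__." // assumed marker; the paper only says "hidden name"

    type fileMeta struct {
        chunks []uint64
    }

    type namespace struct {
        files map[string]fileMeta
    }

    // deleteFile renames the entry to a hidden name that records the deletion time;
    // the chunks themselves are not touched yet.
    func (ns *namespace) deleteFile(path string, now time.Time) {
        if meta, ok := ns.files[path]; ok {
            delete(ns.files, path)
            ns.files[path+hiddenSuffix+strconv.FormatInt(now.Unix(), 10)] = meta
        }
    }

    // gcScan removes hidden files older than the retention interval (three days by
    // default, configurable). Until then the file can still be read under its
    // hidden name or undeleted by renaming it back.
    func (ns *namespace) gcScan(now time.Time, retention time.Duration) {
        for name := range ns.files {
            i := strings.LastIndex(name, hiddenSuffix)
            if i < 0 {
                continue // not a deleted file
            }
            ts, err := strconv.ParseInt(name[i+len(hiddenSuffix):], 10, 64)
            if err != nil {
                continue
            }
            if now.Sub(time.Unix(ts, 0)) > retention {
                delete(ns.files, name) // its chunks become orphans, reclaimed by a later chunk scan
            }
        }
    }

    func main() {
        ns := &namespace{files: map[string]fileMeta{"/home/user/foo": {chunks: []uint64{42}}}}
        now := time.Now()
        ns.deleteFile("/home/user/foo", now)
        ns.gcScan(now.Add(96*time.Hour), 72*time.Hour) // four days later: reclaimed
        fmt.Println(len(ns.files))                     // 0
    }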

4.4.2 Discussion
Although distributed garbage collection is a hard problem that demands complicated solutions in the context of programming languages, it is quite simple in our case. 
We can easily identify all references to chunks: they are in the file-to-chunk mappings maintained exclusively by the master.
We can also easily identify all the chunk replicas: 
they are Linux files under designated directories on each chunkserver.
Any such replica not known to the master is “garbage.”
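
The master's side of that HeartBeat exchange reduces to a set difference; a tiny sketch with assumed types and names:

    // The chunkserver reports a subset of its chunk handles; the master answers
    // with those it no longer knows about, which the chunkserver may delete.
    package main

    import "fmt"

    // knownChunks is the master-side view: every chunk handle referenced by some file.
    func orphanedChunks(reported []uint64, knownChunks map[uint64]bool) []uint64 {
        var orphans []uint64
        for _, h := range reported {
            if !knownChunks[h] {
                orphans = append(orphans, h) // not reachable from any file: garbage
            }
        }
        return orphans
    }

    func main() {
        known := map[uint64]bool{1: true, 2: true}
        fmt.Println(orphanedChunks([]uint64{1, 2, 7}, known)) // [7]
    }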

The garbage collection approach to storage reclamation offers several advantages over eager deletion. 
First, it is simple and reliable in a large-scale distributed system where component failures are common. 
Chunk creation may succeed on some chunkservers but not others, leaving replicas that the master does not know exist. 
Replica deletion messages may be lost, and the master has to remember to resend them across failures, both its own and the chunkserver’s.
Garbage collection provides a uniform and dependable way to clean up any replicas not known to be useful. 
Second, it merges storage reclamation into the regular background activities of the master, such as the regular scans of namespaces and handshakes with chunkservers. 
Thus, it is done in batches and the cost is amortized. 
Moreover, it is done only when the master is relatively free. 
The master can respond more promptly to client requests that demand timely attention. 
Third, the delay in reclaiming storage provides a safety net against accidental, irreversible deletion.
In our experience, the main disadvantage is that the delay sometimes hinders user effort to fine tune usage when storage is tight. 
Applications that repeatedly create and delete temporary files may not be able to reuse the storage right away. 
We address these issues by expediting storage reclamation if a deleted file is explicitly deleted again. 
We also allow users to apply different replication and reclamation policies to different parts of the namespace. 
For example, users can specify that all the chunks in the files within some directory tree are to be stored without replication, and any deleted files are immediately and irrevocably removed from the file system state.

4.5 Stale Replica Detection
Chunk replicas may become stale if a chunkserver fails and misses mutations to the chunk while it is down. 
For each chunk, the master maintains a chunk version number to distinguish between up-to-date and stale replicas.
Whenever the master grants a new lease on a chunk, it increases the chunk version number and informs the up-to-date replicas. 
The master and these replicas all record the new version number in their persistent state. 
This occurs before any client is notified and therefore before it can start writing to the chunk. 
If another replica is currently unavailable, its chunk version number will not be advanced. 
The master will detect that this chunkserver has a stale replica when the chunkserver restarts and reports its set of chunks and their associated version numbers. 
If the master sees a version number greater than the one in its records, the master assumes that it failed when granting the lease and so takes the higher version to be up-to-date.
The master removes stale replicas in its regular garbage collection. 
Before that, it effectively considers a stale replica not to exist at all when it replies to client requests for chunk information. 
As another safeguard, the master includes the chunk version number when it informs clients which chunkserver holds a lease on a chunk or when it instructs a chunkserver to read the chunk from another chunkserver in a cloning operation. 
The client or the chunkserver verifies the version number when it performs the operation so that it is always accessing up-to-date data.
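
A rough sketch of this version bookkeeping, with assumed field and function names and without the persistence and lease machinery:

    // Chunk version numbers distinguish up-to-date replicas from stale ones.
    package main

    import "fmt"

    type chunkRecord struct {
        version  uint64
        replicas map[string]uint64 // chunkserver -> version it reported
    }

    // grantLease bumps the chunk version and records it for the replicas that are
    // currently reachable; an unreachable replica keeps its old version and is
    // detected as stale when its chunkserver reports back.
    func (c *chunkRecord) grantLease(reachable []string) {
        c.version++
        for _, addr := range reachable {
            c.replicas[addr] = c.version
        }
    }

    // onChunkserverReport compares the reported version with the master's record.
    func (c *chunkRecord) onChunkserverReport(addr string, reported uint64) {
        switch {
        case reported < c.version:
            delete(c.replicas, addr) // stale: removed later by regular garbage collection
        case reported > c.version:
            // The master assumes it failed when granting a lease and takes the
            // higher version to be up-to-date.
            c.version = reported
            c.replicas[addr] = reported
        default:
            c.replicas[addr] = reported
        }
    }

    func main() {
        c := &chunkRecord{version: 4, replicas: map[string]uint64{"cs1": 4, "cs2": 4, "cs3": 4}}
        c.grantLease([]string{"cs1", "cs2"}) // cs3 is down and misses the bump
        c.onChunkserverReport("cs3", 4)      // detected as stale on restart
        fmt.Println(c.version, c.replicas)
    }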
