Bigtable: A Distributed Storage System for Structured Data, Part 5: Implementation

5 Implementation
The Bigtable implementation has three major components: 
a library that is linked into every client, one master server, and many tablet servers. 
Tablet servers can be dynamically added (or removed) from a cluster to accommodate changes in workloads.

The master is responsible for assigning tablets to tablet servers, detecting the addition and expiration of tablet servers, balancing tablet-server load, and garbage collection of files in GFS. 
In addition, it handles schema changes such as table and column family creations.
Each tablet server manages a set of tablets (typically we have somewhere between ten and a thousand tablets per tablet server). 
The tablet server handles read and write requests to the tablets that it has loaded, and also splits tablets that have grown too large.
As with many single-master distributed storage systems, client data does not move through the master: 
clients communicate directly with tablet servers for reads and writes. 
Because Bigtable clients do not rely on the master for tablet location information, most clients never communicate with the master. 
As a result, the master is lightly loaded in practice.
A Bigtable cluster stores a number of tables. 
Each table consists of a set of tablets, and each tablet contains all data associated with a row range. 
Initially, each table consists of just one tablet. As a table grows, it is automatically split into multiple tablets, each approximately 100-200 MB in size by default.
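
To make the division of labor concrete, here is a minimal sketch of the components; every class and method name below is illustrative, not Bigtable's actual interface. The point it encodes is that client reads and writes go straight to a tablet server, while the master only manages assignment.

    # Illustrative sketch only: the master assigns tablets, tablet servers
    # serve data, and clients never route data through the master.
    class TabletServer:
        def __init__(self):
            self.tablets = {}                     # tablet id -> row data

        def load_tablet(self, tablet_id):         # master-initiated
            self.tablets[tablet_id] = {}

        def write(self, tablet_id, row, value):   # called by clients directly
            self.tablets[tablet_id][row] = value

        def read(self, tablet_id, row):           # called by clients directly
            return self.tablets[tablet_id].get(row)

    class Master:
        def __init__(self, servers):
            self.servers = servers                # live tablet servers

        def assign(self, tablet_id):
            # Crude load balancing: pick the least-loaded server. The load
            # request carries no client data, only the tablet to load.
            server = min(self.servers, key=lambda s: len(s.tablets))
            server.load_tablet(tablet_id)
            return server

A client that has already located a tablet calls read and write on the owning server; the master participates only when assignments change, which is why it stays lightly loaded.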

5.1 Tablet Location
We use a three-level hierarchy analogous to that of a B+-tree to store tablet location information (Figure 4).

Figure 4: Tablet location hierarchy.
The first level is a file stored in Chubby that contains the location of the root tablet. 
The root tablet contains the location of all tablets in a special METADATA table.
Each METADATA tablet contains the location of a set of user tablets. 
The root tablet is just the first tablet in the METADATA table, but is treated specially—it is never split—to ensure that the tablet location hierarchy has no more than three levels.
The METADATA table stores the location of a tablet under a row key that is an encoding of the tablet’s table identifier and its end row. 
Each METADATA row stores approximately 1KB of data in memory. 
With a modest limit of 128 MB METADATA tablets, our three-level location scheme is sufficient to address 2^34 tablets (or 2^61 bytes in 128 MB tablets).
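
The arithmetic behind those numbers is worth spelling out; here is a quick check using the ~1 KB-per-row and 128 MB figures above:

    # Capacity check for the three-level location scheme.
    ROW_BYTES    = 2**10                  # ~1 KB of METADATA per tablet
    TABLET_BYTES = 2**27                  # 128 MB METADATA tablet limit

    rows_per_metadata_tablet = TABLET_BYTES // ROW_BYTES      # 2**17 locations
    # Root tablet -> up to 2**17 METADATA tablets, each -> 2**17 user tablets:
    addressable_tablets = rows_per_metadata_tablet ** 2       # 2**34 tablets
    addressable_bytes   = addressable_tablets * TABLET_BYTES  # at 128 MB each

    assert addressable_tablets == 2**34
    assert addressable_bytes == 2**61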
The client library caches tablet locations. 
If the client does not know the location of a tablet, or if it discovers that cached location information is incorrect, then it recursively moves up the tablet location hierarchy.
If the client’s cache is empty, the location algorithm requires three network round-trips, including one read from Chubby. 
If the client’s cache is stale, the location algorithm could take up to six round-trips, because stale cache entries are only discovered upon misses (assuming that METADATA tablets do not move very frequently).
Although tablet locations are stored in memory, so no GFS accesses are required, we further reduce this cost in the common case by having the client library prefetch tablet locations: 
it reads the metadata for more than one tablet whenever it reads the METADATA table.
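
As a sketch of how a client library might implement this lookup (the data structures here are stand-ins; real METADATA row keys encode a table identifier plus end row, and the real library also handles stale entries and prefetching as described above):

    import bisect

    def find_covering(tablet, row):
        """Return the entry for the first end row >= row, mirroring the
        end-row keying of the METADATA table."""
        end_rows = sorted(tablet)
        return tablet[end_rows[bisect.bisect_left(end_rows, row)]]

    def locate(cache, root_tablet, row):
        if row in cache:                        # common case: no round-trips
            return cache[row]
        meta = find_covering(root_tablet, row)  # RPC to the root tablet
        loc  = find_covering(meta, row)         # RPC to a METADATA tablet
        cache[row] = loc                        # (a cold cache also pays one
        return loc                              #  Chubby read for the root)

    metadata_tablet = {"g": "ts-17", "~": "ts-4"}   # end row -> server
    root_tablet = {"~": metadata_tablet}
    cache = {}
    print(locate(cache, root_tablet, "h"))      # ts-4, via two lookups
    print(locate(cache, root_tablet, "h"))      # ts-4, from the cache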

We also store secondary information in the METADATA table, including a log of all events pertaining to each tablet (such as when a server begins serving it). 
This information is helpful for debugging and performance analysis.

5.2 Tablet Assignment

Each tablet is assigned to one tablet server at a time. 
The master keeps track of the set of live tablet servers, and the current assignment of tablets to tablet servers, including which tablets are unassigned. 
When a tablet is unassigned, and a tablet server with sufficient room for the tablet is available, the master assigns the tablet by sending a tablet load request to the tablet server.
Bigtable uses Chubby to keep track of tablet servers.
When a tablet server starts, it creates, and acquires an exclusive lock on, a uniquely-named file in a specific Chubby directory. 
The master monitors this directory (the servers directory) to discover tablet servers.
A tablet server stops serving its tablets if it loses its exclusive lock: e.g., due to a network partition that caused the server to lose its Chubby session. 
(Chubby provides an efficient mechanism that allows a tablet server to check whether it still holds its lock without incurring network traffic.) 
A tablet server will attempt to reacquire an exclusive lock on its file as long as the file still exists. 
If the file no longer exists, then the tablet server will never be able to serve again, so it kills itself. 
Whenever a tablet server terminates (e.g., because the cluster management system is removing the tablet server’s machine from the cluster), it attempts to release its lock so that the master will reassign its tablets more quickly.
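
The server side of this protocol can be sketched as follows; the LockService class is an in-memory stand-in, since Chubby's real API is richer (sessions, events, lock sequencers):

    import os, uuid

    class LockService:                        # in-memory stand-in for Chubby
        def __init__(self):
            self.files, self.locks = set(), set()
        def create(self, path): self.files.add(path)
        def exists(self, path): return path in self.files
        def try_acquire(self, path):
            if path in self.files and path not in self.locks:
                self.locks.add(path); return True
            return False
        def release(self, path): self.locks.discard(path)
        def delete(self, path):
            self.files.discard(path); self.locks.discard(path)

    class TabletServerLifecycle:
        SERVERS_DIR = "/bigtable/servers"     # the "servers directory"

        def __init__(self, chubby):
            self.chubby = chubby
            self.path = os.path.join(self.SERVERS_DIR, uuid.uuid4().hex)

        def start(self):
            self.chubby.create(self.path)     # uniquely named file...
            assert self.chubby.try_acquire(self.path)  # ...locked exclusively

        def on_lock_lost(self):               # e.g. Chubby session expired
            if self.chubby.exists(self.path):
                self.chubby.try_acquire(self.path)  # try to resume serving
            else:
                raise SystemExit("server file gone; can never serve again")

        def shutdown(self):
            self.chubby.release(self.path)    # lets the master reassign sooner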

The master is responsible for detecting when a tablet server is no longer serving its tablets, and for reassigning those tablets as soon as possible. 
To detect when a tablet server is no longer serving its tablets, the master periodically asks each tablet server for the status of its lock. 
If a tablet server reports that it has lost its lock, or if the master was unable to reach a server during its last several attempts, the master attempts to acquire an exclusive lock on the server’s file. 
If the master is able to acquire the lock, then Chubby is live and the tablet server is either dead or having trouble reaching Chubby, so the master ensures that the tablet server can never serve again by deleting its server file. 
Once a server’s file has been deleted, the master can move all the tablets that were previously assigned to that server into the set of unassigned tablets. 
To ensure that a Bigtable cluster is not vulnerable to networking issues between the master and Chubby, the master kills itself if its Chubby session expires. 
However, as described above, master failures do not change the assignment of tablets to tablet servers.
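
A sketch of the master's side of the same protocol, reusing the stand-in lock service above (the probe callback models the periodic lock-status RPC and returns None for an unreachable server):

    def detect_failed_servers(chubby, assignments, probe, unassigned):
        """assignments maps a server's Chubby file path to its tablet set."""
        for path, tablets in list(assignments.items()):
            status = probe(path)             # "has-lock", "lost-lock", or None
            if status == "has-lock":
                continue                     # healthy; nothing to do
            if chubby.try_acquire(path):     # Chubby is live, the server is not
                chubby.delete(path)          # server may never serve again
                unassigned |= tablets        # tablets become reassignable
                del assignments[path]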

When a master is started by the cluster management system, it needs to discover the current tablet assignments before it can change them. 
The master executes the following steps at startup.
(1) The master grabs a unique master lock in Chubby, which prevents concurrent master instantiations. 
(2) The master scans the servers directory in Chubby to find the live servers.
(3) The master communicates with every live tablet server to discover what tablets are already assigned to each server. 
(4) The master scans the METADATA table to learn the set of tablets. 
Whenever this scan encounters a tablet that is not already assigned, the master adds the tablet to the set of unassigned tablets, which makes the tablet eligible for tablet assignment.
One complication is that the scan of the METADATA table cannot happen until the METADATA tablets have been assigned. 
Therefore, before starting this scan (step 4), the master adds the root tablet to the set of unassigned tablets if an assignment for the root tablet was not discovered during step 3. 

This addition ensures that the root tablet will be assigned. 
Because the root tablet contains the names of all METADATA tablets, the master knows about all of them after it has scanned the root tablet.
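
Put together, startup can be sketched like this (structures and names are illustrative; ROOT_TABLET stands for the root tablet's identifier, and live_servers comes from the servers-directory scan):

    ROOT_TABLET = ("METADATA", "root")       # illustrative identifier

    def master_startup(chubby, live_servers, scan_metadata):
        chubby.try_acquire("/bigtable/master-lock")        # (1) unique master
        assigned = set()
        for server in live_servers:                        # (2) + (3): ask each
            assigned |= set(server.list_loaded_tablets())  # live server
        unassigned = set()
        if ROOT_TABLET not in assigned:      # root first, so the METADATA
            unassigned.add(ROOT_TABLET)      # scan below has a tablet to read
        for tablet in scan_metadata():                     # (4) scan METADATA
            if tablet not in assigned:
                unassigned.add(tablet)
        return unassigned
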
The set of existing tablets only changes when a table is created or deleted, two existing tablets are merged to form one larger tablet, or an existing tablet is split into two smaller tablets. 
The master is able to keep track of these changes because it initiates all but the last.
Tablet splits are treated specially since they are initiated by a tablet server. 
The tablet server commits the split by recording information for the new tablet in the METADATA table. 
When the split has committed, it notifies the master. 
In case the split notification is lost (either because the tablet server or the master died), the master detects the new tablet when it asks a tablet server to load the tablet that has now split. 
The tablet server will notify the master of the split, because the tablet entry it finds in the METADATA table will specify only a portion of the tablet that the master asked it to load.
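
The lost-notification case amounts to a range check at load time; a sketch (lookup_metadata and the record layout are illustrative):

    # When the master asks a server to load [start_row, end_row), the server
    # consults METADATA. If the entry keyed by end_row begins after
    # start_row, the range was split earlier and the master must be told.
    def handle_load_request(lookup_metadata, table_id, start_row, end_row):
        entry = lookup_metadata(table_id, end_row)   # rows keyed by end row
        if entry["start_row"] > start_row:
            return ("notify-split", entry)           # covers only a suffix
        return ("load", entry)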

5.3 Tablet Serving
The persistent state of a tablet is stored in GFS, as illustrated in Figure 5. 


Updates are committed to a commit log that stores redo records. Of these updates, the recently committed ones are stored in memory in a sorted buffer called a memtable; the older updates are stored in a sequence of SSTables. 
To recover a tablet, a tablet server reads its metadata from the METADATA table. 
This metadata contains the list of SSTables that comprise a tablet and a set of redo points, which are pointers into any commit logs that may contain data for the tablet. 
The server reads the indices of the SSTables into memory and reconstructs the memtable by applying all of the updates that have committed since the redo points.
When a write operation arrives at a tablet server, the server checks that it is well-formed, and that the sender is authorized to perform the mutation. 
Authorization is performed by reading the list of permitted writers from a Chubby file (which is almost always a hit in the Chubby client cache). 
A valid mutation is written to the commit log. 
Group commit is used to improve the throughput of lots of small mutations. 
After the write has been committed, its contents are inserted into the memtable.
When a read operation arrives at a tablet server, it is similarly checked for well-formedness and proper authorization. 
A valid read operation is executed on a merged view of the sequence of SSTables and the memtable.
Since the SSTables and the memtable are lexicographically sorted data structures, the merged view can be formed efficiently.
Incoming read and write operations can continue while tablets are split and merged.
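
A toy version of the serving path, where dicts and sorted lists stand in for the real commit log, memtable, and GFS-resident SSTables:

    import heapq

    class Tablet:
        def __init__(self):
            self.log = []              # commit log of redo records
            self.redo_point = 0        # log entries before this are in SSTables
            self.memtable = {}         # recent, committed updates
            self.sstables = []         # older updates, newest list first

        def write(self, key, value):   # after well-formedness + auth checks
            self.log.append((key, value))        # commit to the log first
            self.memtable[key] = value           # then apply to the memtable

        def recover(self):
            # Rebuild the memtable by replaying the log from the redo point.
            self.memtable = {}
            for key, value in self.log[self.redo_point:]:
                self.memtable[key] = value

        def scan(self):
            # Merged view: all sources are sorted, so heapq.merge is cheap;
            # it is stable, so the newest source wins on duplicate keys.
            sources = [sorted(self.memtable.items())] + self.sstables
            seen, merged = set(), []
            for key, value in heapq.merge(*sources, key=lambda kv: kv[0]):
                if key not in seen:
                    seen.add(key)
                    merged.append((key, value))
            return merged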

5.4 Compactions
As write operations execute, the size of the memtable increases. 
When the memtable size reaches a threshold, the memtable is frozen, a new memtable is created, and the frozen memtable is converted to an SSTable and written to GFS. 
This minor compaction process has two goals:
it shrinks the memory usage of the tablet server, and it reduces the amount of data that has to be read from the commit log during recovery if this server dies. 
Incoming read and write operations can continue while compactions occur.
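
Continuing the toy Tablet above, a minor compaction can be sketched as follows (when to trigger it, and the write to GFS, are elided):

    def minor_compaction(tablet):
        frozen = tablet.memtable                 # freeze the current memtable
        tablet.memtable = {}                     # writes go to a fresh one
        sstable = sorted(frozen.items())         # persist as a sorted,
        tablet.sstables.insert(0, sstable)       # immutable SSTable (written
        tablet.redo_point = len(tablet.log)      # to GFS in reality); recovery
                                                 # now replays less of the log
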
Every minor compaction creates a new SSTable. 
If this behavior continued unchecked, read operations might need to merge updates from an arbitrary number of SSTables. 
Instead, we bound the number of such files by periodically executing a merging compaction in the background. 
A merging compaction reads the contents of a few SSTables and the memtable, and writes out a new SSTable. 
The input SSTables and memtable can be discarded as soon as the compaction has finished.
A merging compaction that rewrites all SSTables into exactly one SSTable is called a major compaction.
SSTables produced by non-major compactions can contain special deletion entries that suppress deleted data in older SSTables that are still live. 
A major compaction,on the other hand, produces an SSTable that contains no deletion information or deleted data. 
Bigtable cycles through all of its tablets and regularly applies major compactions to them. 
These major compactions allow Bigtable to reclaim resources used by deleted data, and also allow it to ensure that deleted data disappears from the system in a timely fashion, which is important for services that store sensitive data.
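
In the same toy representation, deletions become tombstone entries and both compaction flavors share one merge routine (the TOMBSTONE marker and the newest-first ordering are artifacts of this sketch):

    import heapq

    TOMBSTONE = None          # stand-in marker for a deletion entry

    def merging_compaction(sstables, k=None):
        """Merge the k newest SSTables into one; k=None means a major
        compaction over all of them."""
        major = k is None
        inputs = sstables if major else sstables[:k]
        rest = [] if major else sstables[k:]
        merged, seen = [], set()
        for key, value in heapq.merge(*inputs, key=lambda kv: kv[0]):
            if key in seen:
                continue                 # stable merge: the newest entry wins
            seen.add(key)
            if value is TOMBSTONE and major:
                continue   # only a major compaction may drop tombstones:
                           # no older SSTable remains to resurrect the key
            merged.append((key, value))
        return [merged] + rest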
