SSTable Structure and LSM Tree Indexing Explained

The Sorted String Table (SSTable) is one of the most popular formats for storing, processing, and exchanging datasets.
An SSTable is a simple abstraction to efficiently store large numbers of key-value pairs while optimizing for high throughput, sequential read/write workloads.

Unfortunately, the SSTable name itself has also been overloaded by the industry to refer to services that go well beyond just the sorted table, which has only added unnecessary confusion to what is a very simple and useful data structure on its own. Let's take a closer look under the hood of an SSTable and how LevelDB makes use of it.

SSTable: Sorted String Table

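The SSTable itself is a simple and useful data structure, but the industry has overloaded the term so heavily that it is often misunderstood.
As the name says, it is just a set of sorted key-value pairs.
As the left side of the figure in the original post shows, when the file is large we can also build a key:offset index for quickly locating a segment, but this is optional.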

What sets this structure apart from a plain set of key-value pairs is that the sort order lets it support range queries and random r/w.

A "Sorted String Table" then is exactly what it sounds like: a file which contains a set of arbitrary, sorted key-value pairs.
Duplicate keys are fine, there is no need for "padding" for keys or values, and keys and values are arbitrary blobs. Read in the entire file sequentially and you have a sorted index. Optionally, if the file is very large, we can also prepend, or create a standalone key:offset index for fast access.

That's all an SSTable is: very simple, but also a very useful way to exchange large, sorted data segments.
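To make the optional key:offset index concrete, here is a minimal sketch in Python. The record layout (two 4-byte lengths followed by the raw key and value) and the function names `write_sstable`, `read_record`, and `lookup` are assumptions made for illustration; they are not the format of any real SSTable implementation.

```python
import bisect
import io
import struct

def write_sstable(pairs, index_every=2):
    """Serialize (key, value) byte pairs in sorted order; return (data, sparse_index)."""
    buf = io.BytesIO()
    index = []  # (key, offset) for every index_every-th record
    for i, (key, value) in enumerate(sorted(pairs)):
        if i % index_every == 0:
            index.append((key, buf.tell()))
        # Length-prefixed record: keys and values need no padding.
        buf.write(struct.pack("<II", len(key), len(value)))
        buf.write(key)
        buf.write(value)
    return buf.getvalue(), index

def read_record(data, offset):
    """Decode one record; return (key, value, offset_of_next_record)."""
    klen, vlen = struct.unpack_from("<II", data, offset)
    start = offset + 8
    key = data[start:start + klen]
    value = data[start + klen:start + klen + vlen]
    return key, value, start + klen + vlen

def lookup(data, index, key):
    """Jump to the nearest indexed offset at or before key, then scan forward."""
    keys = [k for k, _ in index]
    pos = bisect.bisect_right(keys, key) - 1
    if pos < 0:
        return None  # key sorts before the first record
    offset = index[pos][1]
    while offset < len(data):
        k, v, offset = read_record(data, offset)
        if k == key:
            return v
        if k > key:
            break  # records are sorted, so the key cannot appear later
    return None
```

Because the records are sorted, the index can stay sparse: a lookup only scans the short run of records between two indexed offsets.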

SSTables and Log Structured Merge Trees

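The SSTable data structure alone still cannot support efficient range queries and random r/w workloads.
It also needs a whole set of mechanisms covering the in-memory sort, the flush to disk, compaction, and fast reads. This complete mechanism and architecture is called "The Log-Structured Merge-Tree" (LSM Tree).
The name is descriptive: the design is log based, continually producing log files in SSTable form, and those files must be merged continually to keep reads efficient.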

The figure in the original post depicts the structure of the LSM Tree and most of its operations.

We want to preserve the fast read access which SSTables give us, but we also want to support fast random writes. Turns out, we already have all the necessary pieces: random writes are fast when the SSTable is in memory (let's call it MemTable), and if the table is immutable then an on-disk SSTable is also fast to read from. Now let's introduce the following conventions:

On-disk SSTable indexes are always loaded into memory
All writes go directly to the MemTable index
Reads check the MemTable first and then the SSTable indexes
Periodically, the MemTable is flushed to disk as an SSTable
Periodically, on-disk SSTables are "collapsed together"

What have we done here? Writes are always done in memory and hence are always fast. Once the MemTable reaches a certain size, it is flushed to disk as an immutable SSTable. However, we will maintain all the SSTable indexes in memory, which means that for any read we can check the MemTable first, and then walk the sequence of SSTable indexes to find our data. Turns out, we have just reinvented "The Log-Structured Merge-Tree" (LSM Tree), described by Patrick O'Neil, and this is also the very mechanism behind "BigTable Tablets".
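The conventions above can be sketched in a few lines of Python. This is a toy model under stated assumptions: the class name `TinyLSM`, the entry-count flush threshold, and keeping "on-disk" SSTables as in-memory tuples are all illustrative choices, not how LevelDB or BigTable are implemented.

```python
import bisect

class TinyLSM:
    def __init__(self, memtable_limit=2):
        self.memtable = {}          # all writes land here first
        self.sstables = []          # immutable sorted tables, newest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Freeze the MemTable into a sorted, immutable table."""
        if self.memtable:
            self.sstables.insert(0, tuple(sorted(self.memtable.items())))
            self.memtable = {}

    def get(self, key):
        # Check the MemTable first, then walk the SSTables from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for table in self.sstables:
            # (key,) sorts just before any (key, value) tuple with the same key.
            i = bisect.bisect_left(table, (key,))
            if i < len(table) and table[i][0] == key:
                return table[i][1]
        return None
```

Walking the tables newest first is what lets a later write shadow an earlier one without touching the older table.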

LSM & SSTables: Updates, Deletes and Maintenance

This "LSM" architecture provides a number of interesting behaviors: writes are always fast regardless of the size of dataset (append-only), and random reads are either served from memory or require a quick disk seek. However, what about updates and deletes?

Once the SSTable is on disk, it is immutable, hence updates and deletes can't touch the data.
Instead, a more recent value is simply stored in MemTable in case of update, and a "tombstone" record is appended for deletes (the old data cannot be removed in place, so the key is just marked as deleted).
Because we check the indexes in sequence, future reads will find the updated or the tombstone record without ever reaching the older values!
Finally, having hundreds of on-disk SSTables is also not a great idea, hence periodically we will run a process to merge the on-disk SSTables, at which time the update and delete records will overwrite and remove the older data.
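A minimal sketch of tombstones and the merge step, in Python. The `TOMBSTONE` sentinel and the functions `get` and `compact` are hypothetical names for illustration. The sketch also assumes the merge covers every table that could hold a key; only then is it safe to drop the tombstone itself.

```python
TOMBSTONE = object()  # sentinel marking a deleted key

def get(tables, key):
    """Search tables newest first; a tombstone hides any older value."""
    for table in tables:
        for k, v in table:
            if k == key:
                return None if v is TOMBSTONE else v
    return None

def compact(tables):
    """Merge all tables (given newest first) into one sorted table."""
    merged = {}
    for table in reversed(tables):  # oldest first, so newer entries overwrite
        for key, value in table:
            merged[key] = value
    # Every table took part in the merge, so tombstones can be dropped too.
    return [(k, v) for k, v in sorted(merged.items(), key=lambda kv: kv[0])
            if v is not TOMBSTONE]
```

After compaction the deleted key and the overwritten value are physically gone, and a read touches one table instead of two.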

SSTables and LevelDB

Take an SSTable, add a MemTable and apply a set of processing conventions and what you get is a nice database engine for certain types of workloads.
In fact, Google's BigTable, Hadoop's HBase, and Cassandra amongst others are all using a variant or a direct copy of this very architecture.

Simple on the surface, but as usual, implementation details matter a great deal. Thankfully, Jeff Dean and Sanjay Ghemawat, the original contributors to the SSTable and BigTable infrastructure at Google, released LevelDB earlier last year, which is more or less an exact replica of the architecture we've described above:

SSTable under the hood, MemTable for writes
Keys and values are arbitrary byte arrays
Support for Put, Get, Delete operations
Forward and backward iteration over data
Built-in Snappy compression

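Implementation Details of the LSM Tree in LevelDB

http://blog.csdn.net/anderscloud/article/details/7182165, LevelDB Design and Implementation

Overall Architecture

Because LevelDB is open source, it reveals many more implementation details of SSTables and the LSM tree.
The core of LevelDB as a storage system is the SSTable, so let's first look at how SSTables are organized inside LevelDB.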

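In memory
MemTable, Immutable MemTable
A write first goes into the MemTable. Once the data inserted into the MemTable uses up a threshold amount of memory, the in-memory records have to be exported to a file on external storage.
LevelDB then creates a new log file and a new MemTable, and the previous MemTable becomes the Immutable MemTable. As the name suggests, its contents can no longer change: it can be read, but nothing can be written to it or deleted from it. Newly arriving data is recorded in the new log file and the new MemTable, while LevelDB's background scheduler exports the Immutable MemTable to disk as a new SSTable file.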

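On disk

The disk mainly holds SSTables organized into multiple levels (hence the name LevelDB). Each SSTable file has the .sst suffix, and the key:value pairs inside a file are sorted by key, so the data is ordered locally, within each file.

Level 0 SSTables are produced from the Immutable MemTable by minor compaction. This makes level 0 special: the key ranges of its SSTable files can overlap, because each file is a dump of a MemTable and nothing guarantees that successive dumps cover disjoint ranges.

SSTables at the other levels are produced by major compaction of the level above; for example, level 1 is produced by compacting level 0.

SSTables from one level are repeatedly compacted into SSTables at the next level in order to improve read efficiency, because a query that has to open many SSTables is clearly slow. Compaction across levels drops key-value data that is no longer valid, shrinks the data size, and reduces the number of files, which improves efficiency considerably.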

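Bigtable describes three kinds of compaction, referred to here as minor, major, and full (the Bigtable paper's own names for them are minor, merging, and major). A minor compaction exports the data in the MemTable to an SSTable file; a major compaction merges SSTable files from different levels; and a full compaction merges all SSTables. LevelDB implements two of them, minor and major.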

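Besides the SSTable files, there are three other kinds of files.

The log file, which protects against data loss

When an application writes a Key:Value record, LevelDB first appends it to the log file and, once that succeeds, inserts the record into the MemTable. At that point the write is essentially complete. A write involves only one sequential disk write and one in-memory insert, which is the main reason LevelDB writes are so fast.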
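The write path (log first, then MemTable, then a swap to an Immutable MemTable once the threshold is reached) can be sketched as follows. Everything here is an assumption for illustration: the class `WriteAheadStore`, the tab-separated text log, the numbered file names, and the entry-count threshold are not LevelDB's actual formats or API.

```python
import os

class WriteAheadStore:
    def __init__(self, directory, memtable_limit=2):
        self.directory = directory
        self.memtable_limit = memtable_limit
        self.log_number = 0
        self.memtable = {}
        self.immutable = None  # frozen MemTable waiting to be written out
        self.log = self._open_log()

    def _open_log(self):
        self.log_number += 1
        path = os.path.join(self.directory, f"{self.log_number:06d}.log")
        return open(path, "a", encoding="utf-8")

    def put(self, key, value):
        # 1. One sequential append to the log, so the write survives a crash.
        self.log.write(f"{key}\t{value}\n")
        self.log.flush()
        # 2. One in-memory insert.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._rotate()

    def _rotate(self):
        # Freeze the full MemTable and start a fresh log + MemTable pair.
        # (This sketch does not wait for an earlier frozen table to be flushed.)
        self.log.close()
        self.immutable = dict(self.memtable)
        self.memtable = {}
        self.log = self._open_log()

    def close(self):
        self.log.close()
```

The old log file only needs to be kept until the frozen table has been written out as an SSTable; after that its contents are redundant.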

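The Manifest file, which records the metadata of each SSTable file: its level and its key range

The Current file, which records the name of the current Manifest file

While LevelDB runs, compaction keeps changing the set of SSTable files: new files are created and old ones are discarded. The Manifest has to reflect these changes, and often a new Manifest file is generated to record them, so Current indicates which Manifest file is the one in effect.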

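Log File Structure

LevelDB cuts a log file into physical blocks of 32 KB, and a block is the basic unit of each read. Why split the log into blocks? Presumably for disk read efficiency.

A record that fits within a single block has type Full, like records A and C in the original figure.

A record that needs several blocks is split into fragments of type First, Middle, and Last, like record B.

The logical structure of each record is shown in a figure in the original post.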
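A sketch of how a record might be split across 32 KB blocks. The 7-byte header (4-byte checksum, 2-byte length, 1-byte type) and the numeric type codes are assumptions based on LevelDB's published log format description rather than on the text above, and `zlib.crc32` is only a stand-in for the checksum a real implementation would use. The function name `append_record` is hypothetical.

```python
import struct
import zlib

BLOCK_SIZE = 32 * 1024
HEADER_SIZE = 7  # assumed layout: checksum (4) + length (2) + type (1)
FULL, FIRST, MIDDLE, LAST = 1, 2, 3, 4

def append_record(log: bytearray, payload: bytes) -> list:
    """Append payload to log, splitting at block boundaries; return fragment types."""
    types = []
    offset = 0
    first = True
    while True:
        leftover = BLOCK_SIZE - len(log) % BLOCK_SIZE
        if leftover < HEADER_SIZE:
            # No room for a header: pad the block tail and move to the next block.
            log.extend(b"\x00" * leftover)
            leftover = BLOCK_SIZE
        available = leftover - HEADER_SIZE
        fragment = payload[offset:offset + available]
        offset += len(fragment)
        last = offset == len(payload)
        if first and last:
            rtype = FULL
        elif first:
            rtype = FIRST
        elif last:
            rtype = LAST
        else:
            rtype = MIDDLE
        checksum = zlib.crc32(bytes([rtype]) + fragment)  # stand-in checksum
        log.extend(struct.pack("<IHB", checksum, len(fragment), rtype))
        log.extend(fragment)
        types.append(rtype)
        first = False
        if last:
            return types
```

With this layout a reader can always resynchronize at a block boundary, because no header ever straddles two blocks.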

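SSTable File Structure

An SSTable, i.e. a .sst file, is likewise divided into blocks of a fixed target size.
Most of them are data blocks, but the end of the file also holds a few blocks used for managing the data.

The most important of these is the block index, which improves read efficiency considerably, especially when the SSTable is large.

The index structure, shown in a figure in the original post, is also very simple.

For further details, see the original article.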

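MemTable in Detail

LevelDB's MemTable provides interfaces for writing, deleting, and reading KV records, but in fact the MemTable has no true delete operation. Deleting the value of a key is implemented as inserting a record that carries a deletion marker for that key; the actual removal is lazy and happens during a later compaction, which drops the KV.

Note that KV pairs in LevelDB's MemTable are stored in key order, so when a new KV is inserted LevelDB has to place it at the right position to preserve that order. The MemTable class is really only an interface class: the actual work, including inserts and reads, is done by the SkipList behind it, so the core data structure of the MemTable is a SkipList.

The SkipList was invented by William Pugh, who published "Skip lists: a probabilistic alternative to balanced trees" in Communications of the ACM, June 1990, 33(6), 668-676. The paper explains the SkipList data structure and its insert and delete operations in detail. A SkipList is an alternative to balanced trees, but unlike a red-black tree it achieves balance through a randomized algorithm, which keeps inserts and deletes simple and efficient.

For a detailed introduction to the SkipList, see this article: http://www.cnblogs.com/xuqiang/archive/2011/05/22/2053516.html. LevelDB's SkipList is essentially a straightforward implementation with nothing unusual about it.

The SkipList is not only a simple way to maintain ordered data; compared with a balanced tree, it also avoids frequent node rebalancing on insert, so writes are efficient. LevelDB as a whole is a write-heavy system, and the SkipList probably plays an important part in that. Redis also uses a SkipList as an internal data structure to speed up inserts.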
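A compact skip list sketch in Python shows why inserts need no rebalancing: each new node simply draws a random height and is spliced into that many linked lists. The maximum height of 12, the 1-in-4 promotion probability, and the overwrite-on-duplicate behavior are assumptions chosen for this example, not a description of LevelDB's own SkipList class.

```python
import random

class _Node:
    def __init__(self, key, value, height):
        self.key = key
        self.value = value
        self.next = [None] * height  # one forward pointer per level

class SkipList:
    MAX_HEIGHT = 12

    def __init__(self, seed=0):
        self.head = _Node(None, None, self.MAX_HEIGHT)
        self.height = 1
        self.rng = random.Random(seed)

    def _random_height(self):
        height = 1
        while height < self.MAX_HEIGHT and self.rng.random() < 0.25:
            height += 1
        return height

    def _find_predecessors(self, key):
        """Return, for each level, the last node whose key is less than key."""
        prev = [self.head] * self.MAX_HEIGHT
        node = self.head
        for level in range(self.height - 1, -1, -1):
            while node.next[level] is not None and node.next[level].key < key:
                node = node.next[level]
            prev[level] = node
        return prev

    def insert(self, key, value):
        prev = self._find_predecessors(key)
        candidate = prev[0].next[0]
        if candidate is not None and candidate.key == key:
            candidate.value = value  # overwrite in place (a simplification)
            return
        height = self._random_height()
        self.height = max(self.height, height)
        node = _Node(key, value, height)
        for level in range(height):
            # Splice the node into the linked list at this level.
            node.next[level] = prev[level].next[level]
            prev[level].next[level] = node

    def get(self, key):
        candidate = self._find_predecessors(key)[0].next[0]
        if candidate is not None and candidate.key == key:
            return candidate.value
        return None

    def items(self):
        """Iterate in key order by following the bottom-level pointers."""
        node = self.head.next[0]
        while node is not None:
            yield node.key, node.value
            node = node.next[0]
```

Iterating the bottom level yields keys in sorted order, which is exactly what is needed when the MemTable is dumped to an SSTable.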

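SSTable Read and Write Operations

Write operations

For SSTables, inserts, updates, and deletes are all implemented as appends. A delete simply appends a "Key:deletion marker" record, like key3 in the original figure, and the actual removal happens later during background compaction.

Read operations

Reading is more involved, although the figure in the original post captures the process well. A lookup proceeds in this order:

MemTable -> Immutable MemTable -> Level 0 SSTable -> Level 1 SSTable -> ... -> Level N

This order has a sound rationale. Because every write is an append, the same key may have many versions of its value, and we only care about the newest one.
Reading in this order guarantees that the first version found is the newest.

Level 0 is a special case: the key ranges of its SSTable files can overlap, so a read first finds which level 0 files contain the key and then takes the newest value among them.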
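The lookup order can be sketched like this. The file representation (a dict with `smallest`, `largest`, and `entries`), the helper names, and the convention that level 0 files are listed newest first are assumptions for illustration. Tombstones are left out to keep the sketch short.

```python
import bisect

def make_file(entries):
    """Build a toy SSTable descriptor from a sorted list of (key, value)."""
    return {"smallest": entries[0][0], "largest": entries[-1][0], "entries": entries}

def search_file(sst, key):
    keys = [k for k, _ in sst["entries"]]
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return True, sst["entries"][i][1]
    return False, None

def get(memtable, immutable, levels, key):
    # Newest data first: the active MemTable, then the Immutable MemTable.
    for table in (memtable, immutable):
        if table is not None and key in table:
            return table[key]
    for depth, files in enumerate(levels):
        if depth == 0:
            # Level 0 ranges may overlap: try every covering file, newest first.
            candidates = [f for f in files if f["smallest"] <= key <= f["largest"]]
        else:
            # Deeper levels hold disjoint sorted ranges: at most one file matches.
            largest_keys = [f["largest"] for f in files]
            i = bisect.bisect_left(largest_keys, key)
            covers = i < len(files) and files[i]["smallest"] <= key
            candidates = [files[i]] if covers else []
        for sst in candidates:
            found, value = search_file(sst, key)
            if found:
                return value
    return None
```

The first hit wins, so an older version of a key at a deeper level is never reached once a newer one has been found.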

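How is data read from a .sst file?

LevelDB generally first checks the in-memory Cache for an entry for this file. If there is one, it reads from the cache; if not, it opens the SSTable file, loads the file's index into memory, and puts it in the Cache. The Cache now has an entry for this SSTable, but only the index is in memory. Using the index, LevelDB locates the data block that would contain the key, reads that block from the file, and compares the records in it one by one. If the key is found, the result is returned; if not, this level's SSTable file does not contain the key, and the search moves on to the SSTables of the next level.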
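A sketch of the cache step, assuming a least-recently-used eviction policy (the text above does not say which policy is used). `TableCache` here is a hypothetical class, and `load_index` is a caller-supplied function standing in for "open the file and read its index".

```python
from collections import OrderedDict

class TableCache:
    def __init__(self, capacity, load_index):
        self.capacity = capacity
        self.load_index = load_index  # hypothetical callable: file name -> index
        self.entries = OrderedDict()  # least recently used entry comes first
        self.loads = 0                # number of cache misses so far

    def get_index(self, filename):
        if filename in self.entries:
            self.entries.move_to_end(filename)  # mark as most recently used
            return self.entries[filename]
        # Cache miss: open the file and load only its index into memory.
        index = self.load_index(filename)
        self.loads += 1
        self.entries[filename] = index
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used
        return index
```

Only the index is cached here; the data block it points to still has to be read from the file on each lookup.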

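Clearly, reads on SSTables are much more complicated to handle than writes, so writes are bound to be much faster than reads; in other words, LevelDB suits applications with more writes than reads. If an application is read-heavy, sequential reads perform better, because most of the content is then found in the cache and heavy random reading is avoided as far as possible.

A sequential read loads one SSTable into memory and can then serve many KVs, because KVs are stored in key order within the .sst file. Random reads are less efficient: the cache hit rate is low and different .sst files have to be opened frequently.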


Reposted from: http://www.cnblogs.com/fxjwind/archive/2012/08/14/2638371.html

http://www.igvita.com/2012/02/06/sstable-and-log-structured-storage-leveldb/, SSTable and Log Structured Storage: LevelDB
