RocksDB Source Code Analysis: The Data Structures Beneath the Interface


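RocksDB is a very popular key-value database and a typical representative of LSM-tree storage engines. Many distributed (NewSQL) databases and graph databases use RocksDB as their underlying storage engine, and it does well on both stability and performance.

HugeGraph, a graph database, also supports RocksDB as a storage backend. HugeGraph is written in Java while RocksDB is written in C++; fortunately, an official Java JNI binding is available and can be used directly. RocksDB's functionality is tightly focused: it can be thought of as providing a set of Maps for storing and retrieving key-value pairs, so the core interface is essentially put, get, and scan, and is fairly simple to use. Beneath that simple interface, however, lies a very complex internal structure. This article analyzes several of the core structures underneath the interface.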

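The most frequently used RocksDB interfaces:

  • RocksDB: the database instance, the entry point for all operations
  • ColumnFamilyHandle: the CF descriptor, similar to a file descriptor; it can be thought of as a pointer to a Map
  • RocksIterator: the query iterator, the interface for scan queries

A few questions first:

  1. How do Iterator and ColumnFamilyHandle organize MemTable, ImmMemTable, Manifest, SST, and so on behind the scenes?
  2. To look up the values in a given key range of a CF, how is a specific position in a specific file located?
  3. How is the lifetime of an Iterator managed? After the CF is closed, how does the Iterator remain usable instead of being freed?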

Key classes and their relationships: ColumnFamilyHandle

ColumnFamilyHandle <--- ColumnFamilyHandleImpl ---+ ColumnFamilyData  ---+ SuperVersion   -----------------------+ Version current ------------------+ uint64_t version_number
                                                  + MemTable mem         + MemTableListVersion imm               + ColumnFamilyData cfd
                                                  + MemTableList imm     + MemTable mem why?                     + VersionStorageInfo storage_info (SSTs meta)
                                                  + Ref refs             + Ref refs
                                                  + ColumnFamilyOptions
                                                  + ColumnFamilyData (next & prev)

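ColumnFamilyHandle is the descriptor of a CF (a CF is similar to a table). The handle of each CF is obtained when the CF is created or when the database is opened, and every operation on the table, such as put, get, or scan, goes through its ColumnFamilyHandle, for example: rocksdb.put(cfHandle, key, value).

A ColumnFamilyHandle can be obtained as in the following sample code: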

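// Create a new CF on an already open database (cfName is a byte[]):
ColumnFamilyHandle cfHandle = rocksdb.createColumnFamily(new ColumnFamilyDescriptor(cfName));

// Or open the database with its existing CFs; the handles are appended to the (initially empty) cfHandles list:
RocksDB rocksdb = RocksDB.open(dbOptions, path, cfDescriptors, cfHandles);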

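The ColumnFamilyData beneath ColumnFamilyHandle manages the CF's state and resources, including the memtable and the immutable memtables, and manages the CF's metadata through SuperVersion, such as the current version number and the SST file information. All ColumnFamilyData objects are kept in the ColumnFamilySet of the db instance.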
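To make the role of SuperVersion more concrete, here is a minimal sketch of the idea, not RocksDB's implementation: every type and method name below is a hypothetical stand-in. It shows why a SuperVersion carries its own mem pointer next to imm and current: a reader that grabbed an older SuperVersion keeps a consistent view even after the active memtable has been switched.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, heavily simplified stand-ins; none of these are RocksDB types.
struct MemTableSketch { std::string name; };
struct VersionSketch { std::vector<std::string> sst_files; };

// One consistent view of a CF: active memtable, immutable memtables, SST set.
struct SuperVersionSketch {
  std::shared_ptr<MemTableSketch> mem;
  std::vector<std::shared_ptr<MemTableSketch>> imm;
  std::shared_ptr<VersionSketch> current;
  uint64_t version_number = 0;
};

class ColumnFamilyDataSketch {
 public:
  ColumnFamilyDataSketch() {
    mem_ = NewMemTable();
    current_ = std::make_shared<VersionSketch>();
    InstallSuperVersion();
  }

  // Freeze the active memtable: it joins the immutable list and a fresh
  // memtable takes its place. A new SuperVersion is installed afterwards.
  void SwitchMemtable() {
    imm_.push_back(mem_);
    mem_ = NewMemTable();
    InstallSuperVersion();
  }

  // Readers take a shared reference to the view that is current right now.
  std::shared_ptr<const SuperVersionSketch> GetSuperVersion() const {
    return super_version_;
  }

 private:
  std::shared_ptr<MemTableSketch> NewMemTable() {
    auto m = std::make_shared<MemTableSketch>();
    m->name = "mem-" + std::to_string(next_memtable_id_++);
    return m;
  }

  // Snapshot mem/imm/current into a new view and bump the version number.
  void InstallSuperVersion() {
    auto sv = std::make_shared<SuperVersionSketch>();
    sv->mem = mem_;
    sv->imm = imm_;
    sv->current = current_;
    sv->version_number = ++super_version_number_;
    super_version_ = std::move(sv);
  }

  uint64_t next_memtable_id_ = 0;
  uint64_t super_version_number_ = 0;
  std::shared_ptr<MemTableSketch> mem_;
  std::vector<std::shared_ptr<MemTableSketch>> imm_;
  std::shared_ptr<VersionSketch> current_;
  std::shared_ptr<SuperVersionSketch> super_version_;
};
```

A reader that obtained the first view still sees "mem-0" as its active memtable after SwitchMemtable(), while new readers see "mem-0" in the immutable list and "mem-1" as active.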

Key classes and their relationships: Iterator

Iterator <--- ArenaWrappedDBIter ---+ DBIter db_iter ----------+ InternalIterator iter <------- MergingIterator ---+ vector<InternalIterator> children  ---+ MemTableIterator memtable
                                    + Arena arena              + bool valid                                        + MergerMinIterHeap minHeap             + MemTableIterator immutables
                                    + uint64_t sv_number       + IterKey saved_key                                 + InternalIterator current              + BlockBasedTableIterator level 0
                                                               + string saved_value                                                                        + LevelIterator level 1~n
                                                               + SequenceNumber sequence
                                                               + iterate_lower_bound, iterate_upper_bound, prefix_start_key
                                                               + user_comparator, merge_operator, prefix_extractor
                                                               + LocalStatistics local_stats

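For queries, the outermost call is RocksDB.newIterator(cfHandle), which returns an Iterator that is then used to query the data of the given CF. Apart from the point lookup get, which fetches a value by key, all other queries are built on top of Iterator, including full table scans, range queries (greater than, less than, interval), and prefix queries. What an Iterator covers and how its lifetime is managed are both fairly complex, and the read path touches most of RocksDB's key concepts.

Building the outermost iterator: call stack of RocksDB.newIterator(cfHandle):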

ArenaWrappedDBIter::Init 0x7feefb8f5c00, allow_refresh_=1
ArenaWrappedDBIter::Init()
 0   librocksdbjni3300438414871377681.jnilib 0x0000000121dc8236 _ZN7rocksdb18ArenaWrappedDBIter4InitEPNS_3EnvERKNS_11ReadOptionsERKNS_18ImmutableCFOptionsERKyyyPNS_12ReadCallbackEbb + 214
 1   librocksdbjni3300438414871377681.jnilib 0x0000000121dc85ba _ZN7rocksdb25NewArenaWrappedDbIteratorEPNS_3EnvERKNS_11ReadOptionsERKNS_18ImmutableCFOptionsERKyyyPNS_12ReadCallbackEPNS_6DBImplEPNS_16ColumnFamilyDataEbb + 266
 2   librocksdbjni3300438414871377681.jnilib 0x0000000121d640f9 _ZN7rocksdb6DBImpl11NewIteratorERKNS_11ReadOptionsEPNS_18ColumnFamilyHandleE + 617
 3   librocksdbjni3300438414871377681.jnilib 0x0000000121c8757e Java_org_rocksdb_RocksDB_iteratorCF__JJ + 78

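RocksDB.newIterator() returns an ArenaWrappedDBIter object. ArenaWrappedDBIter is essentially a shell: the DBIter it holds carries a large number of state variables (the tallest column in the diagram above, e.g. the key and value currently being read) as well as an inner InternalIterator. DBIter forwards each query to the underlying InternalIterator, which returns KV pairs as raw binary data. DBIter then parses that data into meaningful parts: the sequence number (version) and the operation type, which are packed together into the trailing 8 bytes of the key (the sequence occupies the upper 7 bytes of that packed 64-bit value and the type the lowest 1 byte; types include the ordinary Value key, the Delete key, the Merge key, and so on), and the actual user key. For a Delete key, DBIter has to skip ahead and read the next key; for a Merge key, it has to merge the new and old values. Only after this processing is the result returned.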
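The trailer layout described above can be illustrated with a small standalone sketch. It assumes the 8-byte trailer is the value (sequence << 8) | type written as a little-endian fixed-width integer, and the type constants 0x0 / 0x1 / 0x2 for deletion / value / merge; the struct and function names are hypothetical and only loosely mirror what DBIter's key parsing does.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Assumed type tags (illustrative subset only).
enum ValueTypeSketch : uint8_t { kDeletion = 0x0, kValue = 0x1, kMerge = 0x2 };

struct ParsedKeySketch {
  std::string user_key;
  uint64_t sequence = 0;
  uint8_t type = 0;
};

// internal_key = user_key + 8-byte trailer holding (sequence << 8) | type.
std::string AppendTrailer(const std::string& user_key, uint64_t sequence,
                          uint8_t type) {
  const uint64_t packed = (sequence << 8) | type;
  std::string out = user_key;
  for (int i = 0; i < 8; ++i) {  // little-endian byte order (assumption)
    out.push_back(static_cast<char>((packed >> (8 * i)) & 0xff));
  }
  return out;
}

// Splits an internal key back into user key, sequence and type.
// Returns false if the input is too short to contain a trailer.
bool ParseInternalKeySketch(const std::string& internal_key,
                            ParsedKeySketch* out) {
  if (internal_key.size() < 8) return false;
  const size_t n = internal_key.size() - 8;
  uint64_t packed = 0;
  for (int i = 0; i < 8; ++i) {
    const uint64_t byte = static_cast<unsigned char>(internal_key[n + i]);
    packed |= byte << (8 * i);
  }
  out->user_key = internal_key.substr(0, n);
  out->sequence = packed >> 8;                      // upper 56 bits
  out->type = static_cast<uint8_t>(packed & 0xff);  // lowest 8 bits
  return true;
}
```

Because the sequence sits in the high bits, comparing two trailers as integers orders entries of the same user key by sequence first, which is what lets an iterator find the newest visible version of a key.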

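The Arena is where DBIter and its inner InternalIterator are stored. Its purpose is to avoid fragmenting memory into many small allocations: DBIter has a large number of members, so the Arena requests one large region to hold all of them instead of allocating a small piece of memory for each member.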
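The arena idea can be sketched as a simple bump allocator combined with placement new. This is an illustrative sketch under stated assumptions, not RocksDB's Arena class: ArenaSketch, NewInArena, and the two iterator structs are hypothetical, and only trivially destructible objects are placed in the arena so that no destructor calls are needed.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <new>
#include <utility>
#include <vector>

// Hands out slices of large blocks instead of one heap allocation per object.
class ArenaSketch {
 public:
  explicit ArenaSketch(size_t block_size = 4096) : block_size_(block_size) {}

  // align must not exceed alignof(std::max_align_t) in this sketch.
  void* Allocate(size_t bytes, size_t align = alignof(std::max_align_t)) {
    size_t offset = (used_ + align - 1) / align * align;  // round up
    if (blocks_.empty() || offset + bytes > capacity_) {
      // Current block exhausted: start a new one (oversized if needed).
      capacity_ = bytes > block_size_ ? bytes : block_size_;
      blocks_.emplace_back(new char[capacity_]);
      offset = 0;
    }
    used_ = offset + bytes;
    return blocks_.back().get() + offset;
  }

  size_t BlockCount() const { return blocks_.size(); }

 private:
  size_t block_size_;
  size_t capacity_ = 0;  // size of the current block
  size_t used_ = 0;      // bytes consumed in the current block
  std::vector<std::unique_ptr<char[]>> blocks_;
};

// Stand-ins for the objects that live inside the arena.
struct DBIterSketch { int id; bool valid; };
struct MergingIterSketch { int child_count; };

// Constructs a T inside the arena. T must be trivially destructible here,
// because the arena frees raw memory without running destructors.
template <typename T, typename... Args>
T* NewInArena(ArenaSketch* arena, Args&&... args) {
  void* mem = arena->Allocate(sizeof(T), alignof(T));
  return new (mem) T{std::forward<Args>(args)...};
}
```

Both objects end up next to each other in a single block, which is the cache-friendliness argument made for the iterator tree.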

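In addition, ArenaWrappedDBIter holds some extra information used for iterator Refresh: ColumnFamilyData cfd_, DBImpl db_impl_, and ReadOptions read_options_. Refresh means that when the SuperVersionNumber is newer than the version recorded when the iterator was created, the inner DBIter and InternalIterator have to be rebuilt; see the method ArenaWrappedDBIter::Refresh() for details.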
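The Refresh check boils down to comparing two version numbers. The following is a minimal sketch of that decision only, with hypothetical names (CfdVersionSketch, RefreshableIterSketch); the real method also rebuilds the arena and the iterator tree, which is reduced here to a counter.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Stand-in for the part of ColumnFamilyData that Refresh looks at.
struct CfdVersionSketch {
  std::atomic<uint64_t> super_version_number{1};
};

class RefreshableIterSketch {
 public:
  explicit RefreshableIterSketch(CfdVersionSketch* cfd) : cfd_(cfd) {
    Rebuild();
  }

  // Rebuilds the inner iterators only if the CF has installed a newer
  // SuperVersion since this iterator was (re)built. Returns true on rebuild.
  bool Refresh() {
    if (sv_number_ == cfd_->super_version_number.load()) return false;
    Rebuild();
    return true;
  }

  int rebuild_count() const { return rebuild_count_; }

 private:
  void Rebuild() {
    // The real code would recreate DBIter and InternalIterator here.
    sv_number_ = cfd_->super_version_number.load();
    ++rebuild_count_;
  }

  CfdVersionSketch* cfd_;
  uint64_t sv_number_ = 0;
  int rebuild_count_ = 0;
};
```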

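For the detailed KV format see db/memtable.cc / MemTable::Add(): internal_key_size(varint) + internal_key(user_key+sequence+type) + value_size(varint) + value. From the upper layers' point of view, the user_key may additionally carry a timestamp appended to the actual user data.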
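The entry layout quoted above can be reproduced with a short standalone sketch. It assumes LEB128-style varint32 length prefixes and the packed (sequence << 8) | type trailer written in little-endian order; the helper names are hypothetical and the optional timestamp is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Appends v as a varint32: 7 payload bits per byte, high bit = "more follows".
void PutVarint32Sketch(std::string* dst, uint32_t v) {
  while (v >= 0x80) {
    dst->push_back(static_cast<char>((v & 0x7f) | 0x80));
    v >>= 7;
  }
  dst->push_back(static_cast<char>(v));
}

// Reads a varint32 starting at *pos and advances *pos past it.
bool GetVarint32Sketch(const std::string& src, size_t* pos, uint32_t* v) {
  uint32_t result = 0;
  for (int shift = 0; shift <= 28 && *pos < src.size(); shift += 7) {
    const uint32_t byte = static_cast<unsigned char>(src[(*pos)++]);
    result |= (byte & 0x7f) << shift;
    if ((byte & 0x80) == 0) {
      *v = result;
      return true;
    }
  }
  return false;
}

// internal_key_size | user_key + trailer | value_size | value
std::string EncodeEntrySketch(const std::string& user_key, uint64_t sequence,
                              uint8_t type, const std::string& value) {
  std::string out;
  PutVarint32Sketch(&out, static_cast<uint32_t>(user_key.size() + 8));
  out += user_key;
  const uint64_t packed = (sequence << 8) | type;
  for (int i = 0; i < 8; ++i) {
    out.push_back(static_cast<char>((packed >> (8 * i)) & 0xff));
  }
  PutVarint32Sketch(&out, static_cast<uint32_t>(value.size()));
  out += value;
  return out;
}

// Splits an encoded entry back into its internal key and value.
bool DecodeEntrySketch(const std::string& buf, std::string* internal_key,
                       std::string* value) {
  size_t pos = 0;
  uint32_t key_len = 0;
  uint32_t value_len = 0;
  if (!GetVarint32Sketch(buf, &pos, &key_len)) return false;
  if (buf.size() - pos < key_len) return false;
  *internal_key = buf.substr(pos, key_len);
  pos += key_len;
  if (!GetVarint32Sketch(buf, &pos, &value_len)) return false;
  if (buf.size() - pos < value_len) return false;
  *value = buf.substr(pos, value_len);
  return true;
}
```

For a 3-byte user key and a 5-byte value the entry takes 1 + 11 + 1 + 5 = 18 bytes; the length prefixes only grow to two bytes once a length reaches 128.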

For the WriteBatch-level format see db/write_batch.cc / WriteBatchInternal::Put(): tag(type) + cf_id(varint) + key_and_timestamp_size(varint) + key_data + timestamp + value_size(varint) + value_data.

Note that when TTL is enabled, DBWithTTLImpl::Write() shows that the timestamp is the 4 bytes appended after the value; for the TTL filtering see TtlCompactionFilter.
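A rough sketch of the TTL scheme follows. The source only states that a 4-byte timestamp is appended to the value; the little-endian encoding, the helper names, and the staleness rule used here (an entry is stale once timestamp + ttl lies in the past, and a non-positive ttl means "never expires") are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

constexpr size_t kTimestampLengthSketch = 4;  // bytes appended to each value

// stored_value = user_value + 4-byte write time in seconds.
std::string AppendTimestampSketch(const std::string& value, int32_t now_secs) {
  std::string out = value;
  const uint32_t ts = static_cast<uint32_t>(now_secs);
  for (int i = 0; i < 4; ++i) {  // little-endian (assumption)
    out.push_back(static_cast<char>((ts >> (8 * i)) & 0xff));
  }
  return out;
}

// Extracts the trailing timestamp; false if the value is too short.
bool ReadTimestampSketch(const std::string& stored, int32_t* ts) {
  if (stored.size() < kTimestampLengthSketch) return false;
  const size_t base = stored.size() - kTimestampLengthSketch;
  uint32_t raw = 0;
  for (int i = 0; i < 4; ++i) {
    raw |= static_cast<uint32_t>(static_cast<unsigned char>(stored[base + i]))
           << (8 * i);
  }
  *ts = static_cast<int32_t>(raw);
  return true;
}

// What a TTL compaction filter would ask for each entry (assumed rule).
bool IsStaleSketch(const std::string& stored, int32_t ttl_secs,
                   int32_t now_secs) {
  int32_t ts = 0;
  if (ttl_secs <= 0 || !ReadTimestampSketch(stored, &ts)) return false;
  return static_cast<int64_t>(ts) + ttl_secs < now_secs;
}

// Returns the user-visible value without the trailing timestamp.
std::string StripTimestampSketch(const std::string& stored) {
  if (stored.size() < kTimestampLengthSketch) return stored;
  return stored.substr(0, stored.size() - kTimestampLengthSketch);
}
```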

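For more on Put() see DBImpl::WriteImpl() -> WriteBatchInternal::InsertInto() -> WriteBatch::Iterate() -> WriteBatchInternal::Iterate() -> MemTableInserter::PutCFImpl() -> MemTable::Add().

MergingIterator is an all-encompassing iterator and one kind of InternalIterator. All the different child iterators of the lower layers are placed inside MergingIterator, including the InternalIterators of the memtable, the immutable memtables, and the SSTs. They are held in a vector, and a min-heap (minHeap) is used to speed up picking which child iterator supplies the next KV.

Overview of the key code:

  • Building the InternalIterator: DBImpl::NewInternalIterator(); the code is listed at the end.
  • MergingIterator choosing the next key-value among its child iterators: MergingIterator::SeekToFirst() & Next(); the code is listed at the end.
  • The iterator's data-parsing method: DBIter::FindNextUserEntryInternal(); the code is listed at the end.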

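Now to answer the questions from the beginning:

Question 1: How do Iterator and ColumnFamilyHandle organize MemTable, ImmMemTable, Manifest, SST, and so on behind the scenes?

The analysis above should have made this largely clear: ColumnFamilyHandle leads to ColumnFamilyData, whose SuperVersion ties together mem, imm, and the current Version (the SST metadata), while an Iterator wraps a MergingIterator over child iterators for the memtable, the immutable memtables, and each SST level.

Question 2: To look up the values in a given key range of a CF, how is a specific position in a specific file located?

Just follow ArenaWrappedDBIter::Seek(const Slice& target) downwards. At MergingIterator::Seek(const Slice& target), Seek is called once on every child iterator, the valid children are put into a min-heap ordered by their current key, and the child with the smallest key is returned. When ArenaWrappedDBIter::Next() fetches the next key, the value of the previous smallest child is consumed and that child is advanced, after which the child with the smallest key is again returned. This repeats until the upper bound is reached.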
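The Seek-then-Next loop described above is a k-way merge over sorted runs. The following sketch shows the technique with std::priority_queue; the class names are hypothetical, the keys are plain strings without sequence numbers, and pop-then-push is used where the RocksDB code quoted at the end of this article uses replace_top().

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// One sorted run; stands in for a memtable or SST iterator.
class ChildIterSketch {
 public:
  explicit ChildIterSketch(std::vector<std::string> keys)
      : keys_(std::move(keys)) {}
  // Positions at the first key >= target.
  void Seek(const std::string& target) {
    pos_ = static_cast<size_t>(
        std::lower_bound(keys_.begin(), keys_.end(), target) - keys_.begin());
  }
  bool Valid() const { return pos_ < keys_.size(); }
  const std::string& key() const { return keys_[pos_]; }
  void Next() { ++pos_; }

 private:
  std::vector<std::string> keys_;  // must be sorted ascending
  size_t pos_ = 0;
};

class MergingIterSketch {
 public:
  explicit MergingIterSketch(std::vector<ChildIterSketch> children)
      : children_(std::move(children)) {}

  // Seek every child, then heapify the ones that are still valid.
  void Seek(const std::string& target) {
    heap_ = Heap();
    for (auto& child : children_) {
      child.Seek(target);
      if (child.Valid()) heap_.push(&child);
    }
  }
  bool Valid() const { return !heap_.empty(); }
  const std::string& key() const { return heap_.top()->key(); }

  // Advance the child that supplied the current key and re-insert it.
  void Next() {
    ChildIterSketch* current = heap_.top();
    heap_.pop();
    current->Next();
    if (current->Valid()) heap_.push(current);
  }

 private:
  struct GreaterKey {
    bool operator()(const ChildIterSketch* a, const ChildIterSketch* b) const {
      return a->key() > b->key();  // smallest key ends up on top
    }
  };
  using Heap = std::priority_queue<ChildIterSketch*,
                                   std::vector<ChildIterSketch*>, GreaterKey>;
  std::vector<ChildIterSketch> children_;
  Heap heap_;
};

// Range scan over [lower, upper), as a DBIter with an upper bound would do.
std::vector<std::string> ScanSketch(MergingIterSketch* it,
                                    const std::string& lower,
                                    const std::string& upper) {
  std::vector<std::string> out;
  for (it->Seek(lower); it->Valid() && it->key() < upper; it->Next()) {
    out.push_back(it->key());
  }
  return out;
}
```

With the three runs {a, d, g}, {b, e}, and {c, f, h}, scanning the range [c, g) yields c, d, e, f; each step costs one heap operation regardless of how many children there are beyond the logarithmic factor.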

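So how is the Seek of a child iterator carried out?

  • Seek on the in-memory MemTableIterator: taking the SkipList table as an example, SkipListRep::Iterator::Seek() finds the corresponding node in the SkipList;
  • Seek on level 0 SST files (there may be several): done through BlockBasedTableIterator::Seek()/PlainTableIterator::Seek(). BlockBasedTable is the default SST format; BlockBasedTableIterator in turn uses the SST's block index, IndexIterator::Seek(), to quickly locate the approximate position inside the file (which Block; a Block is typically 4K in size), and finally BlockIter::Seek() uses binary search within the Block to find the specific Entry for the key;
  • Seek on level 1~n SST files: each level has one LevelIterator. The contents of the multiple SST files in one level are all sorted, so LevelIterator::Seek() first finds the file in that level that corresponds to the key and returns the BlockBasedTableIterator of that SST file, then calls BlockBasedTableIterator::Seek(); from there the flow is similar to the level 0 case analyzed above;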
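The three-step narrowing for level 1~n (file, then block, then entry) can be sketched as three binary searches. This is an illustration under simplifying assumptions: the types are hypothetical, files and blocks are non-empty and hold plain sorted strings, and the in-block search is a plain lower_bound, whereas a real block iterator is understood to binary-search restart points and then scan linearly.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct BlockSketch {
  std::vector<std::string> keys;  // sorted, non-empty
};

struct SstFileSketch {
  std::vector<BlockSketch> blocks;  // sorted by key range, non-empty
  const std::string& largest() const { return blocks.back().keys.back(); }
};

struct PositionSketch {
  size_t file = 0;
  size_t block = 0;
  size_t entry = 0;
};

// Finds the first entry >= target in a level whose files do not overlap.
// Returns false when every key in the level is smaller than target.
bool SeekInLevelSketch(const std::vector<SstFileSketch>& level,
                       const std::string& target, PositionSketch* pos) {
  // Step 1 (LevelIterator): first file whose largest key is >= target.
  auto file_it = std::lower_bound(
      level.begin(), level.end(), target,
      [](const SstFileSketch& f, const std::string& t) {
        return f.largest() < t;
      });
  if (file_it == level.end()) return false;

  // Step 2 (index lookup): first block whose last key is >= target.
  // Such a block exists because the file's largest key is >= target.
  auto block_it = std::lower_bound(
      file_it->blocks.begin(), file_it->blocks.end(), target,
      [](const BlockSketch& b, const std::string& t) {
        return b.keys.back() < t;
      });

  // Step 3 (in-block search): first entry >= target inside that block.
  auto entry_it =
      std::lower_bound(block_it->keys.begin(), block_it->keys.end(), target);

  pos->file = static_cast<size_t>(file_it - level.begin());
  pos->block = static_cast<size_t>(block_it - file_it->blocks.begin());
  pos->entry = static_cast<size_t>(entry_it - block_it->keys.begin());
  return true;
}
```

Level 0 differs only in step 1: its files may overlap, so each file gets its own child iterator and steps 2 and 3 run per file.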

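Question 3: How is the lifetime of an Iterator managed? After the CF is closed, how does the Iterator remain usable instead of being freed?

The ColumnFamilyData structure contains a reference count, refs. Calling ColumnFamilyHandle.close() to release the CF descriptor only decrements the reference count of the underlying ColumnFamilyData by 1; the object is actually freed only when refs reaches 0 (see the destructor ~ColumnFamilyHandleImpl()). As long as something else still holds a reference, the data the Iterator reads from stays alive.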
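The reference-counting behaviour can be sketched as follows. It is a simplified illustration with hypothetical names; in particular, it assumes that an open iterator contributes one reference of its own (directly or through the structures it pins), which is what keeps the data alive after the handle is closed.

```cpp
#include <atomic>
#include <cassert>

// Stand-in for ColumnFamilyData: deletes itself when the last ref is dropped.
class CfdRefSketch {
 public:
  explicit CfdRefSketch(bool* destroyed) : destroyed_(destroyed) {}
  ~CfdRefSketch() { *destroyed_ = true; }

  void Ref() { refs_.fetch_add(1); }

  // Returns true if this call released the last reference.
  bool Unref() {
    if (refs_.fetch_sub(1) == 1) {
      delete this;
      return true;
    }
    return false;
  }

 private:
  std::atomic<int> refs_{0};
  bool* destroyed_;  // lets the caller observe destruction in this sketch
};

// RAII owner of exactly one reference; used for both roles below.
class RefHolderSketch {
 public:
  explicit RefHolderSketch(CfdRefSketch* cfd) : cfd_(cfd) { cfd_->Ref(); }
  ~RefHolderSketch() { cfd_->Unref(); }
  RefHolderSketch(const RefHolderSketch&) = delete;
  RefHolderSketch& operator=(const RefHolderSketch&) = delete;

 private:
  CfdRefSketch* cfd_;
};

using HandleSketch = RefHolderSketch;    // plays ColumnFamilyHandleImpl
using IteratorSketch = RefHolderSketch;  // plays an open iterator (assumption)
```

Destroying the handle while an iterator is still open only lowers the count from 2 to 1; the ColumnFamilyData stand-in is deleted when the iterator goes away as well.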

 

Key structures


Key structure: ColumnFamilyData

Code path: rocksdb/db/column_family.cc

// This class keeps all the data that a column family needs.
// Most methods require DB mutex held, unless otherwise noted
class ColumnFamilyData {
  uint32_t id_;
  const std::string name_;
  Version* dummy_versions_;  // Head of circular doubly-linked list of versions.
  Version* current_;         // == dummy_versions->prev_

  std::atomic<int> refs_;      // outstanding references to ColumnFamilyData
  std::atomic<bool> initialized_;
  std::atomic<bool> dropped_;  // true if client dropped it

  const InternalKeyComparator internal_comparator_;
  std::vector<std::unique_ptr<IntTblPropCollectorFactory>>
      int_tbl_prop_collector_factories_;

  const ColumnFamilyOptions initial_cf_options_;
  const ImmutableCFOptions ioptions_;
  MutableCFOptions mutable_cf_options_;

  const bool is_delete_range_supported_;

  std::unique_ptr<TableCache> table_cache_;

  std::unique_ptr<InternalStats> internal_stats_;

  WriteBufferManager* write_buffer_manager_;

  MemTable* mem_;
  MemTableList imm_;
  SuperVersion* super_version_;

  // An ordinal representing the current SuperVersion. Updated by
  // InstallSuperVersion(), i.e. incremented every time super_version_
  // changes.
  std::atomic<uint64_t> super_version_number_;

  // Thread's local copy of SuperVersion pointer
  // This needs to be destructed before mutex_
  std::unique_ptr<ThreadLocalPtr> local_sv_;

  // pointers for a circular linked list. we use it to support iterations over
  // all column families that are alive (note: dropped column families can also
  // be alive as long as client holds a reference)
  ColumnFamilyData* next_;
  ColumnFamilyData* prev_;

  // This is the earliest log file number that contains data from this
  // Column Family. All earlier log files must be ignored and not
  // recovered from
  uint64_t log_number_;

  std::atomic<FlushReason> flush_reason_;

  // An object that keeps all the compaction stats
  // and picks the next compaction
  std::unique_ptr<CompactionPicker> compaction_picker_;

  ColumnFamilySet* column_family_set_;

  std::unique_ptr<WriteControllerToken> write_controller_token_;

  // If true --> this ColumnFamily is currently present in DBImpl::flush_queue_
  bool queued_for_flush_;

  // If true --> this ColumnFamily is currently present in
  // DBImpl::compaction_queue_
  bool queued_for_compaction_;

  uint64_t prev_compaction_needed_bytes_;

  // if the database was opened with 2pc enabled
  bool allow_2pc_;

  // Memtable id to track flush.
  std::atomic<uint64_t> last_memtable_id_;

  // Directories corresponding to cf_paths.
  std::vector<std::unique_ptr<Directory>> data_dirs_;
};

Key structure: ArenaWrappedDBIter

Code path: rocksdb/db/db_iter.cc (rocksdb/db/db_impl.cc ArenaWrappedDBIter* DBImpl::NewIteratorImpl() <= Iterator* DBImpl::NewIterator())

// A wrapper iterator which wraps DB Iterator and the arena, with which the DB
// iterator is supposed be allocated. This class is used as an entry point of
// a iterator hierarchy whose memory can be allocated inline. In that way,
// accessing the iterator tree can be more cache friendly. It is also faster
// to allocate.
class ArenaWrappedDBIter : public Iterator {
  DBIter* db_iter_;
  Arena arena_;
  uint64_t sv_number_;
  ColumnFamilyData* cfd_ = nullptr;
  DBImpl* db_impl_ = nullptr;
  ReadOptions read_options_;
  ReadCallback* read_callback_;
  bool allow_blob_ = false;
  bool allow_refresh_ = true;
};
ArenaWrappedDBIter* DBImpl::NewIteratorImpl(const ReadOptions& read_options,
                                            ColumnFamilyData* cfd,
                                            SequenceNumber snapshot,
                                            ReadCallback* read_callback,
                                            bool allow_blob,
                                            bool allow_refresh) {
  // Try to generate a DB iterator tree in continuous memory area to be
  // cache friendly. Here is an example of result:
  // +-------------------------------+
  // |                               |
  // | ArenaWrappedDBIter            |
  // |  +                            |
  // |  +---> Inner Iterator   ------------+
  // |  |                            |     |
  // |  |    +-- -- -- -- -- -- -- --+     |
  // |  +--- | Arena                 |     |
  // |       |                       |     |
  // |          Allocated Memory:    |     |
  // |       |   +-------------------+     |
  // |       |   | DBIter            | <---+
  // |           |  +                |
  // |       |   |  +-> iter_  ------------+
  // |       |   |                   |     |
  // |       |   +-------------------+     |
  // |       |   | MergingIterator   | <---+
  // |           |  +                |
  // |       |   |  +->child iter1  ------------+
  // |       |   |  |                |          |
  // |           |  +->child iter2  ----------+ |
  // |       |   |  |                |        | |
  // |       |   |  +->child iter3  --------+ | |
  // |           |                   |      | | |
  // |       |   +-------------------+      | | |
  // |       |   | Iterator1         | <--------+
  // |       |   +-------------------+      | |
  // |       |   | Iterator2         | <------+
  // |       |   +-------------------+      |
  // |       |   | Iterator3         | <----+
  // |       |   +-------------------+
  // |       |                       |
  // +-------+-----------------------+

 

Detailed code


Building the InternalIterator: DBImpl::NewInternalIterator():

InternalIterator* DBImpl::NewInternalIterator(
    const ReadOptions& read_options, ColumnFamilyData* cfd,
    SuperVersion* super_version, Arena* arena,
    RangeDelAggregator* range_del_agg) {
  InternalIterator* internal_iter;
  assert(arena != nullptr);
  assert(range_del_agg != nullptr);
  // Need to create internal iterator from the arena.
  MergeIteratorBuilder merge_iter_builder(
      &cfd->internal_comparator(), arena,
      !read_options.total_order_seek &&
          cfd->ioptions()->prefix_extractor != nullptr);
  // Collect iterator for mutable mem
  merge_iter_builder.AddIterator(
      super_version->mem->NewIterator(read_options, arena));
  std::unique_ptr<InternalIterator> range_del_iter;
  Status s;
  if (!read_options.ignore_range_deletions) {
    range_del_iter.reset(
        super_version->mem->NewRangeTombstoneIterator(read_options));
    s = range_del_agg->AddTombstones(std::move(range_del_iter));
  }
  // Collect all needed child iterators for immutable memtables
  if (s.ok()) {
    super_version->imm->AddIterators(read_options, &merge_iter_builder);
    if (!read_options.ignore_range_deletions) {
      s = super_version->imm->AddRangeTombstoneIterators(read_options, arena,
                                                         range_del_agg);
    }
  }
  TEST_SYNC_POINT_CALLBACK("DBImpl::NewInternalIterator:StatusCallback", &s);
  if (s.ok()) {
    // Collect iterators for files in L0 - Ln
    if (read_options.read_tier != kMemtableTier) {
      super_version->current->AddIterators(read_options, env_options_,
                                           &merge_iter_builder, range_del_agg);
    }
    internal_iter = merge_iter_builder.Finish();
    IterState* cleanup =
        new IterState(this, &mutex_, super_version,
                      read_options.background_purge_on_iterator_cleanup);
    internal_iter->RegisterCleanup(CleanupIteratorState, cleanup, nullptr);

    return internal_iter;
  } else {
    CleanupSuperVersion(super_version);
  }
  return NewErrorInternalIterator(s, arena);
}

 

MergingIterator choosing the next key among its child iterators, using a min-heap to speed up the pick: MergingIterator::SeekToFirst() & Next()

virtual void SeekToFirst() override {
    ClearHeaps();
    status_ = Status::OK();
    for (auto& child : children_) {
      child.SeekToFirst();
      if (child.Valid()) {
        assert(child.status().ok());
        minHeap_.push(&child);
      } else {
        considerStatus(child.status());
      }
    }
    direction_ = kForward;
    current_ = CurrentForward();
  }
  
  IteratorWrapper* CurrentForward() const {
    assert(direction_ == kForward);
    return !minHeap_.empty() ? minHeap_.top() : nullptr;
  }
  
  virtual void Next() override {
    assert(Valid());

    // Ensure that all children are positioned after key().
    // If we are moving in the forward direction, it is already
    // true for all of the non-current children since current_ is
    // the smallest child and key() == current_->key().
    if (direction_ != kForward) {
      SwitchToForward();
      // The loop advanced all non-current children to be > key() so current_
      // should still be strictly the smallest key.
      assert(current_ == CurrentForward());
    }

    // For the heap modifications below to be correct, current_ must be the
    // current top of the heap.
    assert(current_ == CurrentForward());

    // as the current points to the current record. move the iterator forward.
    current_->Next();
    if (current_->Valid()) {
      // current is still valid after the Next() call above.  Call
      // replace_top() to restore the heap property.  When the same child
      // iterator yields a sequence of keys, this is cheap.
      assert(current_->status().ok());
      minHeap_.replace_top(current_);
    } else {
      // current stopped being valid, remove it from the heap.
      considerStatus(current_->status());
      minHeap_.pop();
    }
    current_ = CurrentForward();
  }

 

The iterator's data-parsing method: DBIter::FindNextUserEntryInternal():

bool DBIter::FindNextUserEntryInternal(bool skipping, bool prefix_check) {
  // Loop until we hit an acceptable entry to yield
  assert(iter_->Valid());
  assert(status_.ok());
  assert(direction_ == kForward);
  current_entry_is_merged_ = false;

  // How many times in a row we have skipped an entry with user key less than
  // or equal to saved_key_. We could skip these entries either because
  // sequence numbers were too high or because skipping = true.
  // What saved_key_ contains throughout this method:
  //  - if skipping        : saved_key_ contains the key that we need to skip,
  //                         and we haven't seen any keys greater than that,
  //  - if num_skipped > 0 : saved_key_ contains the key that we have skipped
  //                         num_skipped times, and we haven't seen any keys
  //                         greater than that,
  //  - none of the above  : saved_key_ can contain anything, it doesn't matter.
  uint64_t num_skipped = 0;

  is_blob_ = false;

  do {
    if (!ParseKey(&ikey_)) {
      return false;
    }

    if (iterate_upper_bound_ != nullptr &&
        user_comparator_->Compare(ikey_.user_key, *iterate_upper_bound_) >= 0) {
      break;
    }

    if (prefix_extractor_ && prefix_check &&
        prefix_extractor_->Transform(ikey_.user_key)
                .compare(prefix_start_key_) != 0) {
      break;
    }

    if (TooManyInternalKeysSkipped()) {
      return false;
    }

    if (IsVisible(ikey_.sequence)) {
      if (skipping && user_comparator_->Compare(ikey_.user_key,
                                                saved_key_.GetUserKey()) <= 0) {
        num_skipped++;  // skip this entry
        PERF_COUNTER_ADD(internal_key_skipped_count, 1);
      } else {
        num_skipped = 0;
        switch (ikey_.type) {
          case kTypeDeletion:
          case kTypeSingleDeletion:
            // Arrange to skip all upcoming entries for this key since
            // they are hidden by this deletion.
            // if iterartor specified start_seqnum we
            // 1) return internal key, including the type
            // 2) return ikey only if ikey.seqnum >= start_seqnum_
            // note that if deletion seqnum is < start_seqnum_ we
            // just skip it like in normal iterator.
            if (start_seqnum_ > 0 && ikey_.sequence >= start_seqnum_)  {
              saved_key_.SetInternalKey(ikey_);
              valid_ = true;
              return true;
            } else {
              saved_key_.SetUserKey(
                ikey_.user_key,
                !pin_thru_lifetime_ || !iter_->IsKeyPinned() /* copy */);
              skipping = true;
              PERF_COUNTER_ADD(internal_delete_skipped_count, 1);
            }
            break;
          case kTypeValue:
          case kTypeBlobIndex:
            if (start_seqnum_ > 0) {
              // we are taking incremental snapshot here
              // incremental snapshots aren't supported on DB with range deletes
              assert(!(
                (ikey_.type == kTypeBlobIndex) && (start_seqnum_ > 0)
              ));
              if (ikey_.sequence >= start_seqnum_) {
                saved_key_.SetInternalKey(ikey_);
                valid_ = true;
                return true;
              } else {
                // this key and all previous versions shouldn't be included,
                // skipping
                saved_key_.SetUserKey(ikey_.user_key,
                  !pin_thru_lifetime_ || !iter_->IsKeyPinned() /* copy */);
                skipping = true;
              }
            } else {
              saved_key_.SetUserKey(
                  ikey_.user_key,
                  !pin_thru_lifetime_ || !iter_->IsKeyPinned() /* copy */);
              if (range_del_agg_.ShouldDelete(
                      ikey_, RangeDelAggregator::RangePositioningMode::
                                 kForwardTraversal)) {
                // Arrange to skip all upcoming entries for this key since
                // they are hidden by this deletion.
                ...
            }
            break;
          case kTypeMerge:
            saved_key_.SetUserKey(
                ikey_.user_key,
                !pin_thru_lifetime_ || !iter_->IsKeyPinned() /* copy */);
            if (range_del_agg_.ShouldDelete(
                    ikey_, RangeDelAggregator::RangePositioningMode::
                               kForwardTraversal)) {
              // Arrange to skip all upcoming entries for this key since
              // they are hidden by this deletion.
              skipping = true;
              num_skipped = 0;
              PERF_COUNTER_ADD(internal_delete_skipped_count, 1);
            } else {
              // By now, we are sure the current ikey is going to yield a
              // value
              current_entry_is_merged_ = true;
              valid_ = true;
              return MergeValuesNewToOld();  // Go to a different state machine
            }
            break;
          default:
            assert(false);
            break;
        }
      }
    } else {
      PERF_COUNTER_ADD(internal_recent_skipped_count, 1);

      // This key was inserted after our snapshot was taken.
      // If this happens too many times in a row for the same user key, we want
      // to seek to the target sequence number.
      int cmp =
          user_comparator_->Compare(ikey_.user_key, saved_key_.GetUserKey());
      if (cmp == 0 || (skipping && cmp <= 0)) {
        num_skipped++;
      } else {
        saved_key_.SetUserKey(
            ikey_.user_key,
            !iter_->IsKeyPinned() || !pin_thru_lifetime_ /* copy */);
        skipping = false;
        num_skipped = 0;
      }
    }

    // If we have sequentially iterated via numerous equal keys, then it's
    // better to seek so that we can avoid too many key comparisons.
    if (num_skipped > max_skip_) {
      num_skipped = 0;
      std::string last_key;
      if (skipping) {
        // We're looking for the next user-key but all we see are the same
        // user-key with decreasing sequence numbers. Fast forward to
        // sequence number 0 and type deletion (the smallest type).
        AppendInternalKey(&last_key, ParsedInternalKey(saved_key_.GetUserKey(),
                                                       0, kTypeDeletion));
        // Don't set skipping = false because we may still see more user-keys
        // equal to saved_key_.
      } else {
        // We saw multiple entries with this user key and sequence numbers
        // higher than sequence_. Fast forward to sequence_.
        // Note that this only covers a case when a higher key was overwritten
        // many times since our snapshot was taken, not the case when a lot of
        // different keys were inserted after our snapshot was taken.
        AppendInternalKey(&last_key,
                          ParsedInternalKey(saved_key_.GetUserKey(), sequence_,
                                            kValueTypeForSeek));
      }
      iter_->Seek(last_key);
      RecordTick(statistics_, NUMBER_OF_RESEEKS_IN_ITERATION);
    } else {
      iter_->Next();
    }
  } while (iter_->Valid());

  valid_ = false;
  return iter_->status().ok();
}
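The core of the loop above, stripped of range deletions, merges, prefix checks, and the reseek optimization, can be sketched in a few lines. This is a deliberately reduced illustration with hypothetical names: it assumes the input is already ordered by user key ascending and sequence descending, and it handles only plain values and point deletions.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

struct EntrySketch {
  std::string user_key;
  uint64_t sequence;
  bool is_delete;
  std::string value;
};

// Returns the user-visible (key, value) pairs as of snapshot_seq.
// Input order: user_key ascending, then sequence descending (newest first).
std::vector<std::pair<std::string, std::string>> VisibleEntriesSketch(
    const std::vector<EntrySketch>& sorted, uint64_t snapshot_seq) {
  std::vector<std::pair<std::string, std::string>> out;
  bool skipping = false;  // true while older versions of saved_key remain
  std::string saved_key;
  for (const EntrySketch& e : sorted) {
    // Written after the snapshot was taken: not visible to this reader.
    if (e.sequence > snapshot_seq) continue;
    // An older version of a key that has already been resolved.
    if (skipping && e.user_key == saved_key) continue;
    // First visible version of this user key decides its fate.
    saved_key = e.user_key;
    skipping = true;
    if (!e.is_delete) out.emplace_back(e.user_key, e.value);
    // A deletion hides the key entirely, so nothing is emitted for it.
  }
  return out;
}
```

With the entries a@5, a@3, b@6 (delete), b@2, c@9, c@4, a snapshot at sequence 7 yields a and c (b is hidden by its deletion, c@9 is too new), while a snapshot at sequence 5 yields a, b, and c, because the deletion of b is not yet visible.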

<end>

 
