RoaringBitmap: Bitmap Data Structure and Source Code Analysis

I. Introduction

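This article first covers the basic principle and use cases of the bitmap, then takes a closer look at one bitmap implementation, RoaringBitmap, by analyzing its data structures and parts of its core source code.
Official RoaringBitmap articles: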


II. Bitmap

1. What is a bitmap

Let's start with a frequently asked interview question:

Given 4 billion distinct, unsorted unsigned int values, how can a PC with 2 GB of memory determine whether a given number is among them?

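First, how much memory would these 4 billion integers need at minimum? An int takes 4 bytes, so 4 billion × 4 / 1024 / 1024 / 1024 ≈ 14.9 GB, far more than the 2 GB available. Storing them as plain ints clearly cannot work within 2 GB, and this is where a bitmap comes in.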

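A bitmap uses a single bit to mark a number: bit N (counting from 0) represents the integer N, where 1 means present and 0 means absent.
For example, insert the four numbers 2, 3, 5, and 6 into 1 byte of memory:
[Figure: one byte with bits 2, 3, 5, and 6 set to 1 (blue) and the remaining bits left at 0 (green)]
As the figure shows, inserting 2, 3, 5, and 6 into 1 byte (8 bits) only requires flipping the bits at those positions from 0 (green) to 1 (blue). Storing them as 4 ints would take 16 bytes; now 1 byte is enough.
Back to the opening question: if the 4 billion numbers are consecutive, the bitmap needs 4 billion / 8 / 1024 / 1024 ≈ 476 MB, which greatly reduces memory use. Even a bitmap covering the whole unsigned int range takes only 2^32 bits = 512 MB, which still fits in 2 GB.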
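The principle above can be sketched in a few lines of Java. This is a minimal illustration rather than how any particular library does it: the class name SimpleBitmapSketch and its methods are made up for this example, and it stores bits in a long[] (64 bits per word) instead of single bytes, so even an 8-bit bitmap rounds up to one 8-byte word.

```java
/** Minimal uncompressed bitmap sketch: bit N marks whether the integer N is present. */
final class SimpleBitmapSketch {
    private final long[] words;

    SimpleBitmapSketch(int capacityInBits) {
        // One long holds 64 bits, so round the capacity up to whole words.
        this.words = new long[(capacityInBits + 63) / 64];
    }

    void set(int n) {
        // n >>> 6 selects the word, n & 63 selects the bit inside that word.
        words[n >>> 6] |= 1L << (n & 63);
    }

    boolean contains(int n) {
        return (words[n >>> 6] & (1L << (n & 63))) != 0;
    }

    int sizeInBytes() {
        return words.length * Long.BYTES;
    }

    public static void main(String[] args) {
        SimpleBitmapSketch bitmap = new SimpleBitmapSketch(8);
        for (int n : new int[] {2, 3, 5, 6}) {
            bitmap.set(n);
        }
        System.out.println(bitmap.contains(5)); // true
        System.out.println(bitmap.contains(4)); // false
    }
}
```

Membership testing is a single array access plus a mask, which is why the lookup cost does not depend on how many numbers are stored.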

2. Uses of bitmaps

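Bitmaps are often used to store massive data sets compactly and to sort, deduplicate, and query them, as well as to compute differences, unions, and intersections.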
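As a rough illustration of these operations, the sketch below uses the JDK's java.util.BitSet, an ordinary uncompressed bitmap, as a stand-in. The class and method names (BitSetOpsSketch, sortedDistinct, intersection) are invented for this example, and it assumes all values are non-negative.

```java
import java.util.Arrays;
import java.util.BitSet;

final class BitSetOpsSketch {
    /** Builds a bitmap from the given non-negative values; duplicates collapse into one bit. */
    static BitSet of(int... values) {
        BitSet bits = new BitSet();
        for (int v : values) {
            bits.set(v);
        }
        return bits;
    }

    /** Sorting and deduplication fall out of the bitmap: set bits are read back in ascending order. */
    static int[] sortedDistinct(int... values) {
        return of(values).stream().toArray();
    }

    /** Intersection is a bitwise AND of the two bitmaps. */
    static int[] intersection(int[] a, int[] b) {
        BitSet result = of(a);
        result.and(of(b));
        return result.stream().toArray();
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(sortedDistinct(5, 3, 5, 1)));
        System.out.println(Arrays.toString(intersection(new int[] {1, 2, 3}, new int[] {2, 3, 4})));
    }
}
```

Union and difference work the same way with BitSet.or and BitSet.andNot.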

3. Problems with bitmaps

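In the example above we compressed 14.9 GB down to 476 MB, saving nearly 97% of the memory, but that relies on the data being consecutive, that is, dense.
If we store only the single number 4 billion, the bitmap still needs 476 MB, which is not worth it. A bitmap therefore only pays off when the data is fairly dense.
To address the sparse-data problem, several compression schemes exist that reduce memory and improve efficiency: WAH, EWAH, CONCISE, RoaringBitmap, and others. The first three compress with run-length encoding (RLE), while RoaringBitmap goes further on compression and also takes performance into account.
Below is the description of these bitmap implementations from the project's GitHub page: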

How does Roaring compares with the alternatives?
Most alternatives to Roaring are part of a larger family of compressed bitmaps that are run-length-encoded bitmaps. They identify long runs of 1s or 0s and they represent them with a marker word. If you have a local mix of 1s and 0, you use an uncompressed word.
There are many formats in this family:

  • Oracle’s BBC is an obsolete format at this point: though it may provide good compression, it is likely much slower than more recent alternatives due to excessive branching.
  • WAH is a patented variation on BBC that provides better performance.
  • Concise is a variation on the patented WAH. In some specific instances, it can compress much better than WAH (up to 2x better), but it is generally slower.
  • EWAH is both free of patent, and it is faster than all the above. On the downside, it does not compress quite as well. It is faster because it allows some form of “skipping” over uncompressed words. So though none of these formats are great at random access, EWAH is better than the alternatives.

There is a big problem with these formats however that can hurt you badly in some cases: there is no random access. If you want to check whether a given value is present in the set, you have to start from the beginning and “uncompress” the whole thing. This means that if you want to intersect a big set with a large set, you still have to uncompress the whole big set in the worst case…
Roaring solves this problem. It works in the following manner. It divides the data into chunks of 2^16 integers (e.g., [0, 2^16), [2^16, 2 x 2^16), …). Within a chunk, it can use an uncompressed bitmap, a simple list of integers, or a list of runs. Whatever format it uses, they all allow you to check for the presence of any one value quickly (e.g., with a binary search). The net result is that Roaring can compute many operations much faster than run-length-encoded formats like WAH, EWAH, Concise… Maybe surprisingly, Roaring also generally offers better compression ratios.


III. RoaringBitmap

1. Implementation idea

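RoaringBitmap (RBM below) handles int values. RBM splits a 32-bit int into its high 16 bits and low 16 bits and handles them separately: the high 16 bits serve as the index and the low 16 bits as the data actually stored.
RBM stores data in chunks by index, and all integers in a chunk share the same high 16 bits. The first chunk covers [0, 2^16), whose high 16 bits are all 0; the second chunk covers [2^16, 2 × 2^16), whose high 16 bits are all 1; and so on. Each RBM can have at most 2^16 chunks, and each chunk can hold at most 2^16 values.
If this is not yet clear, see the figure below.
[Figure: a 32-bit int split into a high 16-bit chunk index and a low 16-bit value]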
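The split can be tried out with a small sketch. The helper names below (HighLowSplitSketch, highBits, lowBits, combine) are chosen for this example and return plain ints rather than the short values the library works with.

```java
final class HighLowSplitSketch {
    /** High 16 bits: the chunk index, in the range 0..65535. */
    static int highBits(int x) {
        return x >>> 16;
    }

    /** Low 16 bits: the value stored inside the chunk, in the range 0..65535. */
    static int lowBits(int x) {
        return x & 0xFFFF;
    }

    /** Reassembles the original 32-bit value from its two halves. */
    static int combine(int high, int low) {
        return (high << 16) | low;
    }

    public static void main(String[] args) {
        int x = 131071; // the last value of the second chunk, [2^16, 2 x 2^16)
        System.out.println(highBits(x)); // 1
        System.out.println(lowBits(x));  // 65535
    }
}
```

Because the shift is unsigned (>>>), negative ints land in the upper chunks, so -1 maps to chunk 65535.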
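Now let's look at RBM's data structure:
RoaringArray: every RBM contains one RoaringArray, named highLowContainer, with these important fields:
keys: a short array that stores the high 16 bits as the index
values: a Container array that stores the low 16 bits of the data
size: the number of valid key-value pairs currently in the RBM
Note: the keys array and the values array correspond one to one, and keys is always kept sorted so that the index can be binary-searched later.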

RoaringBitmap

  RoaringArray highLowContainer = null;

  /**
   * Create an empty bitmap
   */
  public RoaringBitmap() {
    highLowContainer = new RoaringArray();
  }

RoaringArray

  static final int INITIAL_CAPACITY = 4;

  // stores the high 16 bits as the index
  short[] keys = null;

  // different Container types store the low 16 bits
  Container[] values = null;

  int size = 0;

  protected RoaringArray() {
    this(INITIAL_CAPACITY);
  }

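To summarize briefly how RBM works:
A 32-bit unsigned integer is handled in chunks by its high 16 bits: the high 16 bits are stored as an index in a short array, and the low 16 bits are stored as data in the corresponding entry of the Container array. To store a value, first find the index key for its high 16 bits by binary search; since keys and values correspond one to one, this also locates the Container. If the key exists, the low 16 bits are put into that Container; otherwise a new key and Container are created, and the low 16 bits are put into the new Container.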
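The flow just summarized can be sketched as a toy two-level structure. This is a simplified illustration under stated assumptions, not the library's code: TwoLevelIndexSketch is a made-up class, the keys are kept as ints instead of shorts, and a java.util.BitSet stands in for the real Container types.

```java
import java.util.Arrays;
import java.util.BitSet;

final class TwoLevelIndexSketch {
    private int[] keys = new int[4];         // sorted high-16-bit keys
    private BitSet[] values = new BitSet[4]; // values[i] holds the low 16 bits for keys[i]
    private int size = 0;                    // number of valid key-value pairs

    void add(int x) {
        int high = x >>> 16;
        int low = x & 0xFFFF;
        int i = Arrays.binarySearch(keys, 0, size, high);
        if (i < 0) {
            // Key not present: a negative result encodes the insertion point.
            i = -i - 1;
            if (size == keys.length) {
                keys = Arrays.copyOf(keys, size * 2);
                values = Arrays.copyOf(values, size * 2);
            }
            // Shift the tail right by one slot so that keys stays sorted.
            System.arraycopy(keys, i, keys, i + 1, size - i);
            System.arraycopy(values, i, values, i + 1, size - i);
            keys[i] = high;
            values[i] = new BitSet();
            size++;
        }
        values[i].set(low);
    }

    boolean contains(int x) {
        int i = Arrays.binarySearch(keys, 0, size, x >>> 16);
        return i >= 0 && values[i].get(x & 0xFFFF);
    }

    int chunkCount() {
        return size;
    }

    public static void main(String[] args) {
        TwoLevelIndexSketch sketch = new TwoLevelIndexSketch();
        sketch.add(1);
        sketch.add(65536);
        System.out.println(sketch.contains(65536)); // true
        System.out.println(sketch.chunkCount());    // 2
    }
}
```

Only chunks that actually contain data get a key and a container, which is what keeps sparse data cheap compared with one flat bitmap.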

The Container is the core of RBM and the key to why it outperforms the other schemes; it is introduced next.

2、Container

A Container stores the low 16 bits of the data. Depending on the amount and density of the data, there are three kinds:

  • ArrayContainer
  • BitmapContainer
  • RunContainer

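The following gives an overview of the three containers first; for the concrete implementation, see the source code analysis in the next section.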

① ArrayContainer
  // maximum number of values an ArrayContainer may hold
  static final int DEFAULT_MAX_SIZE = 4096;// containers with DEFAULT_MAX_SZE or less integers
                                           // should be ArrayContainers

  // cardinality
  protected int cardinality = 0;

  // data is stored in a short array
  short[] content;

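ArrayContainer stores the low 16 bits in a plain short array; content is always sorted and free of duplicates, which makes binary search possible.
As you can see, ArrayContainer applies no compression at all and simply stores the low 16 bits in a short[], so its memory footprint grows linearly with the number of values: a short is 2 bytes, so n values take 2n bytes.
As the data grows, ArrayContainer takes more and more memory, and because lookups use binary search with O(log n) time complexity, lookups slow down as well. ArrayContainer therefore caps the number of values at 4096, that is, 8 KB. To see why the threshold is 4096, we need to look at the next Container as well: BitmapContainer.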

② BitmapContainer
  // maximum capacity
  protected static final int MAX_CAPACITY = 1 << 16;
  
  // a fixed-length long array that stores data bit by bit
  final long[] bitmap;

  // cardinality
  int cardinality;

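BitmapContainer stores the low 16 bits in a long array. It is an ordinary uncompressed bitmap in which each bit position represents one number. As mentioned above, each Container can handle at most 2^16 numbers, so by the bitmap principle we need 2^16 bits; each long is 8 bytes (64 bits), so we need 2^16 / 64 = 1024 longs.
The BitmapContainer constructor initializes a long array of length 1024, so whether it stores 1 value, 10 values, or the maximum 65536 values, a BitmapContainer always occupies 8 KB of memory. This also explains why the threshold for ArrayContainer is 4096: at 4096 values an ArrayContainer also takes 8 KB, and beyond that a BitmapContainer is smaller.
Because it works directly with bit operations, adding, removing, and looking up a single value in this container all take O(1) time.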
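The break-even point can be checked with a few lines of arithmetic. The class below is only a sketch for this article: ContainerSizeSketch and its members are invented names, and it counts just the payload bytes (2 bytes per short, 8 bytes per long) while ignoring object headers.

```java
final class ContainerSizeSketch {
    /** A bitmap over 2^16 values always needs 65536 bits = 8192 bytes. */
    static final int BITMAP_BYTES = (1 << 16) / 8;

    /** A sorted short array needs 2 bytes per stored value. */
    static int arrayBytes(int cardinality) {
        return 2 * cardinality;
    }

    /** True while the array form is no larger than the fixed-size bitmap form. */
    static boolean arrayIsNoLarger(int cardinality) {
        return arrayBytes(cardinality) <= BITMAP_BYTES;
    }

    public static void main(String[] args) {
        System.out.println(arrayBytes(4096));      // 8192
        System.out.println(arrayIsNoLarger(4096)); // true
        System.out.println(arrayIsNoLarger(4097)); // false
    }
}
```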

③ RunContainer
  private short[] valueslength;// we interleave values and lengths, so
  // that if you have the values 11,12,13,14,15, you store that as 11,4 where 4 means that beyond 11
  // itself, there are
  // 4 contiguous values that follows.
  // Other example: e.g., 1, 10, 20,0, 31,2 would be a concise representation of 1, 2, ..., 11, 20,
  // 31, 32, 33

  int nbrruns = 0;// how many runs, this number should fit in 16 bits.

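In the early days of RBM there were only the two containers above; RunContainer was added later. RunContainer compresses with the RLE algorithm mentioned earlier and mainly addresses large amounts of consecutive data.
For example, the values 3,4,5,10,20,21,22,23 are encoded as 3,2,10,0,20,3. The idea is simple: record each starting value and the number of consecutive values that follow it, and store the compressed result in a short array.
Clearly, this kind of compression is very sensitive to how dense the data is. Take the two extremes. If all values in the Container are consecutive, that is, [0,1,2,...,65535], the compressed form is 0,65535: 2 shorts, or 4 bytes. If no two values in the Container are adjacent (all even or all odd), for example [0,2,4,6,...,65532,65534], the compressed form is 0,0,2,0,...,65534,0. Instead of compressing, this doubles the size compared with a plain short array: 65536 shorts, or 128 KB.
Whether to use a RunContainer therefore has to be decided case by case. RBM provides a conversion method, runOptimize(), that compares the space needed against the other two Containers and converts when RunContainer is smaller.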
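The encoding used in the example can be reproduced with a short sketch. RunLengthSketch and encode are names made up here; the method works on int values for readability and assumes the input is sorted and free of duplicates, whereas the library stores the pairs in a short array.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class RunLengthSketch {
    /**
     * Encodes a sorted, duplicate-free array as interleaved (start, length) pairs,
     * where length is the number of consecutive values that follow start.
     */
    static int[] encode(int[] sorted) {
        List<Integer> out = new ArrayList<>();
        int i = 0;
        while (i < sorted.length) {
            int start = sorted[i];
            int length = 0;
            // Extend the run while the next value is exactly one larger.
            while (i + 1 < sorted.length && sorted[i + 1] == sorted[i] + 1) {
                i++;
                length++;
            }
            out.add(start);
            out.add(length);
            i++;
        }
        return out.stream().mapToInt(Integer::intValue).toArray();
    }

    public static void main(String[] args) {
        int[] encoded = encode(new int[] {3, 4, 5, 10, 20, 21, 22, 23});
        System.out.println(Arrays.toString(encoded)); // [3, 2, 10, 0, 20, 3]
    }
}
```

Every isolated value still costs a full (start, 0) pair, which is why data without any adjacent values grows instead of shrinking.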


With the three Containers in mind, let's see how the following data would be stored:

        RoaringBitmap roaringBitmap = new RoaringBitmap();
        roaringBitmap.add(1);
        roaringBitmap.add(10);
        roaringBitmap.add(100);
        roaringBitmap.add(1000);
        roaringBitmap.add(10000);
        for (int i = 65536; i < 65536*2; i+=2) {
            roaringBitmap.add(i);
        }
        roaringBitmap.add(65536L*3, 65536L*4);
        roaringBitmap.runOptimize();

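Before calling runOptimize()
[Figure: container layout before runOptimize()]
After calling runOptimize()
[Figure: container layout after runOptimize()]
First the data is split by its high 16 bits, which clearly gives 3 groups. The first group has index 0 and holds 1, 10, 100, 1000, and 10000; the second has index 1 and holds all even numbers from 65536 to 131071; the third has index 3 and holds all integers from 196608 to 262143. The result shows that the index keys are in ascending order. The first group has fewer than 4096 values, so it is an ArrayContainer; the second group has more than 4096 values, so it is a BitmapContainer; the third group also has more than 4096 values, so it is a BitmapContainer as well. After runOptimize() is called, the third group becomes a RunContainer because it takes less space once encoded with RLE.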

3. Source code analysis

① Add
  /**
   * Add the value to the container (set the value to "true"), whether it already appears or not.
   *
   * Java lacks native unsigned integers but the x argument is considered to be unsigned.
   * Within bitmaps, numbers are ordered according to {@link Integer#compareUnsigned}.
   * We order the numbers like 0, 1, ..., 2147483647, -2147483648, -2147483647,..., -1.
   *
   * @param x integer value
   */
  @Override
  public void add(final int x) {
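    // take the high 16 bits of the value x being inserted
    final short hb = Util.highbits(x);
    // look up the position of that key in the keys array
    final int i = highLowContainer.getIndex(hb);
    // i >= 0 means the key already exists along with its Container, so add the low 16 bits to that Container
    if (i >= 0) {
      highLowContainer.setContainerAtIndex(i,
          highLowContainer.getContainerAtIndex(i).add(Util.lowbits(x)));
      // i < 0 means the key does not exist, so create an ArrayContainer and put the low 16 bits into it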
    } else {
      final ArrayContainer newac = new ArrayContainer();
      highLowContainer.insertNewKeyValueAt(-i - 1, hb, newac.add(Util.lowbits(x)));
    }
  }

That is the outer code for adding an element to an RBM, and it is easy to follow with the comments. Now let's look at what happens inside:

First, look at the comment above the add method, together with the explanation on GitHub:

Java lacks native unsigned integers but integers are still considered to be unsigned within Roaring and ordered according to Integer.compareUnsigned. This means that Java will order the numbers like so 0, 1, …, 2147483647, -2147483648, -2147483647,…, -1. To interpret correctly, you can use Integer.toUnsignedLong and Integer.toUnsignedString.

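Java has no native unsigned int, but numbers added to an RBM are treated as unsigned, so RBM orders them by the result of Integer.compareUnsigned. From smallest to largest, the order is 0, 1, ..., 2147483647, -2147483648, -2147483647, ..., -1.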
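This ordering is easy to reproduce with the JDK alone. The sketch below uses only standard methods (Integer.compareUnsigned and Integer.toUnsignedLong); the class name UnsignedOrderSketch and the helper sortUnsigned are made up for this example.

```java
import java.util.Arrays;

final class UnsignedOrderSketch {
    /** Sorts ints as if they were unsigned 32-bit values. */
    static Integer[] sortUnsigned(int... values) {
        Integer[] boxed = Arrays.stream(values).boxed().toArray(Integer[]::new);
        Arrays.sort(boxed, Integer::compareUnsigned);
        return boxed;
    }

    public static void main(String[] args) {
        Integer[] sorted = sortUnsigned(-1, 0, Integer.MIN_VALUE, Integer.MAX_VALUE, 1);
        // Prints [0, 1, 2147483647, -2147483648, -1]
        System.out.println(Arrays.toString(sorted));
        // Interpreting -1 as unsigned gives 4294967295
        System.out.println(Integer.toUnsignedLong(-1));
    }
}
```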

Taking the high 16 bits of x

  // shifting x right by 16 bits and casting to short yields the high 16 bits of x
  protected static short highbits(int x) {
    return (short) (x >>> 16);
  }

Looking up the position of the key for the high 16 bits

  // involves a binary search
  protected int getIndex(short x) {
    // before the binary search, we optimize for frequent cases
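    // two common cases can be answered quickly without a binary search: 1. the RoaringArray is empty, so return -1
    // 2. the key equals the largest value in keys, so return size - 1; this check works because keys is sorted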
    if ((size == 0) || (keys[size - 1] == x)) {
      return size - 1;
    }
    // no luck we have to go through the list
    // all other cases go through the binary search
    return this.binarySearch(0, size, x);
  }
  
  private int binarySearch(int begin, int end, short key) {
    return Util.unsignedBinarySearch(keys, begin, end, key);
  }

  /**
   * Look for value k in array in the range [begin,end). If the value is found, return its index. If
   * not, return -(i+1) where i is the index where the value would be inserted. The array is assumed
   * to contain sorted values where shorts are interpreted as unsigned integers.
   *
   * @param array array where we search
   * @param begin first index (inclusive)
   * @param end last index (exclusive)
   * @param k value we search for
   * @return count
   */
  public static int unsignedBinarySearch(final short[] array, final int begin, final int end,
      final short k) {
    // hybrid search: binary search followed by sequential search; this strategy is always used
    if (USE_HYBRID_BINSEARCH) {
      return hybridUnsignedBinarySearch(array, begin, end, k);
    } else {
      return branchyUnsignedBinarySearch(array, begin, end, k);
    }
  }


  // starts with binary search and finishes with a sequential search
  protected static int hybridUnsignedBinarySearch(final short[] array, final int begin,
      final int end, final short k) {
    int ikey = toIntUnsigned(k);
    // next line accelerates the possibly common case where the value would
    // be inserted at the end
    if ((end > 0) && (toIntUnsigned(array[end - 1]) < ikey)) {
      return -end - 1;
    }
    int low = begin;
    int high = end - 1;
    // 32 in the next line matches the size of a cache line
    while (low + 32 <= high) {
      final int middleIndex = (low + high) >>> 1;
      final int middleValue = toIntUnsigned(array[middleIndex]);

      if (middleValue < ikey) {
        low = middleIndex + 1;
      } else if (middleValue > ikey) {
        high = middleIndex - 1;
      } else {
        return middleIndex;
      }
    }
    // we finish the job with a sequential search
    int x = low;
    for (; x <= high; ++x) {
      final int val = toIntUnsigned(array[x]);
      if (val >= ikey) {
        if (val == ikey) {
          return x;
        }
        break;
      }
    }
    return -(x + 1);
  }

  // the unsigned conversion mentioned above: non-negative values are unchanged, negative values become value + 2^16
  protected static int toIntUnsigned(short x) {
    return x & 0xFFFF;
  }

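The lookup above combines binary search with sequential search: while the search range is larger than 32 it uses binary search, and then it finishes with a sequential scan. The search itself is straightforward, so we won't go through it. One thing to note: if the key is found, its index is returned; if not, -(i + 1) is returned, where i is the position at which the key would be inserted. This is a neat design that conveys both the position and whether the key exists in a single return value.
Next, the sign of the returned index is handled. If it is less than 0, the key does not exist, so an ArrayContainer is created and the low 16 bits are put into it; if it is greater than or equal to 0, the key already exists along with its Container, so the low 16 bits are stored in that Container. Next we analyze the core part: the add process of the three Containers.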
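The return-value convention is the same one the JDK's Arrays.binarySearch uses, so it can be demonstrated without the library. In the sketch below, SearchResultSketch and its helpers are illustrative names, and the keys are plain non-negative ints rather than unsigned shorts.

```java
import java.util.Arrays;

final class SearchResultSketch {
    /** Returns the index of key, or -(insertionPoint + 1) when key is absent. */
    static int find(int[] sortedKeys, int size, int key) {
        return Arrays.binarySearch(sortedKeys, 0, size, key);
    }

    static boolean isPresent(int result) {
        return result >= 0;
    }

    /** Decodes the insertion point from a negative result. */
    static int insertionPoint(int result) {
        return -result - 1;
    }

    public static void main(String[] args) {
        int[] keys = {0, 1, 3};
        int hit = find(keys, 3, 3);
        int miss = find(keys, 3, 2);
        System.out.println(hit);                  // 2
        System.out.println(miss);                 // -3
        System.out.println(insertionPoint(miss)); // 2
    }
}
```

The extra 1 in -(i + 1) matters: without it, "not found, insert at position 0" would be indistinguishable from "found at index 0".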

  • ArrayContainer add process
  /**
   * running time is in O(n) time if insert is not in order.
   */
  @Override
  public Container add(final short x) {
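    // two cases can skip the binary search: 1. the cardinality is 0
    // 2. x is larger than the largest value in the container; this works because content is sorted, so the last element is the largest
    if (cardinality == 0 || (cardinality > 0
            && toIntUnsigned(x) > toIntUnsigned(content[cardinality - 1]))) {
      // once the cardinality reaches the threshold 4096, convert to a BitmapContainer and add the element; the conversion is explained below
      if (cardinality >= DEFAULT_MAX_SIZE) {
        return toBitmapContainer().add(x);
      }
      // if the cardinality has reached the length of the content array, grow the array
      if (cardinality >= this.content.length) {
        increaseCapacity();
      }
      // append the value
      content[cardinality++] = x;
    } else {
      // find the insertion position by binary search
      int loc = Util.unsignedBinarySearch(content, 0, cardinality, x);
      // absent: insert it; present: do nothing and return (this is what deduplicates)
      if (loc < 0) {
        // Transform the ArrayContainer to a BitmapContainer
        // when cardinality = DEFAULT_MAX_SIZE
        // as above: once the cardinality reaches the threshold 4096, convert to a BitmapContainer and add the element
        if (cardinality >= DEFAULT_MAX_SIZE) {
          return toBitmapContainer().add(x);
        }
        // as above: if the cardinality has reached the length of the content array, grow the array
        if (cardinality >= this.content.length) {
          increaseCapacity();
        }
        // insertion : shift the elements > x by one position to
        // the right
        // and put x in it's appropriate place
        // insert x into the content array by copying the tail one slot to the right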
        System.arraycopy(content, -loc - 1, content, -loc, cardinality + loc + 1);
        content[-loc - 1] = x;
        ++cardinality;
      }
    }
    return this;
  }

Code for converting an ArrayContainer to a BitmapContainer

  /**
   * Copies the data in a bitmap container.
   *
   * @return the bitmap container
   */
  @Override
  public BitmapContainer toBitmapContainer() {
    BitmapContainer bc = new BitmapContainer();
    bc.loadData(this);
    return bc;
  }


  /**
   * Create a bitmap container with all bits set to false
   */
  public BitmapContainer() {
    this.cardinality = 0;
    // the length is fixed at 1024
    this.bitmap = new long[MAX_CAPACITY / 64];
  }


  protected void loadData(final ArrayContainer arrayContainer) {
    this.cardinality = arrayContainer.cardinality;
    for (int k = 0; k < arrayContainer.cardinality; ++k) {
      final short x = arrayContainer.content[k];
      // set each bit in a loop; the bit arithmetic here is explained in the BitmapContainer add process
      bitmap[Util.toIntUnsigned(x) / 64] |= (1L << x);
    }
  }

Capacity growth code

  // temporarily allow an illegally large size, as long as the operation creating
  // the illegal container does not return it.
  // grows the capacity by a factor that depends on the current length; not hard to follow
  private void increaseCapacity(boolean allowIllegalSize) {
    int newCapacity = (this.content.length == 0) ? DEFAULT_INIT_SIZE
        : this.content.length < 64 ? this.content.length * 2
            : this.content.length < 1067 ? this.content.length * 3 / 2
                : this.content.length * 5 / 4;
    // never allocate more than we will ever need
    if (newCapacity > ArrayContainer.DEFAULT_MAX_SIZE && !allowIllegalSize) {
      newCapacity = ArrayContainer.DEFAULT_MAX_SIZE;
    }
    // if we are within 1/16th of the max, go to max
    if (newCapacity > ArrayContainer.DEFAULT_MAX_SIZE - ArrayContainer.DEFAULT_MAX_SIZE / 16
        && !allowIllegalSize) {
      newCapacity = ArrayContainer.DEFAULT_MAX_SIZE;
    }
    this.content = Arrays.copyOf(this.content, newCapacity);
  }

  • BitmapContainer add process
  @Override
  public Container add(final short i) {
    final int x = Util.toIntUnsigned(i);
    final long previous = bitmap[x / 64];
    long newval = previous | (1L << x);
    bitmap[x / 64] = newval;
    if (USE_BRANCHLESS) {
      cardinality += (previous ^ newval) >>> x;
    } else if (previous != newval) {
      ++cardinality;
    }
    return this;
  }

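There is not much code; let's analyze it:
Adding x to a BitmapContainer means finding which long x falls into and setting the corresponding bit of that long to 1. In other words, we need the quotient of x / 64 (which long) and the remainder (which bit of the long).
x / 64 rounded down gives the index into the long array, and final long previous = bitmap[x / 64] gets the old value of that long. 1L << x is equivalent to 1L << (x % 64), a value with only the target bit set to 1; OR-ing it with the old value gives the new value.
Why is 1L << x equivalent to 1L << (x % 64)? See section 15.19, Shift Operators, of the Java Language Specification: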

If the promoted type of the left-hand operand is int, then only the five lowest-order bits of the right-hand operand are used as the shift distance. It is as if the right-hand operand were subjected to a bitwise logical AND operator & (§15.22.1) with the mask value 0x1f (0b11111). The shift distance actually used is therefore always in the range 0 to 31, inclusive.
If the promoted type of the left-hand operand is long, then only the six lowest-order bits of the right-hand operand are used as the shift distance. It is as if the right-hand operand were subjected to a bitwise logical AND operator & (§15.22.1) with the mask value 0x3f (0b111111). The shift distance actually used is therefore always in the range 0 to 63, inclusive.

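If the left operand is an int, only the lowest 5 bits of the right operand are used as the shift distance, so the actual shift distance is always between 0 and 31.
If the left operand is a long, only the lowest 6 bits of the right operand are used as the shift distance, so the actual shift distance is always between 0 and 63.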

So for the non-negative x used here, 1L << x is equivalent to 1L << (x % 64).
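The masking rule is quick to verify. The sketch below relies only on the language behaviour quoted above; ShiftMaskSketch and shiftsAgree are names invented for the check.

```java
final class ShiftMaskSketch {
    /** Checks that 1L << x equals 1L << (x % 64) for every x in [0, maxExclusive). */
    static boolean shiftsAgree(int maxExclusive) {
        for (int x = 0; x < maxExclusive; x++) {
            if ((1L << x) != (1L << (x % 64))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Covers every possible low-16-bit value of a container.
        System.out.println(shiftsAgree(1 << 16));        // true
        // For a long, a shift by 70 uses only the low 6 bits of 70, which is 6.
        System.out.println((1L << 70) == (1L << 6));     // true
        // For an int, a shift by 33 uses only the low 5 bits of 33, which is 1.
        System.out.println((1 << 33) == 2);              // true
    }
}
```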


  • RunContainer add process
  @Override
  public Container add(short k) {
    // TODO: it might be better and simpler to do return
    // toBitmapOrArrayContainer(getCardinality()).add(k)
    // but note that some unit tests use this method to build up test runcontainers without calling
    // runOptimize
    // the same binary + sequential search, except that it probes every second slot, since only the start values are searched
    int index = unsignedInterleavedBinarySearch(valueslength, 0, nbrruns, k);
    // index >= 0 means k is the start value of some run and already present, so return
    if (index >= 0) {
      return this;// already there
    }
    // index < 0 means k is not a start value, so more checks are needed
    // index of the preceding start value (the nearest start value smaller than k)
    index = -index - 2;// points to preceding value, possibly -1
    // index >= 0 means some run starts before k, so k is not the smallest value
    if (index >= 0) {// possible match
      // offset between k and the preceding start value
      int offset = toIntUnsigned(k) - toIntUnsigned(getValue(index));
      // run length of the preceding run
      int le = toIntUnsigned(getLength(index));
      // if the offset is at most the run length, k lies inside that run, so return
      if (offset <= le) {
        return this;
      }
      // if the offset equals the run length + 1, k is the last value of the preceding run + 1
      if (offset == le + 1) {
        // we may need to fuse
        // the preceding run is not the last run, so the two neighboring runs may need to be fused
        if (index + 1 < nbrruns) {
          // if the next run starts at k + 1, fuse the two adjacent runs
          if (toIntUnsigned(getValue(index + 1)) == toIntUnsigned(k) + 1) {
            // indeed fusion is needed
            // reset the run length
            setLength(index,
                (short) (getValue(index + 1) + getLength(index + 1) - getValue(index)));
            // remove the now redundant run by copying the array, and decrement nbrruns
            recoverRoomAtIndex(index + 1);
            return this;
          }
        }
        // no fusion needed: just extend the preceding run by 1
        incrementLength(index);
        return this;
      }
      // if another run follows k, k may need to be fused with that run
      if (index + 1 < nbrruns) {
        // we may need to fuse
        // if the next run starts at k + 1, fuse k with that run
        if (toIntUnsigned(getValue(index + 1)) == toIntUnsigned(k) + 1) {
          // indeed fusion is needed
          // reset the start value and the run length
          setValue(index + 1, k);
          setLength(index + 1, (short) (getLength(index + 1) + 1));
          return this;
        }
      }
    }
    // index == -1 means k is smaller than every existing start value
    if (index == -1) {
      // we may need to extend the first run
      // if a run exists and its start value equals k + 1, reset the start value and the run length
      if (0 < nbrruns) {
        if (getValue(0) == k + 1) {
          incrementLength(0);
          decrementValue(0);
          return this;
        }
      }
    }
    // general handling for all other cases
    makeRoomAtIndex(index + 1);
    setValue(index + 1, k);
    setLength(index + 1, (short) 0);
    return this;
  }

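Adding to a RunContainer is more involved than for the other two containers: it handles many special cases first and then falls through to the general case.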

② Conversion

Calling runOptimize() manually triggers optimization of the Containers, converting between ArrayContainer or BitmapContainer and RunContainer as the actual data warrants.

The official paper's explanation of container conversion:

Thus, when first creating a Roaring bitmap, it is usually made of array and bitmap containers.
Runs are not compressed. Upon request, the storage of the Roaring bitmap can be optimized using
the runOptimize function. This triggers a scan through the array and bitmap containers that
converts them, if helpful, to run containers. In a given application, this might be done prior to
storing the bitmaps as immutable objects to be queried. Run containers may also arise from calling
a function to add a range of values.
To decide the best container type, we are motivated to minimize storage. In serialized form, a run
container uses 2 + 4r bytes given r runs, a bitmap container always uses 8192 bytes and an array
container uses 2c + 2 bytes, where c is the cardinality.
Therefore, we apply the following rules:

  • All array containers are such that they use no more space than they would as a bitmap
    container: they contain no more than 4096 values.
  • Bitmap containers use less space than they would as array containers: they contain more than
    4096 values.
  • A run container is only allowed to exist if it is smaller than either the array container or
    the bitmap container that could equivalently store the same values. If the run container has
    cardinality greater than 4096 values, then it must contain no more than ⌊(8192 − 2)/4⌋ =
    2047 runs. If the run container has cardinality no more than 4096, then the number of runs
    must be less than half the cardinality.
  /**
   * Use a run-length encoding where it is more space efficient
   *
   * @return whether a change was applied
   */
  public boolean runOptimize() {
    boolean answer = false;
    for (int i = 0; i < this.highLowContainer.size(); i++) {
      Container c = this.highLowContainer.getContainerAtIndex(i).runOptimize();
      if (c instanceof RunContainer) {
        answer = true;
      }
      this.highLowContainer.setContainerAtIndex(i, c);
    }
    return answer;
  }
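The size rules quoted from the paper can be written down as a small decision function. This is a sketch of those rules only, not the library's implementation: ContainerChoiceSketch, Kind, and choose are invented names, and the byte counts are the serialized sizes given in the quote (2 + 4r for runs, 8192 for a bitmap, 2c + 2 for an array), which differ slightly from the cardinality * 2 used by getArraySizeInBytes() in the library code shown in this article.

```java
final class ContainerChoiceSketch {
    enum Kind { ARRAY, BITMAP, RUN }

    static final int BITMAP_BYTES = 8192;

    static int runBytes(int runs) {
        return 2 + 4 * runs;
    }

    static int arrayBytes(int cardinality) {
        return 2 * cardinality + 2;
    }

    /** Picks the smallest representation for a container with the given cardinality and run count. */
    static Kind choose(int cardinality, int runs) {
        boolean arrayAllowed = cardinality <= 4096;
        int bestWithoutRuns = arrayAllowed ? arrayBytes(cardinality) : BITMAP_BYTES;
        // A run container is only worthwhile when it is strictly smaller than the alternative.
        if (runBytes(runs) < bestWithoutRuns) {
            return Kind.RUN;
        }
        return arrayAllowed ? Kind.ARRAY : Kind.BITMAP;
    }

    public static void main(String[] args) {
        System.out.println(choose(65536, 1));     // RUN: one run covering the whole chunk
        System.out.println(choose(32768, 32768)); // BITMAP: all even numbers, no adjacent values
        System.out.println(choose(5, 5));         // ARRAY: a handful of isolated values
    }
}
```

With these formulas, the two thresholds in the quote fall out directly: above 4096 values a run container wins only up to 2047 runs, and at or below 4096 values it wins only when the run count is less than half the cardinality.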

Let's look at the conversion for each Container

  • ArrayContainer
  @Override
  public Container runOptimize() {
    // TODO: consider borrowing the BitmapContainer idea of early
    // abandonment
    // with ArrayContainers, when the number of runs in the arrayContainer
    // passes some threshold based on the cardinality.
    int numRuns = numberOfRuns();
    int sizeAsRunContainer = RunContainer.serializedSizeInBytes(numRuns);
    if (getArraySizeInBytes() > sizeAsRunContainer) {
      return new RunContainer(this, numRuns); // this could be maybe
                                              // faster if initial
                                              // container is a bitmap
    } else {
      return this;
    }
  }

Counting the runs needed after conversion to a RunContainer

  @Override
  int numberOfRuns() {
    if (cardinality == 0) {
      return 0; // should never happen
    }
    int numRuns = 1;
    int oldv = toIntUnsigned(content[0]);
    // loop over all values; whenever two neighbors are not consecutive, the run count goes up by 1
    for (int i = 1; i < cardinality; i++) {
      int newv = toIntUnsigned(content[i]);
      if (oldv + 1 != newv) {
        ++numRuns;
      }
      oldv = newv;
    }
    return numRuns;
  }

Computing the serialized size in bytes as a RunContainer

  protected static int serializedSizeInBytes(int numberOfRuns) {
    return 2 + 2 * 2 * numberOfRuns; // each run requires 2 2-byte entries.
  }

Computing the ArrayContainer size for comparison

  @Override
  protected int getArraySizeInBytes() {
    return cardinality * 2;
  }

Constructing the RunContainer

  protected RunContainer(ArrayContainer arr, int nbrRuns) {
    this.nbrruns = nbrRuns;
    // the length is twice the number of runs
    valueslength = new short[2 * nbrRuns];
    if (nbrRuns == 0) {
      return;
    }

    int prevVal = -2;
    int runLen = 0;
    int runCount = 0;

    // loop over every element, check whether neighbors are consecutive, and set start values and run lengths
    for (int i = 0; i < arr.cardinality; i++) {
      int curVal = toIntUnsigned(arr.content[i]);
      if (curVal == prevVal + 1) {
        ++runLen;
      } else {
        if (runCount > 0) {
          setLength(runCount - 1, (short) runLen);
        }
        setValue(runCount, (short) curVal);
        runLen = 0;
        ++runCount;
      }
      prevVal = curVal;
    }
    setLength(runCount - 1, (short) runLen);
  }

  • BitmapContainer
  @Override
  public Container runOptimize() {
    int numRuns = numberOfRunsLowerBound(MAXRUNS); // decent choice

    int sizeAsRunContainerLowerBound = RunContainer.serializedSizeInBytes(numRuns);

    if (sizeAsRunContainerLowerBound >= getArraySizeInBytes()) {
      return this;
    }
    // else numRuns is a relatively tight bound that needs to be exact
    // in some cases (or if we need to make the runContainer the right
    // size)
    numRuns += numberOfRunsAdjustment();
    int sizeAsRunContainer = RunContainer.serializedSizeInBytes(numRuns);

    if (getArraySizeInBytes() > sizeAsRunContainer) {
      return new RunContainer(this, numRuns);
    } else {
      return this;
    }
  }

Computing a lower bound on the number of runs

  // nruns value for which RunContainer.serializedSizeInBytes ==
  // BitmapContainer.getArraySizeInBytes()
  private final int MAXRUNS = (getArraySizeInBytes() - 2) / 4;

  @Override
  protected int getArraySizeInBytes() {
    return MAX_CAPACITY / 8;
  }

  /**
   * Counts how many runs there is in the bitmap, up to a maximum
   *
   * @param mustNotExceed maximum of runs beyond which counting is pointless
   * @return estimated number of courses
   */
  public int numberOfRunsLowerBound(int mustNotExceed) {
    int numRuns = 0;

    for (int blockOffset = 0; blockOffset + BLOCKSIZE <= bitmap.length; blockOffset += BLOCKSIZE) {

      for (int i = blockOffset; i < blockOffset + BLOCKSIZE; i++) {
        long word = bitmap[i];
        numRuns += Long.bitCount((~word) & (word << 1));
      }
      if (numRuns > mustNotExceed) {
        return numRuns;
      }
    }
    return numRuns;
  }

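MAXRUNS: we know a BitmapContainer has a fixed size of 8 KB, or 8192 bytes, so we can write an equation for the maximum number of runs; beyond it a RunContainer would take more space and conversion would be pointless. Solving 2 + 2 * 2 * runs = 8192 gives runs = 2047.5, so the threshold is 2047 after rounding down.
Computing the lower bound is not complicated, but it contains an interesting trick: numRuns += Long.bitCount((~word) & (word << 1)), which counts runs. Here is how it works. Each 1 among the 64 bits of a word represents a number, so counting runs amounts to counting groups of consecutive 1s in the word, which in turn amounts to counting the places where a 1 is immediately followed by a 0. We shift word left by one bit and AND it with the complement of word; every place where a 1 is followed by a 0 produces a 1 in the result. Long.bitCount(long i) then counts the 1s in the result, which gives the number of runs. A run that reaches bit 63 has no following bit inside the word, so it is not counted here.
This count is therefore only a rough value (a lower bound). If the RunContainer size computed from it is smaller than the BitmapContainer, the exact number of runs is computed next.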

Adjusting the run count

  /**
   * Computes the number of runs
   *
   * @return the number of runs
   */
  public int numberOfRunsAdjustment() {
    int ans = 0;
    long nextWord = bitmap[0];
    for (int i = 0; i < bitmap.length - 1; i++) {
      final long word = nextWord;

      nextWord = bitmap[i + 1];
      ans += ((word >>> 63) & ~nextWord);
    }
    final long word = nextWord;

    if ((word & 0x8000000000000000L) != 0) {
      ans++;
    }
    return ans;
  }

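ans += ((word >>> 63) & ~nextWord);: this is not hard to follow. An adjustment is needed when bit 63 of the current word is 1 and bit 0 of the next word is 0, meaning a run ends exactly at the word boundary; in that case the run count goes up by 1. The final check does the same for a run that ends at bit 63 of the last word.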
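The lower bound and the adjustment can be combined into one exact count. The sketch below merges the two passes into a single loop for illustration, so it is not structured like the library code; RunCountSketch and countRuns are names chosen for this example.

```java
final class RunCountSketch {
    /** Counts the runs of consecutive 1 bits in a bitmap stored as a long array. */
    static int countRuns(long[] bitmap) {
        int runs = 0;
        for (int i = 0; i < bitmap.length; i++) {
            long word = bitmap[i];
            // Runs that end inside this word: a 1 bit immediately followed by a 0 bit.
            runs += Long.bitCount(~word & (word << 1));
            // A run reaching bit 63 ends here unless the next word continues it at bit 0.
            boolean topBitSet = (word >>> 63) != 0;
            boolean continues = i + 1 < bitmap.length && (bitmap[i + 1] & 1L) != 0;
            if (topBitSet && !continues) {
                runs++;
            }
        }
        return runs;
    }

    public static void main(String[] args) {
        // Bits 2, 3, 5, and 6 set: two runs.
        System.out.println(countRuns(new long[] {0b01101100L})); // 2
        // All 64 bits of the first word plus bit 0 of the second: one run of 65 values.
        System.out.println(countRuns(new long[] {-1L, 1L}));     // 1
    }
}
```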

Constructing the RunContainer

  // convert a bitmap container to a run container somewhat efficiently.
  protected RunContainer(BitmapContainer bc, int nbrRuns) {
    this.nbrruns = nbrRuns;
    valueslength = new short[2 * nbrRuns];
    if (nbrRuns == 0) {
      return;
    }

    int longCtr = 0; // index of current long in bitmap
    long curWord = bc.bitmap[0]; // its value
    int runCount = 0;
    while (true) {
      // potentially multiword advance to first 1 bit
      while (curWord == 0L && longCtr < bc.bitmap.length - 1) {
        curWord = bc.bitmap[++longCtr];
      }

      if (curWord == 0L) {
        // wrap up, no more runs
        return;
      }
      int localRunStart = Long.numberOfTrailingZeros(curWord);
      int runStart = localRunStart + 64 * longCtr;
      // stuff 1s into number's LSBs
      long curWordWith1s = curWord | (curWord - 1);

      // find the next 0, potentially in a later word
      int runEnd = 0;
      while (curWordWith1s == -1L && longCtr < bc.bitmap.length - 1) {
        curWordWith1s = bc.bitmap[++longCtr];
      }

      if (curWordWith1s == -1L) {
        // a final unterminated run of 1s (32 of them)
        runEnd = 64 + longCtr * 64;
        setValue(runCount, (short) runStart);
        setLength(runCount, (short) (runEnd - runStart - 1));
        return;
      }
      int localRunEnd = Long.numberOfTrailingZeros(~curWordWith1s);
      runEnd = localRunEnd + longCtr * 64;
      setValue(runCount, (short) runStart);
      setLength(runCount, (short) (runEnd - runStart - 1));
      runCount++;
      // now, zero out everything right of runEnd.
      curWord = curWordWith1s & (curWordWith1s + 1);
      // We've lathered and rinsed, so repeat...
    }
  }

  • RunContainer
    It converts to an ArrayContainer or a BitmapContainer depending on size; the code is not covered in detail here.

4. Common APIs

RoaringBitmap:

// add a single value
public void add(final int x)

// add a range of values
public void add(final long rangeStart, final long rangeEnd)

// remove a value
public void remove(final int x)

// iterate over the RBM
public void forEach(IntConsumer ic)

// check whether a value is contained
public boolean contains(final int x)

// get the cardinality
public int getCardinality()

// bitwise AND: intersection of two RBMs; the current RBM is modified
public void and(final RoaringBitmap x2)

// same as above, but returns a new RBM and leaves the originals unmodified; thread-safe
public static RoaringBitmap and(final RoaringBitmap x1, final RoaringBitmap x2)

// bitwise OR: union of two RBMs; the current RBM is modified
public void or(final RoaringBitmap x2)

// same as above, but returns a new RBM and leaves the originals unmodified; thread-safe
public static RoaringBitmap or(final RoaringBitmap x1, final RoaringBitmap x2)

// XOR: symmetric difference of two RBMs; the current RBM is modified
public void xor(final RoaringBitmap x2)

// same as above, but returns a new RBM and leaves the originals unmodified; thread-safe
public static RoaringBitmap xor(final RoaringBitmap x1, final RoaringBitmap x2)

// difference between the current RBM and x2; the current RBM is modified
public void andNot(final RoaringBitmap x2)

// same as above, but returns a new RBM and leaves the originals unmodified; thread-safe
public static RoaringBitmap andNot(final RoaringBitmap x1, final RoaringBitmap x2)

// serialization
public void serialize(DataOutput out) throws IOException
public void serialize(ByteBuffer buffer)

// deserialization
public void deserialize(DataInput in) throws IOException
public void deserialize(ByteBuffer bbf) throws IOException

FastAggregation: some fast aggregation operations

public static RoaringBitmap and(Iterator<? extends RoaringBitmap> bitmaps)

public static RoaringBitmap or(Iterator<? extends RoaringBitmap> bitmaps)

public static RoaringBitmap xor(Iterator<? extends RoaringBitmap> bitmaps)

5. Serialization

Serializing an RBM to a byte array and storing it in MySQL

    private byte[] serializeRoaringBitmap() {
        RoaringBitmap roaringBitmap = new RoaringBitmap();
        roaringBitmap.add(1L, 100L);
        // serializedSizeInBytes() gives the buffer size the serialized form needs
        int size = roaringBitmap.serializedSizeInBytes();
        ByteBuffer byteBuffer = ByteBuffer.allocate(size);
        roaringBitmap.serialize(byteBuffer);
        return byteBuffer.array();
    }

Once serialized to a byte array, it is stored in a MySQL BLOB column.

Deserializing the BLOB column from MySQL into an RBM

    private RoaringBitmap deSerializeRoaringBitmap(Blob blob) throws SQLException {
        byte[] content = blob.getBytes(1, (int) blob.length());
        ByteBuffer byteBuffer = ByteBuffer.wrap(content);
        return new RoaringBitmap(new ImmutableRoaringBitmap(byteBuffer));
    }

The official example of serialization with Kryo 5

Many applications use Kryo for serialization/deserialization. One can use Roaring bitmaps with Kryo efficiently thanks to a custom serializer (Kryo 5):

public class RoaringSerializer extends Serializer<RoaringBitmap> {
    @Override
    public void write(Kryo kryo, Output output, RoaringBitmap bitmap) {
        try {
            bitmap.serialize(new KryoDataOutput(output));
        } catch (IOException e) {
            e.printStackTrace();
            throw new RuntimeException();
        }
    }
    @Override
    public RoaringBitmap read(Kryo kryo, Input input, Class<? extends RoaringBitmap> type) {
        RoaringBitmap bitmap = new RoaringBitmap();
        try {
            bitmap.deserialize(new KryoDataInput(input));
        } catch (IOException e) {
            e.printStackTrace();
            throw new RuntimeException();
        }
        return bitmap;
    }

}

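In our project we used FST serialization and found that FSTStreamEncoder conflicts with RoaringBitmap's serialization. FSTStreamEncoder stores types such as int, short, and long with a variable length (for example, an int takes 1 to 5 bytes depending on its value), whereas RoaringBitmap's serialization and deserialization rely on these types being written with a fixed number of bytes, so custom handling is needed.
We register RoaringBitmap with a custom serializer, in which we wrap FSTObjectOutput and FSTObjectInput with proxies and override methods such as writeShort, writeInt, and writeLong so that they write raw byte arrays instead.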
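The mismatch can be illustrated with the JDK alone. The sketch below does not use FST: fixedWidth relies on the standard DataOutputStream.writeInt, which always emits 4 bytes, while variableWidth is a hypothetical 7-bits-per-byte encoder written for this example to stand in for a variable-length scheme. FixedWidthSketch and both method names are invented here.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class FixedWidthSketch {
    /** Fixed-width encoding: DataOutputStream.writeInt always emits exactly 4 bytes. */
    static byte[] fixedWidth(int value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Hypothetical variable-length encoding: 7 payload bits per byte, high bit marks "more follows". */
    static byte[] variableWidth(int value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        int v = value;
        while ((v & ~0x7F) != 0) {
            bytes.write((v & 0x7F) | 0x80);
            v >>>= 7;
        }
        bytes.write(v);
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(fixedWidth(1).length);     // 4
        System.out.println(variableWidth(1).length);  // 1
        System.out.println(variableWidth(-1).length); // 5
    }
}
```

A reader that expects every int to occupy 4 bytes will lose its place in the stream as soon as one int was written with a different width, which is why the overridden methods fall back to writing raw bytes.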


IV. Additional notes

1. RBM support for long:

64-bit integers (long)
Though Roaring Bitmaps were designed with the 32-bit case in mind, we have an extension to 64-bit integers:

      import org.roaringbitmap.longlong.*;

      LongBitmapDataProvider r = Roaring64NavigableMap.bitmapOf(1,2,100,1000);
      r.addLong(1234);
      System.out.println(r.contains(1)); // true
      System.out.println(r.contains(3)); // false
      LongIterator i = r.getLongIterator();
      while(i.hasNext()) System.out.println(i.next());

2. Version 0.8.12 replaced all unsigned shorts with chars

All source code in this post is based on version 0.8.11, so it still uses short.
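Someone opened an issue and replaced every unsigned short with char, removing all the toIntUnsigned and compareUnsigned methods along the way. The author approved the idea and merged it into the main branch.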
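As for performance, he tested foreach and or operations with real data: there were sizable gains, along with a few regressions. As of 0.8.14, the latest version at the time of writing, the change had not been reverted, so the impact is presumably small.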
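Why char makes the conversions unnecessary can be seen in a short sketch. It relies only on Java's type rules (short is a signed 16-bit type, char an unsigned one); UnsignedCharSketch and its methods are names made up for this example.

```java
final class UnsignedCharSketch {
    /** Plain comparison of shorts is signed, so values above 32767 wrap to negatives. */
    static boolean shortLess(short a, short b) {
        return a < b;
    }

    /** With shorts, an unsigned comparison needs an explicit mask on both sides. */
    static boolean shortLessUnsigned(short a, short b) {
        return (a & 0xFFFF) < (b & 0xFFFF);
    }

    /** char is an unsigned 16-bit type, so the plain comparison is already unsigned. */
    static boolean charLess(char a, char b) {
        return a < b;
    }

    public static void main(String[] args) {
        System.out.println(shortLess((short) 1, (short) 40000));         // false: 40000 wraps to -25536
        System.out.println(shortLessUnsigned((short) 1, (short) 40000)); // true
        System.out.println(charLess((char) 1, (char) 40000));            // true
    }
}
```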


References

RoaringBitmap GitHub
"Better bitmap performance with Roaring bitmaps"
"Consistently faster and smaller compressed bitmaps with Roaring"
Principles and Applications of RoaringBitmap, an Efficient Compressed Bitmap
Exact Deduplication and Roaring Bitmap
