Reposted from: http://blog.huang-wei.com/2010/11/02/bloom-filter/

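Introduction

A Bloom Filter is a simple, space-efficient, randomized data structure that supports set-membership queries. In C++ we would normally reach for std::set or stdext::hash_set: std::set is implemented as a red-black tree, and stdext::hash_set as a bucketed hash table. Both structures have to store the original elements, so memory becomes a problem once the data set grows large. If the application can tolerate a certain probability of false positives and never needs to enumerate the elements of the set, a Bloom Filter is a good fit.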

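Advantages

Queries are very efficient: a lookup costs k hash computations and k bit probes, regardless of how many elements are stored.
It saves space.
It is easy to parallelize.
Set operations are convenient.
It is easy to implement.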

Disadvantages

Queries can be wrong: false positives are possible.
The elements stored in the set cannot be retrieved.
Deletion is not supported.

Definition

A Bloom Filter is a bit array of m bits, all initially 0, together with k independent hash functions.

Figure 1

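Add operation

For each element, the k hash functions produce a hash vector of size k, $\left(H_{1}, H_{2}, \cdots, H_{k}\right)$, and the bit corresponding to each hash value in the vector is set to 1. The time complexity is $k \cdot O(F_{H})$, where $O(F_{H})$ is the cost of one hash function; for typical string hash functions this is linear in the length of the string.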
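The add operation can be sketched in a few lines of C++. This is only an illustration: `seeded_hash`, `hash_vector` and `bloom_add` are hypothetical names, and a seeded FNV-1a hash stands in for the k independent hash functions, which is an assumption rather than the scheme used by the original author.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: FNV-1a over the string, seeded with the hash index i,
// standing in for the k independent hash functions H_1 .. H_k.
inline std::uint64_t seeded_hash(const std::string& s, std::uint64_t i) {
    std::uint64_t h = 14695981039346656037ULL ^ (i * 0x9E3779B97F4A7C15ULL);
    for (unsigned char c : s) {
        h ^= c;
        h *= 1099511628211ULL;
    }
    return h;
}

// Compute the hash vector (H_1, ..., H_k), each value reduced to a bit index in [0, m).
inline std::vector<std::size_t> hash_vector(const std::string& s, std::size_t m, std::size_t k) {
    assert(m > 0);
    std::vector<std::size_t> v;
    for (std::size_t i = 0; i < k; ++i) {
        v.push_back(static_cast<std::size_t>(seeded_hash(s, i) % m));
    }
    return v;
}

// Add: set every bit named by the hash vector. Costs k hash computations.
inline void bloom_add(std::vector<bool>& bits, std::size_t k, const std::string& s) {
    for (std::size_t pos : hash_vector(s, bits.size(), k)) {
        bits[pos] = true;
    }
}
```

A caller would create the bit array with `std::vector<bool> bits(m, false)` and call `bloom_add(bits, k, element)` once per element.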

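Query operation

As with adding, first compute the hash vector. If the bit for every hash value is 1, the element is reported as present (subject to false positives); if any bit is 0, the element is definitely not in the set. The time complexity is the same as for the add operation.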
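A minimal sketch of add and query together, assuming a hypothetical class `TinyBloom` and a seeded FNV-1a hash in place of k independent hash functions (neither comes from the original article):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

class TinyBloom {
public:
    TinyBloom(std::size_t m, std::size_t k) : bits_(m, false), k_(k) {
        assert(m > 0 && k > 0);
    }

    void add(const std::string& s) {
        for (std::size_t i = 0; i < k_; ++i) bits_[index(s, i)] = true;
    }

    // Query: false means "definitely absent", true means "probably present".
    bool might_contain(const std::string& s) const {
        for (std::size_t i = 0; i < k_; ++i) {
            if (!bits_[index(s, i)]) return false;  // a single 0 bit proves absence
        }
        return true;
    }

private:
    // Hypothetical i-th hash function: seeded FNV-1a reduced modulo m.
    std::size_t index(const std::string& s, std::size_t i) const {
        std::uint64_t h = 14695981039346656037ULL ^ (i * 0x9E3779B97F4A7C15ULL);
        for (unsigned char c : s) {
            h ^= c;
            h *= 1099511628211ULL;
        }
        return static_cast<std::size_t>(h % bits_.size());
    }

    std::vector<bool> bits_;
    std::size_t k_;
};
```

The query stops at the first 0 bit, so lookups for absent elements are often cheaper than k hash computations.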

Example

Figure 2 shows a Bloom Filter with m=16 and k=2, into which two elements with hash values (3, 6) and (10, 3) have been inserted.

Figure 2
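The state of this example can be reproduced directly from the hash values, without any real hash function. The helper names below (`Bits16`, `add_hashes`, `query_hashes`, `example_filter`) are hypothetical and exist only for this sketch.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <string>

typedef std::bitset<16> Bits16;  // m = 16

// Add an element given its precomputed hash vector (h1, h2), i.e. k = 2.
inline void add_hashes(Bits16& bits, std::size_t h1, std::size_t h2) {
    assert(h1 < bits.size() && h2 < bits.size());
    bits.set(h1);
    bits.set(h2);
}

// Query an element given its hash vector: present only if both bits are 1.
inline bool query_hashes(const Bits16& bits, std::size_t h1, std::size_t h2) {
    return bits.test(h1) && bits.test(h2);
}

// State of the filter after inserting the two elements from the example.
inline Bits16 example_filter() {
    Bits16 bits;                // all 16 bits start at 0
    add_hashes(bits, 3, 6);     // first element
    add_hashes(bits, 10, 3);    // second element; bit 3 is shared with the first
    return bits;
}
```

Only three bits (3, 6 and 10) end up set, because both elements share bit 3. `example_filter().to_string()` gives `0000010001001000`, with bit 15 printed first.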

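False Positive

If an element is not in the Bloom Filter but the bits at all of its hash positions have already been set to 1, the query wrongly reports it as present. This is a false positive.

Continuing the example:

Figure 3

This is essentially the same phenomenon as a collision in a hash table. A hash table resolves collisions with open hashing (separate chaining) or closed hashing (open addressing), whereas a Bloom Filter simply lets them happen and concerns itself with how likely a false positive is.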
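As a small sketch of such a case, take the filter with m=16 and k=2 holding elements with hash values (3, 6) and (10, 3), and assume a third element that was never inserted but happens to hash to (6, 10). These last two values are an illustrative assumption, and the function names are hypothetical.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>

// True when both bits of a k = 2 hash vector are set.
inline bool both_set(const std::bitset<16>& bits, std::size_t h1, std::size_t h2) {
    return bits.test(h1) && bits.test(h2);
}

// Returns true if an element that was never inserted, with hash vector (h1, h2),
// would nevertheless be reported as present by the example filter.
inline bool is_false_positive_in_example(std::size_t h1, std::size_t h2) {
    assert(h1 < 16 && h2 < 16);
    std::bitset<16> bits;
    bits.set(3);  bits.set(6);   // first inserted element:  (3, 6)
    bits.set(10); bits.set(3);   // second inserted element: (10, 3)
    return both_set(bits, h1, h2);
}
```

Here `is_false_positive_in_example(6, 10)` is true: bit 6 was set by the first element and bit 10 by the second, so the filter cannot distinguish the new element from a stored one.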

Probability

Qualitatively, we can draw the following conclusions:

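Parameter	Variable	When decreased	When increased
Number of hash functions	k	Fewer hash computations; higher false-positive probability (while k is below the optimum)	More hash computations; fewer bits remain 0
Bloom Filter size	m	Less memory; higher false-positive probability	More memory; lower false-positive probability
Number of elements	n	Lower false-positive probability	Higher false-positive probability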

The false-positive probability is approximately $F=\left(1-e^{-\frac{kn}{m}}\right)^{k}$.
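The formula follows from a standard approximation, sketched here under the usual assumption that every hash value is uniformly distributed over the m bits and independent of the others.

```latex
\begin{align*}
P(\text{a given bit is still 0 after } n \text{ insertions})
  &= \left(1-\frac{1}{m}\right)^{kn} \approx e^{-\frac{kn}{m}} \\
F = P(\text{all } k \text{ probed bits are 1})
  &\approx \left(1-e^{-\frac{kn}{m}}\right)^{k}
\end{align*}
```

Each insertion sets k bits, so after n insertions a given bit has had kn chances to be set; a false positive needs all k bits probed by the query to be 1.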

Given m and n, the false-positive probability is minimized by choosing $k=\left[\ln 2\cdot \frac{m}{n}\right]$, where the brackets denote rounding to an integer.
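Both formulas are easy to evaluate numerically. The following sketch uses hypothetical function names and assumes rounding to the nearest integer for k, which is one reasonable reading of the bracket notation.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// F = (1 - e^{-kn/m})^k
inline double false_positive_rate(std::size_t m, std::size_t n, std::size_t k) {
    assert(m > 0);
    const double exponent =
        -static_cast<double>(k) * static_cast<double>(n) / static_cast<double>(m);
    return std::pow(1.0 - std::exp(exponent), static_cast<double>(k));
}

// k = [ln 2 * m / n], rounded to the nearest integer and never less than 1.
inline std::size_t optimal_k(std::size_t m, std::size_t n) {
    assert(n > 0);
    const long k = std::lround(std::log(2.0) * static_cast<double>(m) / static_cast<double>(n));
    return k < 1 ? 1 : static_cast<std::size_t>(k);
}
```

With 10 bits per element (m = 10000, n = 1000), `optimal_k` returns 7 and `false_positive_rate` is about 0.0082, i.e. a little under 1%.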

Data

Figure 4

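Extensions

Counter Bloom Filter

One drawback of a Bloom Filter is that it does not support deletion, because it cannot tell which hash vectors a given bit belongs to. We can therefore attach a counter to each position of the Bloom Filter (a structure more commonly called a Counting Bloom Filter), incrementing the counters on insertion and decrementing them on deletion.

Such a filter has to account for the size of the additional counters: if the same element is inserted many times and the counters have only a few bits, they can overflow. Capping the counters at a maximum value can lead to cache misses, but for some applications, such as Web cache sharing, this is not a problem.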
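A counter-based filter might look like the sketch below. The class name, the cap of 15 (as if each counter were 4 bits wide) and the seeded FNV-1a hash are all assumptions made for illustration. Counters saturate at the cap and are still decremented on removal, so once a counter has overflowed the filter may later report a stored element as absent, which corresponds to the cache miss mentioned above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

class CountingBloom {
public:
    CountingBloom(std::size_t m, std::size_t k) : counters_(m, 0), k_(k) {
        assert(m > 0 && k > 0);
    }

    void add(const std::string& s) {
        for (std::size_t i = 0; i < k_; ++i) {
            std::uint8_t& c = counters_[index(s, i)];
            if (c < kMax) ++c;  // saturate at the cap instead of wrapping to 0
        }
    }

    // Precondition: s was previously added; removing a stranger corrupts the filter.
    void remove(const std::string& s) {
        for (std::size_t i = 0; i < k_; ++i) {
            std::uint8_t& c = counters_[index(s, i)];
            if (c > 0) --c;
        }
    }

    bool might_contain(const std::string& s) const {
        for (std::size_t i = 0; i < k_; ++i) {
            if (counters_[index(s, i)] == 0) return false;
        }
        return true;
    }

private:
    enum { kMax = 15 };  // assumed cap: the largest value of a 4-bit counter

    // Hypothetical i-th hash function: seeded FNV-1a reduced modulo m.
    std::size_t index(const std::string& s, std::size_t i) const {
        std::uint64_t h = 14695981039346656037ULL ^ (i * 0x9E3779B97F4A7C15ULL);
        for (unsigned char c : s) {
            h ^= c;
            h *= 1099511628211ULL;
        }
        return static_cast<std::size_t>(h % counters_.size());
    }

    std::vector<std::uint8_t> counters_;
    std::size_t k_;
};
```

This sketch spends a whole byte per counter for simplicity; packing two 4-bit counters per byte would halve the memory at the cost of some bit manipulation.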

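Compressed Bloom Filter

To transfer a Bloom Filter between servers more quickly over the network, the filter can be compressed once it has been fully built and its actual parameters are known.

After all elements have been added, we know the real space utilization of the filter. Plugging this value into the formula yields a size smaller than m; the Bloom Filter is then rebuilt at that size by reducing the original hash values modulo the new size, so that its memory footprint fits the data better while the false-positive rate stays at the target.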
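One way to realize the modulo step without re-hashing the elements is to fold the bit array, assuming the new size divides the old one (a restriction added for this sketch, not stated in the text). In that case (h mod m) mod m' equals h mod m', so OR-ing bit p into position p mod m' gives exactly the filter that would have been built at size m'. The function names and the seeded FNV-1a hash are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical i-th hash function: seeded FNV-1a (not reduced modulo m yet).
inline std::uint64_t seeded_hash(const std::string& s, std::uint64_t i) {
    std::uint64_t h = 14695981039346656037ULL ^ (i * 0x9E3779B97F4A7C15ULL);
    for (unsigned char c : s) {
        h ^= c;
        h *= 1099511628211ULL;
    }
    return h;
}

inline void bloom_add(std::vector<bool>& bits, std::size_t k, const std::string& s) {
    assert(!bits.empty());
    for (std::size_t i = 0; i < k; ++i) {
        bits[static_cast<std::size_t>(seeded_hash(s, i) % bits.size())] = true;
    }
}

// Fold a filter of m bits down to new_m bits, where new_m must divide m.
// Bit p of the old filter lands on bit p % new_m of the new one.
inline std::vector<bool> fold(const std::vector<bool>& bits, std::size_t new_m) {
    assert(new_m > 0 && bits.size() % new_m == 0);
    std::vector<bool> out(new_m, false);
    for (std::size_t p = 0; p < bits.size(); ++p) {
        if (bits[p]) out[p % new_m] = true;
    }
    return out;
}
```

Queries against the folded filter must reduce the hash values modulo the new size as well. Folding raises the fraction of bits set to 1, so the new size should be checked against the false-positive formula before the smaller filter is adopted.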

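Applications

Speeding up queries

This suits key-value storage systems in which the values live on disk, where every lookup is expensive.

Insert every key held in Storage into the Filter. When the Filter reports that a key is absent, there is no need to query Storage at all.

A false positive merely costs one unnecessary Storage lookup.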
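The lookup path can be sketched as follows. `FilteredStore` is a hypothetical class: an in-memory `std::unordered_map` stands in for the slow on-disk Storage, and `storage_reads()` counts how often that Storage is actually consulted.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

class FilteredStore {
public:
    explicit FilteredStore(std::size_t m = 1024, std::size_t k = 3)
        : bits_(m, false), k_(k), storage_reads_(0) {
        assert(m > 0 && k > 0);
    }

    void put(const std::string& key, const std::string& value) {
        for (std::size_t i = 0; i < k_; ++i) bits_[index(key, i)] = true;
        storage_[key] = value;
    }

    // Returns true and fills *value when the key exists.
    bool get(const std::string& key, std::string* value) {
        for (std::size_t i = 0; i < k_; ++i) {
            if (!bits_[index(key, i)]) return false;  // definitely absent: skip Storage
        }
        ++storage_reads_;  // real hit or false positive: pay for one Storage lookup
        std::unordered_map<std::string, std::string>::const_iterator it = storage_.find(key);
        if (it == storage_.end()) return false;  // it was a false positive
        *value = it->second;
        return true;
    }

    std::size_t storage_reads() const { return storage_reads_; }

private:
    // Hypothetical i-th hash function: seeded FNV-1a reduced modulo m.
    std::size_t index(const std::string& key, std::size_t i) const {
        std::uint64_t h = 14695981039346656037ULL ^ (i * 0x9E3779B97F4A7C15ULL);
        for (unsigned char c : key) {
            h ^= c;
            h *= 1099511628211ULL;
        }
        return static_cast<std::size_t>(h % bits_.size());
    }

    std::vector<bool> bits_;
    std::size_t k_;
    std::size_t storage_reads_;
    std::unordered_map<std::string, std::string> storage_;  // stand-in for disk
};
```

The result of `get` is always correct, because Storage has the final word; the filter only decides whether the expensive lookup is worth making.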

Figure 5

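- Google's BigTable also uses Bloom Filters to reduce disk lookups for rows or columns that do not exist, which greatly improves the performance of database queries.

- Many proxy caches built around the Internet Cache Protocol use Bloom Filters to store URLs; besides fast lookups, this makes it easy to transmit and exchange cache information.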

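Network applications

- Resource lookup in P2P networks: keep a Bloom Filter for each network path, and route the request down a path when its filter reports a hit.

- When broadcasting messages, check whether a given IP has already sent a packet.

- Detecting loops for broadcast packets: store a Bloom Filter in the packet, and have each node add itself to the filter.

- Message queue management: use a Counter Bloom Filter to manage message traffic.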

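Spam address filtering

This example comes from the Google China Blog (Google 黑板報).

Public email providers such as NetEase and QQ always need to filter out junk mail sent by spammers.

One approach is to record the email addresses that send spam. Because the senders keep registering new addresses, there are at least several billion spam addresses worldwide, and storing them all would take a large number of network servers.

With a hash table, every 100 million email addresses require 1.6 GB of memory. (Concretely, each email address is mapped to an eight-byte fingerprint, and the fingerprints are stored in the hash table. Since the storage efficiency of a hash table is typically only 50%, each email address takes up sixteen bytes, so 100 million addresses take about 1.6 GB, that is, 1.6 billion bytes of memory.) Storing several billion addresses could therefore require hundreds of GB of memory.

A Bloom Filter needs only 1/8 to 1/4 of the size of the hash table to solve the same problem.

A Bloom Filter never misses a suspicious address that is on the blacklist. As for false positives, a common remedy is to build an additional small whitelist of the addresses that might be misjudged.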
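The arithmetic behind these figures can be written out as follows, for 100 million addresses; the bits-per-address line is derived from the stated fractions rather than given in the source.

```latex
\begin{align*}
\text{hash table: } & 10^{8} \times \frac{8\ \text{bytes}}{50\%}
   = 10^{8} \times 16\ \text{bytes} = 1.6 \times 10^{9}\ \text{bytes} = 1.6\ \text{GB} \\
\text{Bloom Filter: } & \tfrac{1}{8} \times 1.6\ \text{GB} = 0.2\ \text{GB}
   \quad\text{to}\quad \tfrac{1}{4} \times 1.6\ \text{GB} = 0.4\ \text{GB} \\
\text{bits per address: } & \frac{0.2 \times 10^{9} \times 8}{10^{8}} = 16
   \quad\text{to}\quad \frac{0.4 \times 10^{9} \times 8}{10^{8}} = 32
\end{align*}
```

In other words, the filter spends roughly 16 to 32 bits per address instead of the 128 bits used by the hash table.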

References

[1] Bloom filter; http://en.wikipedia.org/wiki/Bloom_filter

[2] Summary Cache: A Scalable Wide-Area Web Cache Sharing Protocol; http://pages.cs.wisc.edu/~cao/papers/summary-cache/

[3] Network Applications of Bloom Filters: A Survey; http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.127.9672&rep=rep1&type=pdf

[4] An Examination of Bloom Filters and their Applications; http://cs.unc.edu/~fabian/courses/CS600.624/slides/bloomslides.pdf

[5] The Beauty of Mathematics, Part 21: Bloom Filter (數學之美系列二十一－布隆過濾器); http://www.google.com.hk/ggblog/googlechinablog/2007/07/bloom-filter_7469.html

    /*
     * File:    bloomfilter.h
     * Created: 2010/10/31
     * Author:  Huang.WisKey
     * E-Mail:  sir.huangwei[at]gmail.com
     * Brief:
     *
     * May you do good and not evil.
     * May you find forgiveness for yourself and forgive others.
     * May you share freely, never taking more than you give.
     */

    #pragma once

    #ifndef __BLOOMFILTER_H__
    #define __BLOOMFILTER_H__

    #include <stdlib.h>  // rand, srand
    #include <string.h>  // memset, memcpy
    #include <time.h>    // time
    #include <math.h>    // pow, exp, log

    #ifndef NULL
    #  ifdef __cplusplus
    #    define NULL 0
    #  else
    #    define NULL ((void *)0)
    #  endif
    #endif

    // String hash functions, defined inline at the end of this header.
    inline unsigned int string_SDBM_hash(const char* _str);
    inline unsigned int string_RS_hash(const char* _str);
    inline unsigned int string_JS_hash(const char* _str);
    inline unsigned int string_PJW_hash(const char* _str);
    inline unsigned int string_ELF_hash(const char* _str);
    inline unsigned int string_BKDR_hash(const char* _str);
    inline unsigned int string_DJB_hash(const char* _str);
    inline unsigned int string_AP_hash(const char* _str);

    template <typename T>
    class bloomfilter
    {
    public:
        typedef unsigned int hash_key;
        typedef hash_key (*hash_func_type)(const T&);
        typedef unsigned int cell_type;
        typedef unsigned int size_type;

    protected:
        // Number of bits stored in one cell of the bit table.
        enum { bits_per_cell = sizeof(cell_type) * 8 };

        // Default hash: assumes T is a pointer to a NUL-terminated string (e.g. const char*).
        static hash_key default_hash_func(const T& _obj)
        {
            return string_AP_hash(reinterpret_cast<const char*>(_obj));
        }

    public:
        bloomfilter(size_type _elem_size,
                    double _prob_false_positive,
                    unsigned int _rand_seed = static_cast<unsigned int>(time(NULL)),
                    hash_func_type _hash_func = default_hash_func)
            : bit_table_(NULL), table_size_(0),
              hash_func_(_hash_func ? _hash_func : default_hash_func),
              elem_bit_size_(0), elem_bit_randoms_(NULL), elem_size_(0),
              randoms_seed_(_rand_seed)
        {
            _optimal_parameters(_elem_size, _prob_false_positive);
            _generate_random();
            bit_table_ = new cell_type[cell_size()];
            memset(bit_table_, 0, size());
        }

        bloomfilter(const bloomfilter& _obj)
        {
            // bit table
            table_size_ = _obj.table_size_;
            bit_table_ = new cell_type[cell_size()];
            memcpy(bit_table_, _obj.bit_table_, size());
            // hash func
            hash_func_ = _obj.hash_func_;
            // elem
            elem_bit_size_ = _obj.elem_bit_size_;
            elem_bit_randoms_ = new hash_key[elem_bit_size_];
            memcpy(elem_bit_randoms_, _obj.elem_bit_randoms_, elem_bit_size_ * sizeof(hash_key));
            elem_size_ = _obj.elem_size_;
            // seed (compared by the set operators below)
            randoms_seed_ = _obj.randoms_seed_;
        }

        virtual ~bloomfilter()
        {
            delete[] bit_table_;
            bit_table_ = NULL;
            delete[] elem_bit_randoms_;
            elem_bit_randoms_ = NULL;
            hash_func_ = NULL;
            table_size_ = 0;
            elem_bit_size_ = 0;
            elem_size_ = 0;
        }

        void insert(const T& _obj)
        {
            hash_key p, b;
            hash_key hash = hash_func_(_obj);
            bool exist = true;
            for (unsigned int i = 0; i < elem_bit_size_; i++)
            {
                _get_pos(elem_bit_randoms_[i] * hash, p, b);
                const cell_type mask = static_cast<cell_type>(1) << b;
                exist = exist && (bit_table_[p] & mask) != 0;
                bit_table_[p] |= mask;
            }
            // Count the element only if it did not already appear to be present.
            elem_size_ += !exist;
        }

        bool find(const T& _obj) const
        {
            hash_key p, b;
            hash_key hash = hash_func_(_obj);
            bool exist = true;
            for (unsigned int i = 0; i < elem_bit_size_ && exist; i++)
            {
                _get_pos(elem_bit_randoms_[i] * hash, p, b);
                exist = (bit_table_[p] & (static_cast<cell_type>(1) << b)) != 0;
            }
            return exist;
        }

        double effective_fpp() const
        {
            /*
              Note:
              The effective false positive probability is calculated using the
              designated table size and hash function count in conjunction with
              the current number of inserted elements - not the user defined
              predicated/expected number of inserted elements.
            */
            return pow(1.0 - exp(-1.0 * elem_bit_size_ * elem_size_ / table_size_), 1.0 * elem_bit_size_);
        }

        /* size of the bit table in bytes (table_size_ itself is counted in bits) */
        size_type size() const { return table_size_ / 8; }

        /* number of cells in the bit table */
        size_type cell_size() const { return table_size_ / bits_per_cell; }

        /* number of distinct elements inserted (as far as the filter can tell) */
        size_type count() const { return elem_size_; }

        void clear()
        {
            memset(bit_table_, 0, size());
            elem_size_ = 0;
        }

        const cell_type* table() const { return bit_table_; }

        bloomfilter& operator &= (const bloomfilter& _obj)
        {
            /* intersection */
            if ( (elem_bit_size_ == _obj.elem_bit_size_)
              && (table_size_ == _obj.table_size_)
              && (randoms_seed_ == _obj.randoms_seed_)
              && (hash_func_ == _obj.hash_func_) )
            {
                size_type cells = cell_size();
                for (size_type i = 0; i < cells; ++i)
                    bit_table_[i] &= _obj.bit_table_[i];
            }
            return *this;
        }

        bloomfilter& operator |= (const bloomfilter& _obj)
        {
            /* union */
            if ( (elem_bit_size_ == _obj.elem_bit_size_)
              && (table_size_ == _obj.table_size_)
              && (randoms_seed_ == _obj.randoms_seed_)
              && (hash_func_ == _obj.hash_func_) )
            {
                size_type cells = cell_size();
                for (size_type i = 0; i < cells; ++i)
                    bit_table_[i] |= _obj.bit_table_[i];
            }
            return *this;
        }

        bloomfilter& operator ^= (const bloomfilter& _obj)
        {
            /* difference */
            if ( (elem_bit_size_ == _obj.elem_bit_size_)
              && (table_size_ == _obj.table_size_)
              && (randoms_seed_ == _obj.randoms_seed_)
              && (hash_func_ == _obj.hash_func_) )
            {
                size_type cells = cell_size();
                for (size_type i = 0; i < cells; ++i)
                    bit_table_[i] ^= _obj.bit_table_[i];
            }
            return *this;
        }

    private:
        // Copy assignment is not implemented; it is declared private so that an
        // accidental assignment cannot make two filters share (and double-free) one table.
        bloomfilter& operator = (const bloomfilter&);

    protected:
        void _optimal_parameters(unsigned int _elem_size_prob, double _prob_false_positive)
        {
            /*
              Note:
              The following will attempt to find the number of hash functions
              and minimum amount of storage bits required to construct a bloom
              filter consistent with the user defined false positive probability
              and estimated element insertion count.
            */
            double min_m = 1e99;
            double min_k = 0.0;
            double curr_m = 0.0;
            // k starts at 1: with k = 0 the expression below evaluates to NaN.
            for (double k = 1.0; k < 1000.0; ++k)
            {
                curr_m = (-k * _elem_size_prob) / log(1.0 - pow(_prob_false_positive, 1.0 / k));
                if (curr_m < min_m)
                {
                    min_m = curr_m;
                    min_k = k;
                }
            }
            elem_bit_size_ = static_cast<size_type>(min_k);
            table_size_ = static_cast<size_type>(min_m);
            // Use at least one bit per expected element, rounded up to whole cells.
            table_size_ = ((table_size_ > _elem_size_prob ? table_size_ : _elem_size_prob)
                           / bits_per_cell + 1) * bits_per_cell;
        }

        void _generate_random()
        {
            elem_bit_randoms_ = new hash_key[elem_bit_size_];
            srand(randoms_seed_);
            for (unsigned int i = 0; i < elem_bit_size_; i++)
            {
                elem_bit_randoms_[i] = rand();
            }
        }

        // Map a hash value to a cell index and a bit index inside that cell.
        // The original divided by sizeof(hash_key), which used only 4 bits of each cell.
        void _get_pos(hash_key _hash, hash_key& _cell, hash_key& _bit) const
        {
            _hash %= table_size_;
            _cell = _hash / bits_per_cell;
            _bit = _hash % bits_per_cell;
        }

    protected:
        cell_type* bit_table_;        // the bit array
        size_type table_size_;        // m: number of bits in the table
        hash_func_type hash_func_;    // base hash function
        size_type elem_bit_size_;     // k: number of hash values per element
        hash_key* elem_bit_randoms_;  // k random multipliers applied to the base hash
        size_type elem_size_;         // n: number of elements inserted so far
        unsigned int randoms_seed_;   // seed used to generate the multipliers
    };

    // SDBM Hash Function
    inline unsigned int string_SDBM_hash(const char* _str)
    {
        unsigned int hash = 0;
        while (*_str)
        {
            // equivalent to: hash = 65599*hash + (*_str++);
            hash = (*_str++) + (hash << 6) + (hash << 16) - hash;
        }
        return (hash & 0x7FFFFFFF);
    }

    // RS Hash Function
    inline unsigned int string_RS_hash(const char* _str)
    {
        unsigned int b = 378551;
        unsigned int a = 63689;
        unsigned int hash = 0;
        while (*_str)
        {
            hash = hash * a + (*_str++);
            a *= b;
        }
        return (hash & 0x7FFFFFFF);
    }

    // JS Hash Function
    inline unsigned int string_JS_hash(const char* _str)
    {
        unsigned int hash = 1315423911;
        while (*_str)
        {
            hash ^= ((hash << 5) + (*_str++) + (hash >> 2));
        }
        return (hash & 0x7FFFFFFF);
    }

    // P. J. Weinberger Hash Function
    inline unsigned int string_PJW_hash(const char* _str)
    {
        unsigned int BitsInUnignedInt = (unsigned int)(sizeof(unsigned int) * 8);
        unsigned int ThreeQuarters = (unsigned int)((BitsInUnignedInt * 3) / 4);
        unsigned int OneEighth = (unsigned int)(BitsInUnignedInt / 8);
        unsigned int HighBits = (unsigned int)(0xFFFFFFFF) << (BitsInUnignedInt - OneEighth);
        unsigned int hash = 0;
        unsigned int test = 0;
        while (*_str)
        {
            hash = (hash << OneEighth) + (*_str++);
            if ((test = hash & HighBits) != 0)
            {
                hash = ((hash ^ (test >> ThreeQuarters)) & (~HighBits));
            }
        }
        return (hash & 0x7FFFFFFF);
    }

    // ELF Hash Function
    inline unsigned int string_ELF_hash(const char* _str)
    {
        unsigned int hash = 0;
        unsigned int x = 0;
        while (*_str)
        {
            hash = (hash << 4) + (*_str++);
            if ((x = hash & 0xF0000000L) != 0)
            {
                hash ^= (x >> 24);
                hash &= ~x;
            }
        }
        return (hash & 0x7FFFFFFF);
    }

    // BKDR Hash Function
    inline unsigned int string_BKDR_hash(const char* _str)
    {
        unsigned int seed = 131; // 31 131 1313 13131 131313 etc..
        unsigned int hash = 0;
        while (*_str)
        {
            hash = hash * seed + (*_str++);
        }
        return (hash & 0x7FFFFFFF);
    }

    // DJB Hash Function
    inline unsigned int string_DJB_hash(const char* _str)
    {
        unsigned int hash = 5381;
        while (*_str)
        {
            hash += (hash << 5) + (*_str++);
        }
        return (hash & 0x7FFFFFFF);
    }

    // AP Hash Function
    inline unsigned int string_AP_hash(const char* _str)
    {
        unsigned int hash = 0;
        for (int i = 0; *_str; i++)
        {
            if ((i & 1) == 0)
                hash ^= ((hash << 7) ^ (*_str++) ^ (hash >> 3));
            else
                hash ^= (~((hash << 11) ^ (*_str++) ^ (hash >> 5)));
        }
        return (hash & 0x7FFFFFFF);
    }

    #endif // __BLOOMFILTER_H__
