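Top-K queries come up often in big-data problems. Here the setting is a data stream: we want the top-K value among the most recent items of the stream. Below I propose one possible design.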
The design is sketched as follows:
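The stream flows into a processing module that holds a min-heap and a max-heap. By maintaining the sizes of the two heaps and the ordering between them, the module always holds a fixed number of the most recent items. The min-heap keeps the K largest of them and the max-heap keeps the rest, so the head of the min-heap is the top-K value of the window.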
The implementation is as follows:
import java.util.Comparator;
import java.util.Objects;
import java.util.PriorityQueue;
import java.util.Queue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class TopKUtils<T> {
    static final int MAXIMUM_CAPACITY = 1 << 30;

    private final ReentrantLock lock = new ReentrantLock(true); // fair lock
    private final int initialCapacity; // window size: number of most recent items kept
    private final int minHeapNum;      // K: number of largest items held by the min-heap
    private final int maxHeapNum;      // number of remaining items held by the max-heap
    private int curIndex;              // next slot of the circular buffer to overwrite
    private final Comparator<T> comparator;
    private Queue<T> minHeap;          // the K largest items; its head is the K-th largest
    private Queue<T> maxHeap;          // all other items; its head is the largest of them
    private Object[] dataList;         // circular buffer of the window in arrival order

    /**
     * Constructor.
     * @param initialCapacity window size (number of most recent items kept)
     * @param loadFactor rank of the top-K item as a fraction of the capacity,
     *                   i.e. K = round(initialCapacity * loadFactor)
     * @param comparator ordering used to compare items
     */
    public TopKUtils(int initialCapacity, float loadFactor, Comparator<T> comparator) {
        // validate the arguments
        if (initialCapacity < 0)
            throw new IllegalArgumentException("Illegal initial capacity: " + initialCapacity);
        if (initialCapacity > MAXIMUM_CAPACITY)
            initialCapacity = MAXIMUM_CAPACITY;
        if (loadFactor <= 0 || loadFactor >= 1 || Float.isNaN(loadFactor))
            throw new IllegalArgumentException("Illegal load factor: " + loadFactor);
        if (comparator == null)
            throw new IllegalArgumentException("Illegal comparator");
        this.initialCapacity = initialCapacity;
        this.minHeapNum = Math.round(initialCapacity * loadFactor);
        this.maxHeapNum = initialCapacity - minHeapNum;
        // both heaps must hold at least one item (PriorityQueue also rejects a capacity below 1)
        if (minHeapNum < 1 || maxHeapNum < 1)
            throw new IllegalArgumentException("Capacity " + initialCapacity
                    + " and load factor " + loadFactor + " leave one heap empty");
        this.comparator = comparator;
        init();
    }

    /**
     * Allocates the two heaps and the circular buffer.
     */
    private void init() {
        // The min-heap uses the comparator as is; the max-heap uses the reversed order.
        // reversed() avoids negating the result, which is wrong for Integer.MIN_VALUE.
        this.minHeap = new PriorityQueue<>(this.minHeapNum, this.comparator);
        this.maxHeap = new PriorityQueue<>(this.maxHeapNum, this.comparator.reversed());
        // circular buffer of the last initialCapacity items, used to find the expiring item
        this.dataList = new Object[this.initialCapacity];
    }

    /**
     * Adds an item to the window, evicting the oldest one once the window is full.
     * The item is dropped if the lock cannot be acquired within one second.
     */
    @SuppressWarnings("unchecked")
    public void push(T t) {
        // null marks an empty buffer slot, so it cannot be stored as a value
        Objects.requireNonNull(t, "t");
        try {
            // acquire the lock, waiting at most 1 s
            if (lock.tryLock(1, TimeUnit.SECONDS)) {
                try {
                    // the item currently in this slot is the one about to expire
                    T curValue = (T) this.dataList[this.curIndex];
                    if (curValue == null) {
                        // empty slot: the window is not full yet
                        if (this.minHeapNum > this.minHeap.size()) {
                            this.minHeap.add(t);
                        } else {
                            addInOrder(t);
                        }
                    } else {
                        // remove the expired item from whichever heap holds it
                        if (this.minHeap.remove(curValue)) {
                            // it was among the K largest: refill from the max-heap
                            this.minHeap.add(this.maxHeap.poll());
                        } else {
                            this.maxHeap.remove(curValue);
                        }
                        addInOrder(t);
                    }
                    // store the new item and advance the circular index
                    this.dataList[this.curIndex] = t;
                    this.curIndex = (this.curIndex + 1) % this.initialCapacity;
                } finally {
                    lock.unlock();
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
        } catch (RuntimeException e) {
            // a failure here must not terminate the program
            e.printStackTrace();
        }
    }

    // Places t in the correct heap; assumes the min-heap currently holds exactly K items.
    private void addInOrder(T t) {
        if (this.comparator.compare(this.minHeap.peek(), t) > 0) {
            // t is smaller than the K-th largest: it belongs in the max-heap
            this.maxHeap.add(t);
        } else {
            // t joins the K largest; the previous K-th largest moves to the max-heap
            this.maxHeap.add(this.minHeap.poll());
            this.minHeap.add(t);
        }
    }

    /**
     * Returns the K-th largest item in the window (the smallest item while fewer
     * than K have been pushed), or null if nothing has been pushed yet.
     */
    public T getTopK() {
        lock.lock(); // PriorityQueue is not thread-safe, so reads are guarded as well
        try {
            // the min-heap is filled first, so it is empty only when the window is empty
            return this.minHeap.peek();
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) {
        Comparator<Float> defaultComparator = Float::compare;
        TopKUtils<Float> topKUtils = new TopKUtils<>(1000, 0.1f, defaultComparator);
        long start = System.currentTimeMillis();
        for (int i = 0; i < 100000; i++) {
            topKUtils.push((float) Math.random());
            float result = topKUtils.getTopK();
            // System.out.println(result);
        }
        long end = System.currentTimeMillis();
        System.out.println("Elapsed time: " + (end - start) + " ms");
    }
}
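One way to sanity-check a sliding-window top-K structure like the class above is to compare it against a brute-force reference that keeps the last N values in a deque and sorts a copy on every query. The sketch below is only an illustration and not part of the original class: `NaiveWindowTopK` and its members are hypothetical names, and `k` is assumed to correspond to `Math.round(initialCapacity * loadFactor)` in `TopKUtils`.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Deque;
import java.util.List;

// Hypothetical brute-force reference: O(n log n) per query, intended for testing only.
class NaiveWindowTopK<T> {
    private final int capacity;              // window size (most recent items kept)
    private final int k;                     // rank to report: the k-th largest
    private final Comparator<T> comparator;
    private final Deque<T> window = new ArrayDeque<>();

    NaiveWindowTopK(int capacity, int k, Comparator<T> comparator) {
        if (capacity < 1 || k < 1 || k > capacity)
            throw new IllegalArgumentException("need 1 <= k <= capacity");
        this.capacity = capacity;
        this.k = k;
        this.comparator = comparator;
    }

    void push(T t) {
        if (window.size() == capacity) window.removeFirst(); // evict the oldest item
        window.addLast(t);
    }

    // k-th largest item in the window, the smallest item while fewer than k
    // are present, or null for an empty window.
    T getTopK() {
        if (window.isEmpty()) return null;
        List<T> sorted = new ArrayList<>(window);
        sorted.sort(comparator.reversed());  // descending order
        return sorted.get(Math.min(k, sorted.size()) - 1);
    }
}
```

Pushing the same sequence into both structures and comparing `getTopK()` after each push should, as far as I can tell, give identical answers whenever 1 <= k < capacity; any mismatch points at a bookkeeping error in the two-heap version.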
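Removing a specific value from the priority queue (`remove(Object)`) requires a linear scan and costs O(n). Reading the head is O(1), while removing the head or inserting an item costs O(log n) because the heap has to be reordered. Each push is therefore O(n), where n is the window capacity.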
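There is still room for improvement: maintain a hashmap sized to the preset capacity, whose key is an item's position within the window of recent items and whose value is that item's index inside the priority queue. Removal then drops to O(log n), and so does the overall cost per operation. This requires writing a custom priority queue that keeps the hashmap entries up to date whenever elements move; only the idea is given here.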
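A minimal sketch of that idea, under a few assumptions of mine: the class name `IndexedHeap` and all of its methods are hypothetical, and because window positions are dense integers in 0..capacity-1, a plain `int[]` stands in for the hashmap. Each window slot remembers its current index in the heap array, so an arbitrary slot can be removed with one sift instead of a linear scan.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical sketch: a binary min-heap keyed by window slot with O(log n) removal.
class IndexedHeap<T> {
    private final Comparator<T> comparator;
    private final Object[] values; // values[slot] = item stored for that window slot
    private final int[] heap;      // heap[i] = slot stored at heap index i
    private final int[] pos;       // pos[slot] = heap index of that slot, or -1 if absent
    private int size;

    IndexedHeap(int capacity, Comparator<T> comparator) {
        this.comparator = comparator;
        this.values = new Object[capacity];
        this.heap = new int[capacity];
        this.pos = new int[capacity];
        Arrays.fill(pos, -1);
    }

    int size() { return size; }
    boolean contains(int slot) { return pos[slot] >= 0; }
    int peekSlot() { return heap[0]; } // caller must check size() > 0 first

    @SuppressWarnings("unchecked")
    T valueAt(int slot) { return (T) values[slot]; }

    void insert(int slot, T value) {
        if (contains(slot)) throw new IllegalStateException("slot already present: " + slot);
        values[slot] = value;
        int i = size++;
        place(i, slot);
        siftUp(i);
    }

    T remove(int slot) {
        int i = pos[slot];
        if (i < 0) throw new IllegalStateException("slot not present: " + slot);
        T removed = valueAt(slot);
        int last = heap[--size];   // slot that occupied the final heap position
        pos[slot] = -1;
        values[slot] = null;
        if (i < size) {            // the hole is not at the end: refill it and repair
            place(i, last);
            siftUp(i);
            siftDown(pos[last]);
        }
        return removed;
    }

    private void siftUp(int i) {
        while (i > 0) {
            int parent = (i - 1) / 2;
            if (!less(heap[i], heap[parent])) break;
            swap(i, parent);
            i = parent;
        }
    }

    private void siftDown(int i) {
        while (true) {
            int left = 2 * i + 1, right = left + 1, smallest = i;
            if (left < size && less(heap[left], heap[smallest])) smallest = left;
            if (right < size && less(heap[right], heap[smallest])) smallest = right;
            if (smallest == i) return;
            swap(i, smallest);
            i = smallest;
        }
    }

    private boolean less(int slotA, int slotB) {
        return comparator.compare(valueAt(slotA), valueAt(slotB)) < 0;
    }

    private void place(int i, int slot) { heap[i] = slot; pos[slot] = i; }

    private void swap(int i, int j) {
        int a = heap[i], b = heap[j];
        place(i, b);
        place(j, a);
    }
}
```

In the sliding-window setting, one such heap with the original comparator and one with the reversed comparator could replace the two `PriorityQueue` instances: the expiring item is identified by its slot (`curIndex`), `contains(slot)` tells which heap holds it, and `remove(slot)` then costs O(log n). Wiring this into `push` is left out here.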