The top-K problem over a fixed-size window in a data stream

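Top-K comes up often in big-data problems, but here the setting is a data stream: we want the top K among the most recent values only. I propose one design below.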

Schematic (the original image is not available):
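The stream flows into a processing module that holds a min-heap and a max-heap. The module maintains the sizes of the two heaps and the ordering between them so that together they always contain exactly the most recent fixed number of values. The min-heap holds the K largest values of the window and the max-heap holds the rest, so the top of the min-heap is the K-th largest value.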
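Before the windowed version, the two-heap invariant can be shown on a plain array. This is only a minimal sketch: the class name `KthLargestSketch` and the method `kthLargest` are hypothetical and are not part of the implementation in this post. It uses `int` values and natural ordering instead of a user-supplied comparator.

```java
import java.util.Collections;
import java.util.PriorityQueue;

// Hypothetical helper for illustration only; not part of TopKUtils.
class KthLargestSketch {

    // Returns the k-th largest value of data, assuming 1 <= k <= data.length.
    static int kthLargest(int[] data, int k) {
        // Min-heap: holds the k largest values seen so far; its top is the k-th largest.
        PriorityQueue<Integer> minHeap = new PriorityQueue<>();
        // Max-heap: holds every other value, largest on top.
        PriorityQueue<Integer> maxHeap = new PriorityQueue<>(Collections.reverseOrder());
        for (int value : data) {
            if (minHeap.size() < k) {
                minHeap.add(value);              // still filling the top-k side
            } else if (value <= minHeap.peek()) {
                maxHeap.add(value);              // not large enough to enter the top k
            } else {
                maxHeap.add(minHeap.poll());     // demote the current k-th largest
                minHeap.add(value);              // the new value joins the top k
            }
        }
        return minHeap.peek();
    }

    public static void main(String[] args) {
        int[] data = {5, 1, 9, 3, 7, 8};
        System.out.println(kthLargest(data, 3)); // prints 7
    }
}
```

For a static array the max-heap is never read. It matters once values expire: when an expired value leaves the min-heap, the top of the max-heap is the value that takes its place.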

The implementation is as follows:

import java.util.Comparator;
import java.util.PriorityQueue;
import java.util.Queue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class TopKUtils<T> {

    static final int MAXIMUM_CAPACITY = 1 << 30;
    private ReentrantLock lock = new ReentrantLock(true); // fair lock
    private int initialCapacity;
    private int minHeapNun;
    private int maxHeapNum;
    private int curIndex;
    private Comparator<T> comparator;
    private Queue<T> minHeap;
    private Queue<T> maxHeap;
    private Object[] dataList;

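    /**
     * Constructor.
     * @param initialCapacity   window size, i.e. how many of the most recent values are kept
     * @param loadFactor    fraction of the window that forms the top K, so K = round(initialCapacity * loadFactor)
     * @param comparator    comparator that defines the ordering of the values
     */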
    public TopKUtils(int initialCapacity, float loadFactor, Comparator<T> comparator) {
        // Validate the arguments; a capacity of 0 would make the ring-buffer index computation divide by zero.
        if (initialCapacity <= 0)
            throw new IllegalArgumentException("Illegal initial capacity: " +
                    initialCapacity);
        if (initialCapacity > MAXIMUM_CAPACITY)
            initialCapacity = MAXIMUM_CAPACITY;
        if (loadFactor <= 0 || loadFactor >= 1 || Float.isNaN(loadFactor))
            throw new IllegalArgumentException("Illegal load factor: " +
                    loadFactor);
        if (comparator == null)
            throw new IllegalArgumentException("Illegal comparator");
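        this.initialCapacity = initialCapacity;
        this.minHeapNun = Math.round(initialCapacity * loadFactor);
        this.maxHeapNum = initialCapacity - minHeapNun;
        // Both heaps need at least one slot: PriorityQueue rejects an initial capacity below 1,
        // and the push logic assumes that neither heap is empty once the window is full.
        if (this.minHeapNun < 1 || this.maxHeapNum < 1)
            throw new IllegalArgumentException("Capacity " + initialCapacity +
                    " and load factor " + loadFactor + " leave one heap empty");
        this.comparator = comparator;
        init();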
    }

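    /**
     * Initializes the two heaps and the ring buffer that tracks the window.
     */
    public void init() {
        final Comparator<T> comparator = this.comparator;
        // Comparator for the max-heap: the reverse of the min-heap ordering.
        // Swapping the arguments avoids the overflow that negating Integer.MIN_VALUE would cause.
        Comparator<T> reversedComparator = new Comparator<T>() {
            @Override
            public int compare(T c1, T c2) {
                return comparator.compare(c2, c1);
            }
        };
        // Create the min-heap and the max-heap; PriorityQueue serves as the heap implementation.
        this.minHeap = new PriorityQueue<>(this.minHeapNun, comparator);
        this.maxHeap = new PriorityQueue<>(this.maxHeapNum, reversedComparator);
        // Ring buffer holding the last initialCapacity values, used to find and evict the expired one.
        this.dataList = new Object[this.initialCapacity];
    }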

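    /**
     * Adds a value to the window, evicting the oldest value once the window is full.
     * If the lock cannot be acquired within 1 second, the value is dropped.
     */
    @SuppressWarnings("unchecked")
    public void push(T t) {
        try {
            // Acquire the lock, giving up after 1 second.
            if (lock.tryLock(1, TimeUnit.SECONDS)) {
                try {
                    // Value currently stored in this ring-buffer slot; it is the one about to expire.
                    T curValue = (T) this.dataList[this.curIndex];
                    // A null slot means the window is not full yet.
                    if (curValue == null) {
                        if (this.minHeapNun > this.minHeap.size()) {
                            this.minHeap.add(t);
                        } else {
                            addInOrder(t);
                        }
                    } else {
                        // The window is full: remove the expired value from whichever heap holds it.
                        boolean isFromMinHeap = this.minHeap.remove(curValue);
                        if (isFromMinHeap) {
                            // Refill the min-heap with the largest value of the max-heap.
                            this.minHeap.add(this.maxHeap.poll());
                        } else {
                            this.maxHeap.remove(curValue);
                        }
                        addInOrder(t);
                    }
                    // Store the new value in the ring buffer and advance the index.
                    this.dataList[this.curIndex] = t;
                    this.curIndex = (this.curIndex + 1) % this.initialCapacity;
                } finally {
                    lock.unlock();
                }
            }
        } catch (InterruptedException e) {
            // Restore the interrupt flag so callers can still observe the interruption.
            Thread.currentThread().interrupt();
        } catch (Exception e) {
            // An exception here must not terminate the program.
            e.printStackTrace();
        }
    }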

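    public void addInOrder(T t) {
        // If the new value is smaller than the top of the min-heap, it is outside the top K
        // and goes to the max-heap.
        if (this.comparator.compare(this.minHeap.peek(), t) > 0) {
            this.maxHeap.add(t);
        } else {
            // Otherwise it displaces the current top of the min-heap, which moves to the max-heap.
            T peekMinHeap = this.minHeap.poll();
            this.minHeap.add(t);
            this.maxHeap.add(peekMinHeap);
        }
    }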

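    /**
     * Returns the K-th largest value in the current window (the top of the min-heap),
     * or null if nothing has been pushed yet. This read is not guarded by the lock.
     */
    public T getTopK() {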
        if (this.minHeap.size() > 0) {
            return this.minHeap.peek();
        } else if (this.maxHeap.size() > 0) {
            return this.maxHeap.peek();
        }
        return null;
    }

    public static void main(String[] args) {
        Comparator<Float> defaultComparator = new Comparator<Float>() {
            @Override
            public int compare(Float c1, Float c2) {
                return Float.compare(c1, c2);
            }
        };
        TopKUtils<Float> topKUtils = new TopKUtils<>(1000, 0.1f, defaultComparator);
        long start = System.currentTimeMillis();
        for (int i = 0; i < 100000; i++) {
            topKUtils.push((float) Math.random());
            float result = topKUtils.getTopK();
//            System.out.println(result);
        }
        long end = System.currentTimeMillis();
        System.out.println("Elapsed time: " + (end - start) + "ms");
    }
}

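In PriorityQueue, removing a specific value (remove(Object)) takes O(n), reading the top takes O(1), and restoring the heap order after removing the top or inserting a value takes O(log n). Each push removes the expired value by value, so each operation costs O(n), where n is the window size.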

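There is still room for improvement. One can maintain a hashmap sized to the preset capacity, where the key is a value's position within the most recent window and the value is its index inside the priority queue. Removing the expired value then drops to O(log n), and so does the overall cost per operation. This requires writing a custom priority queue that keeps the hashmap entries in sync whenever elements move; only the idea is offered here.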
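A rough sketch of that idea follows; it is an illustration under stated assumptions, not a drop-in replacement for the code above. The class `IndexedMinHeap` and all of its methods are hypothetical names. Because window positions are the integers 0 to capacity - 1, a plain `int[]` stands in for the hashmap, and values are `double` rather than a generic type to keep the sketch short.

```java
import java.util.Arrays;

// Hypothetical sketch: a min-heap over window slots 0..capacity-1 that tracks
// each slot's heap index, so any slot can be removed in O(log n).
class IndexedMinHeap {
    private final double[] values; // values[slot] = value stored for that window slot
    private final int[] heap;      // heap[i] = slot stored at heap index i
    private final int[] pos;       // pos[slot] = heap index of that slot, or -1 if absent
    private int size;

    IndexedMinHeap(int capacity) {
        values = new double[capacity];
        heap = new int[capacity];
        pos = new int[capacity];
        Arrays.fill(pos, -1);
    }

    int size() { return size; }

    boolean contains(int slot) { return pos[slot] >= 0; }

    // Slot holding the smallest value; the heap must not be empty.
    int peekSlot() { return heap[0]; }

    double valueOf(int slot) { return values[slot]; }

    // Inserts a slot that is not currently in the heap. O(log n).
    void add(int slot, double value) {
        values[slot] = value;
        heap[size] = slot;
        pos[slot] = size;
        siftUp(size++);
    }

    // Removes a slot that is currently in the heap. O(log n), no linear scan.
    void remove(int slot) {
        int i = pos[slot];
        int last = --size;
        pos[slot] = -1;
        if (i == last) return;        // it was the last heap entry, nothing to repair
        int moved = heap[last];       // fill the hole with the last heap entry
        heap[i] = moved;
        pos[moved] = i;
        siftUp(i);                    // at most one of these two calls moves anything
        siftDown(i);
    }

    private void siftUp(int i) {
        while (i > 0) {
            int parent = (i - 1) / 2;
            if (values[heap[parent]] <= values[heap[i]]) break;
            swap(i, parent);
            i = parent;
        }
    }

    private void siftDown(int i) {
        while (true) {
            int left = 2 * i + 1, right = left + 1, smallest = i;
            if (left < size && values[heap[left]] < values[heap[smallest]]) smallest = left;
            if (right < size && values[heap[right]] < values[heap[smallest]]) smallest = right;
            if (smallest == i) return;
            swap(i, smallest);
            i = smallest;
        }
    }

    // Swaps two heap entries and keeps the slot-to-index table in sync.
    private void swap(int a, int b) {
        int slotA = heap[a], slotB = heap[b];
        heap[a] = slotB; pos[slotB] = a;
        heap[b] = slotA; pos[slotA] = b;
    }

    public static void main(String[] args) {
        IndexedMinHeap h = new IndexedMinHeap(4);
        h.add(0, 0.7); h.add(1, 0.2); h.add(2, 0.9); h.add(3, 0.5);
        h.remove(1);                                 // evict the expired slot directly
        System.out.println(h.valueOf(h.peekSlot())); // prints 0.5
    }
}
```

To apply this to the design above, one would presumably keep two such heaps (the second with the comparison reversed) and, on each push, call `contains` and `remove` with the current ring-buffer index instead of searching for the expired value.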
