Is heap sort really faster than quicksort for computing the top-k word frequencies?

Original

2019-01-06 00:26

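At first I assumed heap sort would be the more efficient way to solve the top-k problem (which was naive of me). However, the code below reports the same elapsed time for both approaches, which suggests that on this input heap sort and quicksort take about the same time to compute the top-k.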
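Analysis: quicksort runs in O(n*log(n)) on average (the sorted() call in the code below is actually Timsort, which is also O(n*log(n))). Heap sort has two phases. The first phase builds the heap, which is the first loop of get_topk() in the code below: it sifts down every node starting from the last non-leaf node (around index n/2), so a simple upper bound is O(n/2*log(n)). The second phase extracts the top k, which costs O(k*log(n)). Heap-based top-k is therefore bounded by O(n/2*log(n))+O(k*log(n))=O(n*log(n)), the same order as quicksort. Note that this bound is loose: most nodes sit near the bottom of the heap and sift down only a few levels, so bottom-up heap construction is in fact O(n) and the total is O(n+k*log(n)). That is the same order as a full sort when k is close to n, but smaller when k is much less than n, so identical measured times at small k may reflect the work both functions share (reading and tokenizing the file) rather than equal algorithmic cost.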
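The two phases described above can be sketched with the standard library's heapq module instead of the hand-written heap in this post. This is only an illustration under stated assumptions: the names topk_by_heap and topk_by_sort and the sample words are made up for the example, and ties are broken alphabetically so that both functions return the same order.

```python
import heapq
from collections import Counter

def topk_by_heap(pairs, k):
    """Top-k (word, count) pairs via a heap: one build, then k extractions."""
    # heapq implements a min-heap, so counts are negated to pop the largest first.
    heap = [(-count, word) for word, count in pairs]
    heapq.heapify(heap)                            # phase 1: build the heap
    k = max(0, min(k, len(heap)))
    top = [heapq.heappop(heap) for _ in range(k)]  # phase 2: k pops, O(log n) each
    return [(word, -neg_count) for neg_count, word in top]

def topk_by_sort(pairs, k):
    """Top-k via one full sort, whose cost does not depend on k."""
    return sorted(pairs, key=lambda p: (-p[1], p[0]))[:k]

# Hypothetical sample data standing in for a tokenized text.
words = "a b a c b a d e e e e".split()
pairs = list(Counter(words).items())

print(topk_by_heap(pairs, 2))  # [('e', 4), ('a', 3)]
print(topk_by_sort(pairs, 2))  # [('e', 4), ('a', 3)]
```

Both functions return the same result; what differs is how much of the ordering work is done before the first k items are available.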
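Trade-offs: in engineering practice we often need to compute the top-k of the same text many times, for different values of k. If the second phase of heap sort is rerun for every k, the total cost grows linearly with the number of queries. It is then better to sort once and afterwards simply take whatever top-k is wanted. Even so, nltk does not compute top-k in this quicksort-style way.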
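The sort-once idea can be sketched as follows. The class name SortedVocabulary and its top() method are hypothetical, introduced only for this illustration: the vocabulary is sorted a single time, and every later top-k query is a slice.

```python
from collections import Counter

class SortedVocabulary:
    """Hypothetical helper: sort word counts once, then answer any top-k by slicing."""

    def __init__(self, words):
        counts = Counter(words)
        # One O(n log n) sort, descending by count, ties broken alphabetically.
        self._ranked = sorted(counts.items(), key=lambda p: (-p[1], p[0]))

    def top(self, k):
        # Each query only copies k items; no further sorting or heap work is needed.
        return self._ranked[:k]

vocab = SortedVocabulary("a b a c b a d e e e e".split())
print(vocab.top(1))  # [('e', 4)]
print(vocab.top(3))  # [('e', 4), ('a', 3), ('b', 2)]
```

The one-time sort pays off when many different values of k are requested for the same text; for a single small k, a heap-based selection does less work.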

import math
import nltk
import jieba
import collections
import datetime
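# start and end are inclusive indices into the list heap; each item is a (word, count) pair
def shift(start,end,heap):
    # Sift heap[start] down until heap[start..end] satisfies the max-heap property.
    per_head=start
    max_index=2*per_head+1
    while max_index<=end:
        # Pick the child with the larger count.
        if max_index<end and heap[max_index][1]<heap[max_index+1][1]:
            max_index+=1
        if heap[per_head][1]<heap[max_index][1]:
            heap[per_head],heap[max_index]=heap[max_index],heap[per_head]
            # Move down to the child just swapped and keep sifting from there.
            per_head=max_index
            max_index=2*per_head+1
        else:
            break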
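def get_topk(heap,k):
    length=len(heap)
    k=min(k,length)
    if k<=0:
        return []
    # Phase 1: build a max-heap, starting from the last non-leaf node.
    for i in range(length//2-1,-1,-1):
        shift(i,length-1,heap)
    # Phase 2: move the current maximum to the end of the list, exactly k times.
    for i in range(length-1,length-k-1,-1):
        heap[0],heap[i]=heap[i],heap[0]
        shift(0,i-1,heap)
    # The last k items are the top-k, in ascending order of count.
    return heap[-k:]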
def nltk_topk(file_path,k):
    with open(file_path) as f:
        raw=f.read()
    text=nltk.text.Text(jieba.lcut(raw))
    fdist1=nltk.FreqDist(text)
    fre_dist_list=list(fdist1.items())
    print(len(fre_dist_list))
    # Full sort by count in ascending order; the last k entries are the top-k.
    m=sorted(fre_dist_list,key=lambda x:x[1])
    return m[-k:]
def my_topk(file_path,k):
    with open(file_path) as f:
        raw=f.read()
    word_list=jieba.lcut(raw)
    # Count word frequencies, then select the top-k with the hand-written heap.
    vocabulary=list(collections.Counter(word_list).items())
    return get_topk(vocabulary,k)
file_path='D:/test.txt'
my_start_time=datetime.datetime.now()
my_topk(file_path,1)
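The listing above stops before the elapsed times are printed. Below is a rough sketch of how the selection step alone could be timed, separately from file reading and jieba tokenization, which both functions share. Everything in it is an assumption made for illustration: the synthetic vocabulary, the helper names, and the use of heapq.nlargest as a stand-in for the hand-written heap. No timings are quoted here, since they depend on the machine and the data.

```python
import heapq
import random
import time

def time_call(func, *args):
    """Run func(*args) once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start

def heap_topk(pairs, k):
    # Heap-based selection from the standard library, largest counts first.
    return heapq.nlargest(k, pairs, key=lambda p: p[1])

def sort_topk(pairs, k):
    # Full sort followed by a slice, largest counts first.
    return sorted(pairs, key=lambda p: p[1], reverse=True)[:k]

random.seed(0)
# Hypothetical vocabulary: 200,000 distinct "words" with random counts.
vocabulary = [("w%d" % i, random.randint(1, 1000)) for i in range(200_000)]
k = 10

by_heap, heap_seconds = time_call(heap_topk, vocabulary, k)
by_sort, sort_seconds = time_call(sort_topk, vocabulary, k)

print("heap:", heap_seconds, "seconds")
print("sort:", sort_seconds, "seconds")
```

Timing only the selection step makes any difference between the two approaches visible; when the file is read and tokenized inside the timed region, as in my_topk and nltk_topk, that shared work can hide it.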
