Searchable Symmetric Encryption Scheme: Symmetric Ciphertext Retrieval

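Introduction:

In the IT world, advanced implementations of big-data security and cryptography seem hard to come by. A simple example: there are many implementations of inverted indexes, but very few that build ciphertext retrieval and an inverted index on top of encryption. This post implements retrieval over symmetrically encrypted data.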

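Dataset

Real-world dataset:
http://archive.ics.uci.edu/ml/datasets/Bag+of+Words
Enron Emails, NIPS full papers, NYTimes news articles. An index over the ciphertext is built from the keywords W, and the ciphertext is then searched through that index. D=39861 is the number of documents, W=28102 is the number of distinct words, and N=6,400,000 (approx) is the total number of words.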

Problem Description

Implement the following encryption scheme:
[Figure: definition of the encryption scheme]

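Environment

Python 3.7
PyCharm Professional

Cryptography (a Python cryptography library)

Cryptography is a third-party Python library that makes life easier for people who need cryptography. According to the official site, the goal of cryptography is to be "a cryptography package that is easy for people to use", just as requests is "an HTTP library that is easy for people to use". The idea is to let you build encryption schemes that are simple, secure, and easy to use. If needed, you can also use the lower-level cryptographic primitives, but that requires knowing many more details; otherwise what you build will not be secure.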

Reference links:
Official site:
https://cryptography.io/en/latest/

Python 3 encryption tutorial:
https://linux.cn/article-7676-1.html

Installation:
pip install cryptography

Imports (only the packages this post needs):
from cryptography.fernet import Fernet #used for the symmetric key generation
from cryptography.hazmat.backends import default_backend #used in making key from password
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.kdf.pbkdf2 import PBKDF2HMAC

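dataset

Dataset: this experiment uses the University of California Irvine dataset (Enron Emails, NIPS full papers, NYTimes news articles). An index over the ciphertext is built from the keywords W, and the ciphertext is then searched. D=39861 is the number of documents, W=28102 is the number of distinct words, and N=6,400,000 (approx) is the total number of words.
Taking the Enron Emails data as the first example, the data file looks like this:

[Figure: sample lines from the Enron Emails data file]

In each line, the first value is the document ID, the second is the word ID, and the third is the word frequency. Data in this form makes building the inverted index convenient.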
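Because each line already has the form "docID wordID count", the inverted index falls out of a single pass over the file. The following is a minimal sketch, assuming whitespace-separated triples and assuming that any line without exactly three fields (such as a count header) can be skipped. The function name build_inverted_index and the sample lines are made up for illustration.

```python
from collections import defaultdict


def build_inverted_index(lines):
    """Map each word ID to the sorted list of document IDs that contain it."""
    postings = defaultdict(set)
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # assumption: header or malformed lines are skipped
        doc_id, word_id, _count = (int(f) for f in fields)
        postings[word_id].add(doc_id)
    return {word: sorted(docs) for word, docs in postings.items()}


# Hypothetical sample in the "docID wordID count" layout
sample = ["1 118 1", "1 285 1", "5 285 2", "2 5511 1", "5 5511 3"]
index = build_inverted_index(sample)
print(index)
```

Converting the IDs to integers keeps the posting lists in numeric order, which differs from the listings in this post, where the IDs stay as strings.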

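Building the inverted index word_dict

This part builds a basic inverted index over the dataset. The traditional construction is sufficient; the core code is as follows: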

    for idx, val in enumerate(filenames):#val is the name of the file
        cnt = Counter()
        for line in open(filenames[idx], 'r'): 
            print(line)
            word_list = line.replace(',', '').replace('\'', '').replace('.', '').lower().split()
            for word in word_list:
                cnt[word] += 1
        filedata.append((val, cnt)) 


    for i in allwords:
        word_dict[i]
        for idx, val in enumerate(filedata):
            if i in val[1]:#val[1] is allwords of the value
                word_dict[i].append(val[0])# val[0] is the name of file
        word_dict[i].sort()

First we verify the code with two simple test files:
example1:
File1: "hello this is a test data file data file file"
File2: "also file data is a test file"

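With these two files as input, the resulting inverted index word_dict is shown below. The word "test" is absent because only the five most frequent words of each file are kept as keywords:
defaultdict(<class 'list'>, {'file': ['simple.txt', 'simple2.txt'], 'data': ['simple.txt', 'simple2.txt'], 'hello': ['simple.txt'], 'this': ['simple.txt'], 'is': ['simple.txt', 'simple2.txt'], 'also': ['simple2.txt'], 'a': ['simple.txt', 'simple2.txt']})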
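The result above can be reproduced with a compact sketch of the same idea: count words per file, keep the five most frequent words of each file as keywords, then list the files that contain each keyword. The function name top_words_index is hypothetical, and the file contents are passed in as strings rather than read from disk.

```python
from collections import Counter


def top_words_index(files, top_n=5):
    """Index only the top_n most frequent words of each file."""
    counts = {name: Counter(text.lower().split()) for name, text in files.items()}

    # Collect keywords in first-seen order, without duplicates
    keywords = []
    for counter in counts.values():
        for word, _ in counter.most_common(top_n):
            if word not in keywords:
                keywords.append(word)

    # For each keyword, list every file whose counter contains it
    return {
        word: sorted(name for name, counter in counts.items() if word in counter)
        for word in keywords
    }


files = {
    "simple.txt": "hello this is a test data file data file file",
    "simple2.txt": "also file data is a test file",
}
index = top_words_index(files)
print(index)
```

Note that a keyword chosen from one file is still matched against every file, which is why "a" lists both files even though it is a top-five word only in the second one.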

example2:
Building on this, we construct word_dict for the Enron Emails dataset. Because the dataset is large and its structure is hard to see from the full output, we run only on the data for the first 10 documents and obtain the word_dict below. Document IDs are kept as strings, so '10' sorts before '3':
defaultdict(<class 'list'>, {'118': ['1'], '285': ['1', '5'], '1229': ['1', '3'], '1688': ['1'], '2068': ['1', '2'], '5511': ['2', '5'], '19675': ['2'], '1197': ['2'], '9458': ['2'], '2233': ['2', '6'], '14050': ['3'], '26050': ['3'], '1976': ['3'], '3328': ['3'], '536': ['2', '3'], '22690': ['4'], '9404': ['4'], '4802': ['2', '4'], '19497': ['4'], '23690': ['4'], '19640': ['5'], '3182': ['2', '5'], '24409': ['5'], '25181': ['5'], '16151': ['6'], '1599': ['6'], '6993': ['2', '3', '6'], '13091': ['5', '6', '8'], '15091': ['6'], '6964': ['7'], '9464': ['7'], '10636': ['7'], '12107': ['7'], '14325': ['4', '7'], '4813': ['8'], '15088': ['10', '6', '8'], '25519': ['8'], '15291': ['8'], '1503': ['8'], '9970': ['9'], '22771': ['9'], '1267': ['9'], '4402': ['9'], '10258': ['9'], '6623': ['10', '8'], '13104': ['10', '3'], '19117': ['10', '6'], '171': ['10'], '5680': ['10']})

Complete indexing code:

import itertools
from itertools import permutations, combinations  # used for permutations
from cryptography.fernet import Fernet  # used for the symmetric key generation
from collections import Counter  # used to count most common word
from collections import defaultdict  # used to make the the distinct word list
from llist import dllist, dllistnode  # python linked list library
import base64  # used for base 64 encoding
import os
from cryptography.hazmat.backends import default_backend  # used in making key from password
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.kdf.pbkdf2 import PBKDF2HMAC
import random  # to select random key
import sys
import re
import bitarray  # for lookup table


def main():
    word_dict = intialization()  # change this call to choose how the index is built
    print(word_dict)
    word_dict = intialization2()
    print(word_dict)


############################################################################################

def intialization():
    '''
    Prompts user for documents to be encrypted and generates the distinct
    words in each. Returns the distinct words and the documents that contained them
    in a dictionary 'word_dict'
    '''

    filenames = []
    x = input("Please enter the name of a file you want to encrypt: ")  # filename
    filenames.append(x)
    while (True):
        x = input("\nEnter another file name or press enter if done: ")
        if not x:
            break
        filenames.append(x)
    # count the occurrences of each word in a file
    filedata = []
    for idx, val in enumerate(filenames):#val is the name of the file
        cnt = Counter()
        for line in open(filenames[idx], 'r'):  # iterating over a file yields one line at a time
            print(line)
            word_list = line.replace(',', '').replace('\'', '').replace('.', '').lower().split()
            for word in word_list:
                cnt[word] += 1
        filedata.append((val, cnt))  # (filename, word-frequency counter)
        print(filedata)

    # take the 5 most common words of each document as the distinct keywords (this restriction is optional)
    allwords = []
    for idx, val in enumerate(filedata):
        for value, count in val[1].most_common(5):
            if value not in allwords:
                allwords.append(value)
    print(allwords)
    # make a dictionary with each distinct word as key and a list of filenames as value
    word_dict = defaultdict(list)

    for i in allwords:
        word_dict[i]
        for idx, val in enumerate(filedata):
            if i in val[1]:#val[1] is allwords of the value
                word_dict[i].append(val[0])# val[0] is the name of file
        word_dict[i].sort()


    return word_dict


############################################################################################
def intialization2():

    filenames = ["data1.txt"]
    # count the occurrences of each word in a file
    filedata = []
    list1=[]
    docnum=0
    linenum=0
    for idx, val in enumerate(filenames):#val is the name of the file
        cnt = Counter()
        for line in open(filenames[idx], 'r'):  # the file is read line by line; each line holds three numbers
            a=1
            linenum+=1
            word_list = line.replace(',', '').replace('\'', '').replace('.', '').lower().split()
            for data in word_list:
                if a==1:  # the first field is the document ID
                    if linenum!=1:
                        if str(doc)!=str(data):
                            filedata.append((doc, cnt))
                            del cnt
                            cnt = Counter()
                            docnum+=1
                    doc=data
                    a+=1
                    continue
                if a==2:  # the second field is the word ID
                    term=data
                    a+=1
                    continue
                if a==3:
                    fre=data  # third field: the frequency, kept as a string, so most_common() below compares counts as text
            cnt[term] = fre
                #print("first line"+data)
        filedata.append((doc, cnt))
    #print(filedata)
    allwords = []
    for idx, val in enumerate(filedata):
        for value, count in val[1].most_common(5):
            if value not in allwords:
                allwords.append(value)
    #print(allwords)
    # make a dictionary with each distinct word as key and a list of document IDs as value
    word_dict = defaultdict(list)
    for i in allwords:
        word_dict[i]
        for idx, val in enumerate(filedata):
            if i in val[1]:  # val[1] is allwords of the value
                word_dict[i].append(val[0])  # val[0] is the name of file
        word_dict[i].sort()
    #print(word_dict)
    return word_dict
if __name__ == '__main__':
	main()

Algorithm

Generate the keys from the input password:

key_s, key_y, key_z = keygen(password)
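The idea behind keygen is to stretch one password into three independent keys by running a password-based key derivation function with three different salts. The sketch below shows this with the standard library's hashlib.pbkdf2_hmac instead of the cryptography package used in this post. The function name derive_keys and the short salts are placeholders chosen for illustration.

```python
import base64
import hashlib


def derive_keys(password, salts, iterations=100_000):
    """Derive one 32-byte key per salt with PBKDF2-HMAC-SHA256, base64-encoded."""
    keys = []
    for salt in salts:
        raw = hashlib.pbkdf2_hmac(
            "sha256", password.encode(), salt, iterations, dklen=32
        )
        keys.append(base64.urlsafe_b64encode(raw))
    return keys


# Placeholder salts (assumption); real salts should be random bytes
key_s, key_y, key_z = derive_keys("123456", [b"salt-s", b"salt-y", b"salt-z"])
print(len(key_s), key_s != key_y, key_y != key_z)
```

The same password and salts always give the same keys, which is what lets the data owner regenerate them later from the password alone.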

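Build nodes to hold the data, encrypt them, and store them in an array:
After the keys are generated, we build a chain of nodes for all documents that contain keyword wi. Each node has three parts: the document id, the encryption key of the next node, and the address of the next node.
[Figure: structure of node N(i,j)]
We encrypt node N(i,j) with key K(i,j-1) and store it in the array at the position given by the pseudorandom function keyed with K1 applied to ctr (psuedo_random(key_s, ctr) in the code), then set ctr = ctr + 1.
Implementation:
(1) Initialize the array A, the address counter, and the list of encryption keys: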

A = [0] * 10000
ctr = 1
keyword_key_pair = []

(2) For each word in the word list, generate a key for its first node:

K_i_0 = Fernet.generate_key()
keyword_key_pair.append([i, K_i_0, ctr])

(3) For each document in the word's list (1 <= j <= |D(wi)|), build a node:

# N(i,j) = id(D(i,j)) || K(i,j) || v_s(ctr+1)
K_i_j = Fernet.generate_key()
curr_addr = psuedo_random(key_s, ctr)
N = doc + "\n" + str(K_i_j) + "\n" + str(next_addr)

Pay special attention to the last document, which is the tail of the pointer chain:

if j == len(doc_list) - 1:
    next_addr = None
else:
    next_addr = psuedo_random(key_s, ctr+1)

(4) The node itself must of course also be encrypted, using K(i,j-1):

N = Fernet(K_i_jminus1).encrypt(str.encode(N))

(5) Store the encrypted node in the array and update the state:

A[curr_addr] = N
K_i_jminus1 = K_i_j
ctr = ctr + 1
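Steps (1) to (5) amount to an encrypted linked list: each node is locked with the key stored in the previous node, so only someone holding the first key can walk the chain. The sketch below illustrates this with the standard library only. The keystream_xor function is a toy stand-in for Fernet and is not secure; the slot numbers are passed in explicitly instead of coming from psuedo_random. All names here are hypothetical.

```python
import hashlib
import os


def keystream_xor(key, data):
    """Toy cipher (NOT secure): XOR data with a SHA-256 counter keystream."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        pad = hashlib.sha256(key + offset.to_bytes(8, "big")).digest()
        chunk = data[offset:offset + 32]
        out.extend(b ^ p for b, p in zip(chunk, pad))
    return bytes(out)


def build_chain(doc_ids, first_key, slots):
    """Store one encrypted node per document: doc id, next key, next slot."""
    array = {}
    key = first_key
    for j, doc in enumerate(doc_ids):
        next_key = os.urandom(16)
        # The last node has no successor, so its pointer is None
        next_slot = slots[j + 1] if j + 1 < len(doc_ids) else None
        node = f"{doc}\n{next_key.hex()}\n{next_slot}".encode()
        array[slots[j]] = keystream_xor(key, node)  # encrypt with the previous key
        key = next_key
    return array


def walk_chain(array, first_slot, first_key):
    """Decrypt node after node, following the stored key and slot."""
    docs, slot, key = [], first_slot, first_key
    while slot is not None:
        plain = keystream_xor(key, array[slot]).decode()
        doc, next_key, next_slot = plain.split("\n")
        docs.append(doc)
        key = bytes.fromhex(next_key)
        slot = None if next_slot == "None" else int(next_slot)
    return docs


first_key = os.urandom(16)
array = build_chain(["1", "5", "9"], first_key, slots=[42, 7, 99])
print(walk_chain(array, 42, first_key))
```

Passing distinct slots by hand sidesteps a detail worth checking in a real build: if two counters map to the same array position, the later node overwrites the earlier one.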

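Build the look-up table T, which stores the location of the first node of each wi's document list, XORed with the output of the function f_y:
[Figure: definition of a look-up table entry]
(1) Set up the pseudorandom permutation: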

random.seed(keyword + str(key_z))
index = random.randrange(0, 1000)

(2) Compute <addr(A(N(i,1))) || K_i_0>:

addr = psuedo_random(key_s,ctr)
value = str(addr) + "\n" + str(key)

(3) XOR the value with f_y:

cat_string = []  # list of character codes, empty to begin
for m in value:
    # append the ASCII value of each character in value
    cat_string.append(ord(m))
value = [f_y ^ x for x in cat_string]

(4) Set every element that is still zero to a random encrypted value:

for ind,val in enumerate(T):
    if (val == 0):
        x = random.randrange(0, 10000)
        x = str(x)
        x = Fernet(key_s).encrypt(str.encode(x))
        T[ind] = x
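The masking in step (3) works because XOR is its own inverse: applying the same mask twice returns the original value. A small sketch of the round trip, using hypothetical helper names and a made-up table entry:

```python
def xor_mask(text, mask):
    """Mask each character code of text with the integer mask."""
    return [ord(ch) ^ mask for ch in text]


def xor_unmask(values, mask):
    """Undo xor_mask: (x ^ mask) ^ mask == x for every character code."""
    return "".join(chr(v ^ mask) for v in values)


entry = "4821\nb'first-node-key'"  # made-up "address\nkey" value
masked = xor_mask(entry, 733)
print(xor_unmask(masked, 733) == entry)
```

Unmasking with a different mask yields a different string, which is why the server needs the f_y value from the trapdoor to read the entry.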

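With the steps above done, the encryption of the index nodes is essentially complete: we have built the array A and the table T, and can move on to the query part.
Generate the trapdoor
To keep the server from learning what we are searching for, the query keyword must also be protected; the value sent in its place is called the trapdoor. The generated trapdoor has the form:
[Figure: definition of the trapdoor]
(1) Return the pseudorandom permutation value and the pseudorandom function value for the keyword: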

random.seed(keyword + str(key_z))
index = random.randrange(0, 1000)
random.seed(keyword + str(key_y))
f_y = random.randrange(0,1000)
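The code above obtains both trapdoor values by seeding Python's random module with the keyword and a key. A common alternative for a keyed pseudorandom function is HMAC; the sketch below swaps that in and is not what this post's code does. The names prf and make_trapdoor and the example keys are hypothetical, while the table size of 1000 follows the post.

```python
import hashlib
import hmac

TABLE_SIZE = 1000  # same table size as T in this post


def prf(key, keyword, modulus):
    """Keyed pseudorandom function: HMAC-SHA256 reduced to the range [0, modulus)."""
    digest = hmac.new(key, keyword.encode(), hashlib.sha256).digest()
    return int.from_bytes(digest, "big") % modulus


def make_trapdoor(keyword, key_z, key_y):
    """Return (table index, XOR mask) for the keyword."""
    return prf(key_z, keyword, TABLE_SIZE), prf(key_y, keyword, TABLE_SIZE)


index, f_y = make_trapdoor("118", b"example-key-z", b"example-key-y")
print(index, f_y)
```

Like the seeded version, this maps keywords into a fixed range rather than permuting them, so two keywords can land on the same table index.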

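Search
In the previous steps we obtained T, A, and the trapdoor. Given a trapdoor, the server looks up the address of the first node in T and then keeps following the next-node pointers in A to complete the search. Finally the server returns the matching document identifiers to the data owner. The mathematical property used here is that (a XOR b) XOR b = a, which is the key to decryption.
(1) XOR the stored values with f_y to obtain a list of characters containing the node's address and key: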

addr_and_key = [chr(f_y ^ x) for x in value]
mystring = ''
for x in addr_and_key:
    mystring = mystring + x

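mystring here holds the address and the key of the first node, separated by a newline.
With these we can decrypt the node, obtain its doc-id, and append it to the results.

split_node = re.split(r"\\n", d_n)  # d_n comes from str(bytes), so the newline appears as the two characters \ and n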
doc_id = split_node[0]
key = split_node[1]
addr = split_node[2]
list_of_docs.append(doc_id)
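The same parsing can be done without converting the bytes through str(): decoding the decrypted node gives real newlines to split on. This is a sketch under the assumption that the node was built as in this post, that is doc + "\n" + str(key) + "\n" + str(next_addr). The function name parse_node and the sample key are made up.

```python
def parse_node(plaintext):
    """Split a decrypted node into (doc_id, next_key, next_addr)."""
    doc_id, key_text, addr_text = plaintext.decode().split("\n")
    # The node stores str(key), i.e. the text "b'...'", so strip the wrapper
    next_key = key_text[2:-1].encode()
    # The tail node stores the text "None" instead of an address
    next_addr = None if addr_text == "None" else int(addr_text)
    return doc_id, next_key, next_addr


# Made-up nodes in the layout used by build_array
tail = ("7" + "\n" + str(b"c2VjcmV0") + "\n" + str(None)).encode()
middle = ("5" + "\n" + str(b"c2VjcmV0") + "\n" + str(1234)).encode()
print(parse_node(tail), parse_node(middle))
```

Returning None for the tail lets the search loop test the address directly instead of comparing against the string 'None'.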

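Results:

With 123456 as the password, we get the following:
[Screenshot: program output]
Once A and T are built we can search; taking 118 as an example:
[Screenshots: search output for keyword 118]
Source code: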


import itertools 
from itertools import permutations, combinations #used for permutations
from cryptography.fernet import Fernet #used for the symmetric key generation
from collections import Counter #used to count most common word
from collections import defaultdict # used to make the the distinct word list
from llist import dllist, dllistnode # python linked list library
import base64 #used for base 64 encoding
import os 
from cryptography.hazmat.backends import default_backend #used in making key from password
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.kdf.pbkdf2 import PBKDF2HMAC
import random #to select random key
import sys
import re
import bitarray #for lookup table


def main():
	print("Welcome to Searchable Symmetric Encryption.\n\n")
	reply = input("Do you already have an encrypted data set? (Y)es or (N)o: ")
	while(True):
		#if yes then you have already generated keys and just want to search
		if(reply.lower() == 'y' or reply.lower() == "yes"):
			password = input("Please enter the password: ")
			break

		#if no then need to generate symmetric keys
		elif (reply.lower() == 'n' or reply.lower() == "no"):
			password = None
			while(True):
				password1 = input("Please choose a password: ")
				password2 = input("Please re-enter the password: ")
				if(password1 == password2):
					password = password1
					print("")
					break
				print("Passwords not the same try again\n")
			# the actual work starts here
			key_s, key_y, key_z = keygen(password)

			word_dict = intialization2()  # change this call to choose how the index is built

			A, keyword_key_pair = build_array(word_dict, key_s, key_y, key_z)  # A is the node array; keyword_key_pair holds [keyword, first key, counter] per keyword

			T = look_up_table(keyword_key_pair, key_s, key_y, key_z)

			print("\n\nWelcome!")
			keyword = input("\nPlease enter the keyword to search, or 'exit' to exit: ")

			while keyword != 'exit':
				
				trapdoor = Trapdoor(keyword, key_z, key_y)		# trapdoor

				list_of_docs = Search(T, A, trapdoor)       # search

				print(f"\nSearch Results for \"{keyword}\":\n")
				for i in list_of_docs:
					print(i)

				keyword = input("\nPlease enter the keyword to search, or 'exit' to exit: ")

			print("\n\nGoodbye!\n")

			break

		#this just makes sure user enters yes or no or y or n
		else:
			reply = input("\nInput Y for yes or N for no: ")



############################################################################################

def intialization():
	''' 
	Prompts user for documents to be encrypted and generates the distinct 
	words in each. Returns the distinct words and the documents that contained them
	in a dictionary 'word_dict'
	'''

	filenames = []
	x = input("Please enter the name of a file you want to encrypt: ")        #filename
	filenames.append(x)
	while(True):
		x = input("\nEnter another file name or press enter if done: ")
		if not x:
			break
		filenames.append(x)
	# count the occurrences of each word in a file
	filedata = []
	for idx, val in enumerate(filenames):
		cnt = Counter()
		for line in open(filenames[idx], 'r'):
			word_list = line.replace(',','').replace('\'','').replace('.','').lower().split()
			for word in word_list:
				cnt[word]+=1
		filedata.append((val,cnt))
			
	#takes the 5 most common from each document as the distinct words
	allwords = []
	for idx, val in enumerate(filedata):
		for value, count in val[1].most_common(5):
			if  value not in allwords:
				allwords.append(value)

	# make a dictionary with each distinct word as key and a list of filenames as value
	word_dict = defaultdict(list)
	for i in allwords:
		word_dict[i]
		for idx, val in enumerate(filedata):
			if i in val[1]:
				word_dict[i].append(val[0])
		word_dict[i].sort()

	return word_dict

############################################################################################

def keygen(u_password):
	''' Generates 3 keys, key s,y,z, based on the given password from the user. '''

	# This is input in the form of a string
	password_provided = u_password 

	# Convert to type bytes
	password = password_provided.encode() 

	salt_s = b'\x91\xabr\xebx\xc5\x9dx^b_7\xb6\x8a\xbb5'
	salt_y = b'\x1cy8\r\x7f\xf8,\xe2Pu!/\x043\xdc\x0e'
	salt_z = b'\x9b\xd0\xb6\x85!J\xde\xe5\xc8\xb3\xc9\xa2\tqPy'

	kdf_s = PBKDF2HMAC(
	    algorithm=hashes.SHA256(),
	    length=32,
	    salt=salt_s,
	    iterations=100000,
	    backend=default_backend()
	)
	key_s = base64.urlsafe_b64encode(kdf_s.derive(password))

	kdf_y = PBKDF2HMAC(
	    algorithm=hashes.SHA256(),
	    length=32,
	    salt=salt_y,
	    iterations=100000,
	    backend=default_backend()
	)
	key_y = base64.urlsafe_b64encode(kdf_y.derive(password))

	kdf_z = PBKDF2HMAC(
	    algorithm=hashes.SHA256(),
	    length=32,
	    salt=salt_z,
	    iterations=100000,
	    backend=default_backend()
	)
	key_z = base64.urlsafe_b64encode(kdf_z.derive(password))

	#returns three base_64 encoded keys
	return key_s, key_y, key_z


############################################################################################
def psuedo_random(key_s, ctr):
	''' A pseudorandom function based on key s, to return a value to index array A '''

	#Convert key s to decimal value
	decimal_key = int.from_bytes(key_s, byteorder=sys.byteorder)
	combined = decimal_key + ctr

	#Find a random value based on key s and counter
	random.seed(combined)
	index = random.randrange(0, 10000)
	return index

############################################################################################
def build_array(word_dict, key_s, key_y, key_z):
	'''
	Creates an array of nodes, each containing the document id, key to encrypt the
	next node, and the address of the next node
	'''

	A = [0] * 10000
	ctr = 1
	keyword_key_pair = []

	#for each word in set of distinct words, word_dict in this case	
	for i, doc_list in word_dict.items():

		#Generate a key for the first node
		K_i_0 = Fernet.generate_key()
		keyword_key_pair.append([i, K_i_0, ctr])

		#initialize the previous key to the first one created
		K_i_jminus1 = K_i_0 
		
		# for 1 <= j <= |D(wi)|:
		# for each document which has distinct word wi....iterate through doc_list
		for j, doc in enumerate(doc_list):
			
			#again generate key K(i,j)
			K_i_j = Fernet.generate_key()

			#N(i,j) = (id(D(i,j) || K(i,j) || v(s)(ctr+1)), where id(D(i,j) is the jth identifier in D(wi)
	
			curr_addr = psuedo_random(key_s, ctr)
			if j == len(doc_list) - 1:
				next_addr = None
			else:
				next_addr = psuedo_random(key_s, ctr+1)# return a random value
			
			N = doc + "\n" + str(K_i_j) + "\n" + str(next_addr)
			# newline is a delimiter separating the three components of the node string
			#N = doc + K_i_j + address of next node. 

			#encrypt N with Ki,j-1, ie the previous key
			N = Fernet(K_i_jminus1).encrypt(str.encode(N))

			#update and save K at i,j-1
			K_i_jminus1 = K_i_j 

			#store the encrypted N in the array
			A[curr_addr] = N
	
			#update counter
			ctr = ctr + 1

	# Filling in the rest of the array with random encrypted data
	for ind,val in enumerate(A):
		if (val == 0):
			x = random.randrange(0, 10000)
			x = str(x)
			x = Fernet(key_s).encrypt(str.encode(x))
			A[ind] = x


	return A, keyword_key_pair

############################################################################################
def look_up_table(keyword_key_pair,key_s, key_y, key_z):
	'''
	Generates a table which stores the XORed result of permutation
	function f_y and the address of a node concatenated with the key
	'''
	T = [0] * 1000
	for i in keyword_key_pair:
		keyword = i[0]
		key = i[1]
		ctr = i[2]

		# pseudorandom permutation on z
		random.seed(keyword + str(key_z))
		index = random.randrange(0, 1000)

		#computes value <addr(A(Ni,1)||K_i_0)>
		addr = psuedo_random(key_s,ctr)
		value = str(addr) + "\n" + str(key)

		#computed 'f_y(w_i)'
		random.seed(keyword + str(key_y))
		f_y = random.randrange(0,1000)

		#XOR value with f_y
		cat_string = []	#empty string to begin
		for m in value:
			#concatenate ascii value of each character in value
			cat_string.append(ord(m))

		value = [f_y ^ x for x in cat_string]

		T[index] = value
	

	#set all elements equal to zero as some random key value
	for ind,val in enumerate(T):
		if (val == 0):
			x = random.randrange(0, 10000)
			x = str(x)
			x = Fernet(key_s).encrypt(str.encode(x))
			T[ind] = x
	return T

def Trapdoor(keyword, key_z, key_y):
	'''
	returns the permutation function
	and pseudorandom permutation function on keyword
	'''
	random.seed(keyword + str(key_z))
	index = random.randrange(0, 1000)
	
	#the pseudo-random function 'f_y(w)'
	random.seed(keyword + str(key_y))
	f_y = random.randrange(0,1000)

	return (index, f_y)


def Search(T, A, trapdoor):
	'''
	Indexes both T and A with trapdoor values generated by keyword 
	in main to find and decrypt the document ids
	'''

	list_of_docs = []

	value = T[trapdoor[0]]

	f_y = trapdoor[1]

	#XORs the ascii value with f_y to obtain list version of string 
	#	containing the address and the key for the node
	addr_and_key = [chr(f_y ^ x) for x in value]
	
	#converts the list into one string
	mystring = ''
	for x in addr_and_key:
		mystring = mystring + x

	#addr_node is a list.
	addr_node = re.split(r"\n", str(mystring))

	#if addr_node isn't two separate items, we didn't find a document
	if len(addr_node) == 1:
		print("\n")
	else:

		addr = addr_node[0]
		key = addr_node[1]

		#remove b' at the beginning and ' at the end
		key = key[2:-1]

		#get the node based on the address from array A
		node = A[int(addr)]

		#turn key back into bytes and use Fernet function to 
		#	decrypt back to plaintext
		decrypted_node = Fernet(str.encode(key)).decrypt(node)
		
		#remove b' at the beginning and ' at the end
		d_n = str(decrypted_node)[2:-1]
		split_node = re.split(r"\\n", d_n)
		doc_id = split_node[0]
		key = split_node[1]
		addr = split_node[2]

		list_of_docs.append(doc_id)

		#Repeat iterating while the address is not null, meaning
		#	 there are still documents with the keyword
		while addr != 'None':
			key = key[2:-1]
			key = str.encode(key)
			node = A[int(addr)]
			decrypted_node = Fernet(key).decrypt(node)
			d_n = str(decrypted_node)[2:-1]
			split_node = re.split(r"\\n", d_n)
			doc_id = split_node[0]
			key = split_node[1]
			addr = split_node[2]
			list_of_docs.append(doc_id)

	return list_of_docs
def intialization2():

    filenames = ["data2.txt"]
    # count the occurrences of each word in a file
    filedata = []
    docnum=0
    linenum=0
    for idx, val in enumerate(filenames):#val is the name of the file
        cnt = Counter()
        for line in open(filenames[idx], 'r'):  # the file is read line by line; each line holds three numbers
            a=1
            linenum+=1
            word_list = line.replace(',', '').replace('\'', '').replace('.', '').lower().split()
            for data in word_list:
                if a==1:  # the first field is the document ID
                    if linenum!=1:
                        if str(doc)!=str(data):
                            filedata.append((doc, cnt))
                            del cnt
                            cnt = Counter()
                            docnum+=1
                    doc=data
                    a+=1
                    continue
                if a==2:  # the second field is the word ID
                    term=data
                    a+=1
                    continue
                if a==3:
                    fre=data  # third field: the frequency, kept as a string, so most_common() below compares counts as text
            cnt[term] = fre
                #print("first line"+data)
        filedata.append((doc, cnt))
    #print(filedata)
    allwords = []
    for idx, val in enumerate(filedata):
        for value, count in val[1].most_common(2):
            if value not in allwords:
                allwords.append(value)
    #print(allwords)
    # make a dictionary with each distinct word as key and a list of document IDs as value
    word_dict = defaultdict(list)
    for i in allwords:
        word_dict[i]
        for idx, val in enumerate(filedata):
            if i in val[1]:  # val[1] is allwords of the value
                word_dict[i].append(val[0])  # val[0] is the name of file
        word_dict[i].sort()
    #print(word_dict)
    return word_dict
if __name__ == '__main__':
	main()
