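Introduction:
In the IT world, worked implementations of big-data security and advanced cryptography are hard to find. A simple example: there are many implementations of the inverted index, but very few that build ciphertext search and an inverted index on top of encryption. This post implements search over ciphertext using symmetric encryption.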
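Dataset
Real-world dataset:
http://archive.ics.uci.edu/ml/datasets/Bag+of+Words
It contains Enron Emails, NIPS full papers and NYTimes news articles. An index over the ciphertext is built from the keywords W, and the ciphertext is then searched. D=39861 is the number of documents, W=28102 is the number of distinct words, and N=6,400,000 (approx) is the total number of words.
Problem description
Implement the following encryption scheme.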
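Environment
Python 3.7
PyCharm Professional
cryptography (Python cryptography library)
cryptography is a third-party Python library that makes cryptography convenient to use. According to its official documentation, the goal of cryptography is to be a cryptography package that is easy for people to use, just as requests is an HTTP library that is easy for people to use. The idea is to let you build encryption schemes that are simple, secure and easy to use. If necessary, you can also use the low-level cryptographic primitives, but that requires knowing many more details; otherwise what you build will be insecure.
Reference links:
Official site:
https://cryptography.io/en/latest/
Python 3 cryptography tutorial:
https://linux.cn/article-7676-1.html
Installation:
pip install cryptography
Imports (only the packages needed for this post are listed)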
from cryptography.fernet import Fernet #used for the symmetric key generation
from cryptography.hazmat.backends import default_backend #used in making key from password
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.kdf.pbkdf2 import PBKDF2HMAC
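Dataset: this experiment uses the Bag of Words dataset from the University of California, Irvine (Enron Emails, NIPS full papers, NYTimes news articles), with the D, W and N values listed above.
Taking the Enron Emails data as the first example, the data file has the following form:
The first value on each line is the document ID, the second is the word ID, and the third is the word's frequency in that document. This layout is convenient when building the inverted index.
Building the inverted index word_dict
This part builds a basic inverted index over the dataset using the conventional approach. The core code is: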
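To make the three-column layout concrete, here is a minimal sketch that turns such lines into an inverted index. The function name build_inverted_index and the sample triples are made up for illustration; lines that do not hold exactly three fields are skipped, on the assumption that they are blank lines or the header lines found at the top of the raw UCI files.

```python
from collections import defaultdict


def build_inverted_index(lines):
    """Map each word ID to the sorted list of document IDs that contain it."""
    index = defaultdict(set)
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank lines and header lines
        doc_id, word_id, _count = (int(p) for p in parts)
        index[word_id].add(doc_id)
    return {word: sorted(docs) for word, docs in index.items()}


# Made-up sample in "docID wordID count" form
sample = ["1 118 1", "1 285 1", "2 285 3", "2 5511 2"]
index = build_inverted_index(sample)
print(index[285])  # [1, 2]
```

Converting the fields to int keeps document IDs in numeric order, which matters once IDs reach two digits.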
for idx, val in enumerate(filenames):  # val is the name of the file
    cnt = Counter()
    for line in open(filenames[idx], 'r'):
        print(line)
        word_list = line.replace(',', '').replace('\'', '').replace('.', '').lower().split()
        for word in word_list:
            cnt[word] += 1
    filedata.append((val, cnt))
for i in allwords:
    word_dict[i]  # touching the key creates an empty list in the defaultdict
    for idx, val in enumerate(filedata):
        if i in val[1]:  # val[1] is the word counter of the file
            word_dict[i].append(val[0])  # val[0] is the name of the file
    word_dict[i].sort()
First we verify it with two simple test files:
example1:
File1: "hello this is a test data file data file file"
File2: "also file data is a test file"
With these two files as input, the resulting inverted index word_dict is:
defaultdict(<class 'list'>, {'file': ['simple.txt', 'simple2.txt'], 'data': ['simple.txt', 'simple2.txt'], 'hello': ['simple.txt'], 'this': ['simple.txt'], 'is': ['simple.txt', 'simple2.txt'], 'also': ['simple2.txt'], 'a': ['simple.txt', 'simple2.txt']})
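example2:
Building on these two files, we construct word_dict for the Enron Emails dataset. Because the dataset is large and the structure is hard to see in the full output, we run only on the data for the first 10 documents and get the following word_dict: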
defaultdict(<class 'list'>, {'118': ['1'], '285': ['1', '5'], '1229': ['1', '3'], '1688': ['1'], '2068': ['1', '2'], '5511': ['2', '5'], '19675': ['2'], '1197': ['2'], '9458': ['2'], '2233': ['2', '6'], '14050': ['3'], '26050': ['3'], '1976': ['3'], '3328': ['3'], '536': ['2', '3'], '22690': ['4'], '9404': ['4'], '4802': ['2', '4'], '19497': ['4'], '23690': ['4'], '19640': ['5'], '3182': ['2', '5'], '24409': ['5'], '25181': ['5'], '16151': ['6'], '1599': ['6'], '6993': ['2', '3', '6'], '13091': ['5', '6', '8'], '15091': ['6'], '6964': ['7'], '9464': ['7'], '10636': ['7'], '12107': ['7'], '14325': ['4', '7'], '4813': ['8'], '15088': ['10', '6', '8'], '25519': ['8'], '15291': ['8'], '1503': ['8'], '9970': ['9'], '22771': ['9'], '1267': ['9'], '4402': ['9'], '10258': ['9'], '6623': ['10', '8'], '13104': ['10', '3'], '19117': ['10', '6'], '171': ['10'], '5680': ['10']})
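Note that the document IDs in this output are strings, so an entry such as ['10', '6', '8'] is in lexicographic order rather than numeric order. If numeric order is wanted, one option is to sort with key=int, as in this small sketch (the helper name sort_doc_ids is hypothetical):

```python
def sort_doc_ids(doc_ids):
    """Sort document IDs stored as strings by their numeric value."""
    return sorted(doc_ids, key=int)


doc_ids = ['10', '6', '8']
print(sorted(doc_ids))        # ['10', '6', '8']  (string comparison)
print(sort_doc_ids(doc_ids))  # ['6', '8', '10']  (numeric comparison)
```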
Complete indexing code:
import itertools
from itertools import permutations, combinations # used for permutations
from cryptography.fernet import Fernet # used for the symmetric key generation
from collections import Counter # used to count most common word
from collections import defaultdict # used to make the the distinct word list
from llist import dllist, dllistnode # python linked list library
import base64 # used for base 64 encoding
import os
from cryptography.hazmat.backends import default_backend # used in making key from password
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.kdf.pbkdf2 import PBKDF2HMAC
import random # to select random key
import sys
import re
import bitarray # for lookup table
def main():
    word_dict = intialization()  # if you want to modify the input, this is the place
    print(word_dict)
    word_dict = intialization2()
    print(word_dict)
############################################################################################
def intialization():
    '''
    Prompts user for documents to be encrypted and generates the distinct
    words in each. Returns the distinct words and the documents that contained them
    in a dictionary 'word_dict'
    '''
    filenames = []
    x = input("Please enter the name of a file you want to encrypt: ")  # filename
    filenames.append(x)
    while (True):
        x = input("\nEnter another file name or press enter if done: ")
        if not x:
            break
        filenames.append(x)
    # finds the occurrence of each word in a file
    filedata = []
    for idx, val in enumerate(filenames):  # val is the name of the file
        cnt = Counter()
        for line in open(filenames[idx], 'r'):  # iterating over a file yields one line at a time
            print(line)
            word_list = line.replace(',', '').replace('\'', '').replace('.', '').lower().split()
            for word in word_list:
                cnt[word] += 1
        filedata.append((val, cnt))  # (filename, word-frequency counter)
    print(filedata)
    # takes the 5 most common from each document as the distinct words; in fact, this is not necessary
    allwords = []
    for idx, val in enumerate(filedata):
        for value, count in val[1].most_common(5):
            if value not in allwords:
                allwords.append(value)
    print(allwords)
    # makes a dictionary with the distinct word as index and a list of filenames as value
    word_dict = defaultdict(list)
    for i in allwords:
        word_dict[i]
        for idx, val in enumerate(filedata):
            if i in val[1]:  # val[1] is the word counter of the file
                word_dict[i].append(val[0])  # val[0] is the name of the file
        word_dict[i].sort()
    return word_dict
############################################################################################
def intialization2():
    filenames = ["data1.txt"]
    # finds the occurrence of each word in a file
    filedata = []
    list1 = []
    docnum = 0
    linenum = 0
    for idx, val in enumerate(filenames):  # val is the name of the file
        cnt = Counter()
        for line in open(filenames[idx], 'r'):  # read line by line; each line holds three numbers
            a = 1
            linenum += 1
            word_list = line.replace(',', '').replace('\'', '').replace('.', '').lower().split()
            for data in word_list:
                if a == 1:  # the first field is the document ID
                    if linenum != 1:
                        if str(doc) != str(data):
                            # a new document starts: save the finished one
                            filedata.append((doc, cnt))
                            del cnt
                            cnt = Counter()
                            docnum += 1
                    doc = data
                    a += 1
                    continue
                if a == 2:  # the second field is the word ID
                    term = data
                    a += 1
                    continue
                if a == 3:  # the third field is the frequency
                    fre = data
                    cnt[term] = int(fre)  # store as int so most_common() ranks numerically
        # print("first line"+data)
        filedata.append((doc, cnt))  # save the last document
    # print(filedata)
    allwords = []
    for idx, val in enumerate(filedata):
        for value, count in val[1].most_common(5):
            if value not in allwords:
                allwords.append(value)
    # print(allwords)
    # makes a dictionary with the distinct word as index and a list of document IDs as value
    word_dict = defaultdict(list)
    for i in allwords:
        word_dict[i]
        for idx, val in enumerate(filedata):
            if i in val[1]:  # val[1] is the word counter of the document
                word_dict[i].append(val[0])  # val[0] is the document ID
        word_dict[i].sort()
    # print(word_dict)
    return word_dict


if __name__ == '__main__':
    main()
Algorithm
Generate the keys from the entered password:
key_s, key_y, key_z = keygen(password)
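As a rough sketch of what deriving three keys from one password can look like, the block below uses hashlib.pbkdf2_hmac from the standard library as a stand-in for the cryptography package's PBKDF2HMAC class. The function name derive_keys and the salt values are placeholders chosen for illustration; each key is derived with its own salt and base64-encoded to the 44-character form that Fernet keys use.

```python
import base64
import hashlib


def derive_keys(password, salts, iterations=100_000):
    """Derive one 32-byte key per salt from the same password (PBKDF2-HMAC-SHA256)."""
    raw = password.encode()
    return tuple(
        base64.urlsafe_b64encode(
            hashlib.pbkdf2_hmac("sha256", raw, salt, iterations, dklen=32)
        )
        for salt in salts
    )


# Placeholder salts; a different salt per key keeps the three keys independent
SALTS = (b"salt-for-key-s", b"salt-for-key-y", b"salt-for-key-z")
key_s, key_y, key_z = derive_keys("123456", SALTS)
print(len(key_s))  # 44
```

Because the derivation is deterministic, entering the same password again reproduces the same three keys.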
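Build nodes to hold the data, encrypt them, and store them in an array:
Once the keys are generated, build a list of nodes for all documents that contain the keyword wi. Each node has three parts: the document ID, the key that encrypts the next node, and the address of the next node.
Node N(i,j) is encrypted with key K(i,j-1) and stored in the array at position psuedo_random(key_s, ctr); then ctr = ctr + 1.
Implementation:
(1) Initialize the array A, the address counter ctr, and the list of keyword/key pairs: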
A = [0] * 10000
ctr = 1
keyword_key_pair = []
(2) For each word in the word list, generate a key for its first node
K_i_0 = Fernet.generate_key()
keyword_key_pair.append([i, K_i_0, ctr])
(3) For each document that contains the word (1 <= j <= |D(wi)|), build a node
N(i,j) = id(D(i,j)) || K(i,j) || v_s(ctr+1)
K_i_j = Fernet.generate_key()
curr_addr = psuedo_random(key_s, ctr)
N = doc + "\n" + str(K_i_j) + "\n" + str(next_addr)
Pay special attention to the last document, i.e. the tail of the list, which has no next address:
if j == len(doc_list) - 1:
    next_addr = None
else:
    next_addr = psuedo_random(key_s, ctr+1)
(4) The node itself is of course also encrypted, with K(i,j-1)
N = Fernet(K_i_jminus1).encrypt(str.encode(N))
(5) Store the encrypted node in the array and update the state:
A[curr_addr] = N
K_i_jminus1 = K_i_j
ctr = ctr + 1
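The steps above can be sketched end to end as a small encrypted linked list. This is a simplified illustration, not the post's exact code: build_chain and walk_chain are hypothetical names, addresses are sequential integers in a dict instead of pseudo-random slots in A, and the tail node carries an empty address. It assumes the cryptography package is installed.

```python
from cryptography.fernet import Fernet


def build_chain(doc_ids, start_addr=0):
    """Encrypt doc_ids as a linked list; return (storage, first_key, first_addr)."""
    storage = {}
    first_key = prev_key = Fernet.generate_key()
    for j, doc in enumerate(doc_ids):
        next_key = Fernet.generate_key()
        # The tail node has no successor, so its address field is empty
        next_addr = "" if j == len(doc_ids) - 1 else str(start_addr + j + 1)
        node = "\n".join([doc, next_key.decode(), next_addr])
        # Node j is encrypted under the key stored in node j-1
        storage[start_addr + j] = Fernet(prev_key).encrypt(node.encode())
        prev_key = next_key
    return storage, first_key, start_addr


def walk_chain(storage, key, addr):
    """Decrypt node after node, collecting document IDs until the tail."""
    docs = []
    while True:
        plaintext = Fernet(key).decrypt(storage[addr]).decode()
        doc, next_key, next_addr = plaintext.split("\n")
        docs.append(doc)
        if not next_addr:
            return docs
        key, addr = next_key.encode(), int(next_addr)


storage, k0, a0 = build_chain(["1", "5", "9"])
print(walk_chain(storage, k0, a0))  # ['1', '5', '9']
```

Only someone holding the first key and first address can start the walk; every later key is revealed by decrypting the node before it.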
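Build the lookup table T, which stores for each keyword wi the location of its first node, masked by XOR with the output of a keyed function:
(1) Compute the table index with the pseudo-random permutation
random.seed(keyword + str(key_z))
index = random.randrange(0, 1000)
(2) Compute <addr(A(N(i,1))) || K_i_0>
addr = psuedo_random(key_s,ctr)
value = str(addr) + "\n" + str(key)
(3) XOR the value with f_y, the output of the pseudo-random function keyed with key_y on the keyword
cat_string = []  # empty list to begin
for m in value:
    # append the code point of each character in value
    cat_string.append(ord(m))
value = [f_y ^ x for x in cat_string]
(4) Fill every element that is still zero with a random encrypted value:
for ind,val in enumerate(T):
    if (val == 0):
        x = random.randrange(0, 10000)
        x = str(x)
        x = Fernet(key_s).encrypt(str.encode(x))
        T[ind] = x
With the above done, the encryption of the index nodes is essentially complete: the array A and the table T are built, and we can move on to the query part.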
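One caveat about the addressing: psuedo_random draws an independent value in range(10000) for each counter, so two counters can land on the same slot and one node would overwrite the other. A possible way around this, sketched below under the assumption that a seeded shuffle is acceptable for a demo, is to shuffle the slot numbers once so that every counter gets a distinct slot. The name address_permutation and the seed value are hypothetical, and the random module is not a cryptographic generator.

```python
import random


def address_permutation(seed, size):
    """Return a list p where p[ctr] is a distinct slot in range(size) for every ctr."""
    slots = list(range(size))
    random.Random(seed).shuffle(slots)  # private generator, global state untouched
    return slots


perm = address_permutation(seed=2024, size=10000)
# perm[ctr] plays the role of psuedo_random(key_s, ctr), without collisions
print(len(set(perm)) == len(perm))  # True
```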
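Generating the trapdoor
To keep the server from learning what we are querying, the search keyword has to be protected as well; the value sent in place of the keyword is called the trapdoor. The trapdoor is the pair (index, f_y):
(1) Return the outputs of the permutation and of the pseudo-random function on the keyword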
random.seed(keyword + str(key_z))
index = random.randrange(0, 1000)
random.seed(keyword + str(key_y))
f_y = random.randrange(0,1000)
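The snippet above seeds Python's random module, whose generator is not designed for cryptographic use. As an alternative sketch (a swapped-in technique, not the post's method), HMAC-SHA256 from the standard library can serve as the keyed pseudo-random function. The names prf_int and make_trapdoor and the sample keys are hypothetical, and reducing the digest modulo the table size still allows two keywords to share an index.

```python
import hashlib
import hmac


def prf_int(key, keyword, modulus):
    """Keyed pseudo-random function: HMAC-SHA256 reduced to an integer below modulus."""
    digest = hmac.new(key, keyword.encode(), hashlib.sha256).digest()
    return int.from_bytes(digest, "big") % modulus


def make_trapdoor(keyword, key_z, key_y, table_size=1000):
    """Return (index into T, mask f_y) for the keyword."""
    return prf_int(key_z, keyword, table_size), prf_int(key_y, keyword, 1000)


# Sample keys for illustration only
trapdoor = make_trapdoor("118", b"sample-key-z", b"sample-key-y")
index, f_y = trapdoor
```

The same keyword and keys always give the same trapdoor, which is what lets the table built at indexing time be found again at query time.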
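Search
At this point we have T, A and the trapdoor. Given a trapdoor, the server looks up the address of the first node in T and then follows the chain through A node by node to complete the search. Finally the server returns the document identifiers it found to the data owner. The property used here is that a XOR b XOR b equals a, which is the key to unmasking the table entry.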
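A minimal sketch of this XOR round trip, with a made-up table value and mask (xor_mask and xor_unmask are illustrative names):

```python
def xor_mask(text, f_y):
    """Mask each character's code point with f_y."""
    return [f_y ^ ord(ch) for ch in text]


def xor_unmask(values, f_y):
    """Undo the mask: (x ^ f_y) ^ f_y == x."""
    return "".join(chr(f_y ^ v) for v in values)


value = "4821\nsample-key"  # made-up "address, newline, key" entry
masked = xor_mask(value, 777)
print(xor_unmask(masked, 777) == value)  # True
```

Unmasking with a different f_y yields a different string, so only a trapdoor generated from the right keyword recovers the address and key.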
(1) XOR the stored values with f_y to recover the list of characters that holds the node's address and key
addr_and_key = [chr(f_y ^ x) for x in value]
mystring = ''
for x in addr_and_key:
mystring = mystring + x
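Here mystring holds the address of the first node and its key, separated by a newline.
With these we can decrypt the node, extract the doc-id, and append it to the results.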
split_node = re.split(r"\\n", d_n)  # d_n comes from str(bytes), so the newline appears as a backslash followed by n
doc_id = split_node[0]
key = split_node[1]
addr = split_node[2]
list_of_docs.append(doc_id)
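Slicing the text form of a bytes object is easy to get wrong. As an alternative sketch, the node can be packed and unpacked with explicit decode and encode calls, so that real newlines stay real newlines. The names pack_node and unpack_node are hypothetical, and the sample key is 32 zero bytes encoded in the same URL-safe base64 form that Fernet keys use.

```python
import base64


def pack_node(doc_id, next_key, next_addr):
    """Serialize (doc ID, next key, next address) as newline-separated ASCII bytes."""
    addr = "" if next_addr is None else str(next_addr)
    return "\n".join([doc_id, next_key.decode("ascii"), addr]).encode("ascii")


def unpack_node(plaintext):
    """Inverse of pack_node; an empty address field means the tail of the list."""
    doc_id, key, addr = plaintext.decode("ascii").split("\n")
    return doc_id, key.encode("ascii"), (int(addr) if addr else None)


sample_key = base64.urlsafe_b64encode(bytes(32))  # placeholder key material
node = pack_node("118", sample_key, 42)
print(unpack_node(node) == ("118", sample_key, 42))  # True
```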
Results:
With 123456 as the password, the run looks like this:
Once A and T are built we can search; we use 118 as the example keyword:
Source code:
import itertools
from itertools import permutations, combinations #used for permutations
from cryptography.fernet import Fernet #used for the symmetric key generation
from collections import Counter #used to count most common word
from collections import defaultdict # used to make the the distinct word list
from llist import dllist, dllistnode # python linked list library
import base64 #used for base 64 encoding
import os
from cryptography.hazmat.backends import default_backend #used in making key from password
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.kdf.pbkdf2 import PBKDF2HMAC
import random #to select random key
import sys
import re
import bitarray #for lookup table
def main():
    print("Welcome to Searchable Symmetric Encryption.\n\n")
    reply = input("Do you already have an encrypted data set? (Y)es or (N)o: ")
    while(True):
        #if yes then you have already generated keys and just want to search
        if(reply.lower() == 'y' or reply.lower() == "yes"):
            password = input("Please enter the password: ")
            break
        #if no then need to generate symmetric keys
        elif (reply.lower() == 'n' or reply.lower() == "no"):
            password = None
            while(True):
                password1 = input("Please choose a password: ")
                password2 = input("Please re-enter the password: ")
                if(password1 == password2):
                    password = password1
                    print("")
                    break
                print("Passwords not the same try again\n")
            # the real work starts here
            key_s, key_y, key_z = keygen(password)
            word_dict = intialization2()  # if you want to modify the input, this is the place
            # A is the array; keyword_key_pair lists [keyword, first-node key, counter] for each keyword
            A, keyword_key_pair = build_array(word_dict, key_s, key_y, key_z)
            T = look_up_table(keyword_key_pair, key_s, key_y, key_z)
            print("\n\nWelcome!")
            keyword = input("\nPlease enter the keyword to search, or 'exit' to exit: ")
            while keyword != 'exit':
                trapdoor = Trapdoor(keyword, key_z, key_y)  # trapdoor
                list_of_docs = Search(T, A, trapdoor)  # search
                print(f"\nSearch Results for \"{keyword}\":\n")
                for i in list_of_docs:
                    print(i)
                keyword = input("\nPlease enter the keyword to search, or 'exit' to exit: ")
            print("\n\nGoodbye!\n")
            break
        #this just makes sure user enters yes or no or y or n
        else:
            reply = input("\nInput Y for yes or N for no: ")
############################################################################################
def intialization():
    '''
    Prompts user for documents to be encrypted and generates the distinct
    words in each. Returns the distinct words and the documents that contained them
    in a dictionary 'word_dict'
    '''
    filenames = []
    x = input("Please enter the name of a file you want to encrypt: ")  # filename
    filenames.append(x)
    while(True):
        x = input("\nEnter another file name or press enter if done: ")
        if not x:
            break
        filenames.append(x)
    # finds the occurrence of each word in a file
    filedata = []
    for idx, val in enumerate(filenames):
        cnt = Counter()
        for line in open(filenames[idx], 'r'):
            word_list = line.replace(',','').replace('\'','').replace('.','').lower().split()
            for word in word_list:
                cnt[word]+=1
        filedata.append((val,cnt))
    #takes the 5 most common from each document as the distinct words
    allwords = []
    for idx, val in enumerate(filedata):
        for value, count in val[1].most_common(5):
            if value not in allwords:
                allwords.append(value)
    #makes a dictionary with the distinct word as index and a list of filenames as value
    word_dict = defaultdict(list)
    for i in allwords:
        word_dict[i]
        for idx, val in enumerate(filedata):
            if i in val[1]:
                word_dict[i].append(val[0])
        word_dict[i].sort()
    return word_dict
############################################################################################
def keygen(u_password):
    ''' Generates 3 keys, key s,y,z, based on the given password from the user. '''
    # This is input in the form of a string
    password_provided = u_password
    # Convert to type bytes
    password = password_provided.encode()
    salt_s = b'\x91\xabr\xebx\xc5\x9dx^b_7\xb6\x8a\xbb5'
    salt_y = b'\x1cy8\r\x7f\xf8,\xe2Pu!/\x043\xdc\x0e'
    salt_z = b'\x9b\xd0\xb6\x85!J\xde\xe5\xc8\xb3\xc9\xa2\tqPy'
    kdf_s = PBKDF2HMAC(
        algorithm=hashes.SHA256(),
        length=32,
        salt=salt_s,
        iterations=100000,
        backend=default_backend()
    )
    key_s = base64.urlsafe_b64encode(kdf_s.derive(password))
    kdf_y = PBKDF2HMAC(
        algorithm=hashes.SHA256(),
        length=32,
        salt=salt_y,
        iterations=100000,
        backend=default_backend()
    )
    key_y = base64.urlsafe_b64encode(kdf_y.derive(password))
    kdf_z = PBKDF2HMAC(
        algorithm=hashes.SHA256(),
        length=32,
        salt=salt_z,
        iterations=100000,
        backend=default_backend()
    )
    key_z = base64.urlsafe_b64encode(kdf_z.derive(password))
    #returns three base_64 encoded keys
    return key_s, key_y, key_z
############################################################################################
def psuedo_random(key_s, ctr):
    ''' A pseudorandom function based on key s, to return a value to index array A '''
    #Convert key s to decimal value
    decimal_key = int.from_bytes(key_s, byteorder=sys.byteorder)
    combined = decimal_key + ctr
    #Find a random value based on key s and counter
    random.seed(combined)
    index = random.randrange(0, 10000)
    return index
############################################################################################
def build_array(word_dict, key_s, key_y, key_z):
    '''
    Creates an array of nodes, each containing the document id, key to encrypt the
    next node, and the address of the next node
    '''
    A = [0] * 10000
    ctr = 1
    keyword_key_pair = []
    #for each word in set of distinct words, word_dict in this case
    for i, doc_list in word_dict.items():
        #Generate a key for the first node
        K_i_0 = Fernet.generate_key()
        keyword_key_pair.append([i, K_i_0, ctr])
        #initialize the previous key to the first one created
        K_i_jminus1 = K_i_0
        # for 1 <= j <= |D(wi)|:
        # for each document which has distinct word wi....iterate through doc_list
        for j, doc in enumerate(doc_list):
            #again generate key K(i,j)
            K_i_j = Fernet.generate_key()
            #N(i,j) = id(D(i,j)) || K(i,j) || v_s(ctr+1), where id(D(i,j)) is the jth identifier in D(wi)
            curr_addr = psuedo_random(key_s, ctr)
            if j == len(doc_list) - 1:
                next_addr = None
            else:
                next_addr = psuedo_random(key_s, ctr+1)  # address of the next node
            N = doc + "\n" + str(K_i_j) + "\n" + str(next_addr)
            #newline is a delimiter to separate three components of the encrypted string
            #N = doc + K_i_j + address of next node.
            #encrypt N with Ki,j-1, ie the previous key
            N = Fernet(K_i_jminus1).encrypt(str.encode(N))
            #update and save K at i,j-1
            K_i_jminus1 = K_i_j
            #store the encrypted N in the array
            A[curr_addr] = N
            #update counter
            ctr = ctr + 1
    # Filling in the rest of the array with random encrypted data
    for ind,val in enumerate(A):
        if (val == 0):
            x = random.randrange(0, 10000)
            x = str(x)
            x = Fernet(key_s).encrypt(str.encode(x))
            A[ind] = x
    return A, keyword_key_pair
############################################################################################
def look_up_table(keyword_key_pair,key_s, key_y, key_z):
    '''
    Generates a table which stores the XORed result of permutation
    function f_y and the address of a node concatenated with the key
    '''
    T = [0] * 1000
    for i in keyword_key_pair:
        keyword = i[0]
        key = i[1]
        ctr = i[2]
        # pseudorandom permutation on z
        random.seed(keyword + str(key_z))
        index = random.randrange(0, 1000)
        #computes value <addr(A(Ni,1)||K_i_0)>
        addr = psuedo_random(key_s,ctr)
        value = str(addr) + "\n" + str(key)
        #computed 'f_y(w_i)'
        random.seed(keyword + str(key_y))
        f_y = random.randrange(0,1000)
        #XOR value with f_y
        cat_string = []  #empty list to begin
        for m in value:
            #append the code point of each character in value
            cat_string.append(ord(m))
        value = [f_y ^ x for x in cat_string]
        T[index] = value
    #set all elements equal to zero as some random key value
    for ind,val in enumerate(T):
        if (val == 0):
            x = random.randrange(0, 10000)
            x = str(x)
            x = Fernet(key_s).encrypt(str.encode(x))
            T[ind] = x
    return T
def Trapdoor(keyword, key_z, key_y):
    '''
    returns the permutation function
    and pseudorandom permutation function on keyword
    '''
    random.seed(keyword + str(key_z))
    index = random.randrange(0, 1000)
    #the pseudo-random function 'f_y(w)'
    random.seed(keyword + str(key_y))
    f_y = random.randrange(0,1000)
    return (index, f_y)
def Search(T, A, trapdoor):
    '''
    Indexes both T and A with trapdoor values generated by keyword
    in main to find and decrypt the document ids
    '''
    list_of_docs = []
    value = T[trapdoor[0]]
    f_y = trapdoor[1]
    #XORs the ascii value with f_y to obtain list version of string
    # containing the address and the key for the node
    addr_and_key = [chr(f_y ^ x) for x in value]
    #converts the list into one string
    mystring = ''
    for x in addr_and_key:
        mystring = mystring + x
    #addr_node is a list.
    addr_node = re.split(r"\n", str(mystring))
    #if addr_node isn't two separate items, we didn't find a document
    if len(addr_node) == 1:
        print("\n")
    else:
        addr = addr_node[0]
        key = addr_node[1]
        #remove b' at the beginning and ' at the end
        key = key[2:-1]
        #get the node based on the address from array A
        node = A[int(addr)]
        #turn key back into bytes and use Fernet function to
        # decrypt back to plaintext
        decrypted_node = Fernet(str.encode(key)).decrypt(node)
        #remove b' at the beginning and ' at the end
        d_n = str(decrypted_node)[2:-1]
        split_node = re.split(r"\\n", d_n)
        doc_id = split_node[0]
        key = split_node[1]
        addr = split_node[2]
        list_of_docs.append(doc_id)
        #Repeat iterating while the address is not null, meaning
        # there are still documents with the keyword
        while addr != 'None':
            key = key[2:-1]
            key = str.encode(key)
            node = A[int(addr)]
            decrypted_node = Fernet(key).decrypt(node)
            d_n = str(decrypted_node)[2:-1]
            split_node = re.split(r"\\n", d_n)
            doc_id = split_node[0]
            key = split_node[1]
            addr = split_node[2]
            list_of_docs.append(doc_id)
    return list_of_docs
def intialization2():
    filenames = ["data2.txt"]
    # finds the occurrence of each word in a file
    filedata = []
    docnum = 0
    linenum = 0
    for idx, val in enumerate(filenames):  # val is the name of the file
        cnt = Counter()
        for line in open(filenames[idx], 'r'):  # read line by line; each line holds three numbers
            a = 1
            linenum += 1
            word_list = line.replace(',', '').replace('\'', '').replace('.', '').lower().split()
            for data in word_list:
                if a == 1:  # the first field is the document ID
                    if linenum != 1:
                        if str(doc) != str(data):
                            # a new document starts: save the finished one
                            filedata.append((doc, cnt))
                            del cnt
                            cnt = Counter()
                            docnum += 1
                    doc = data
                    a += 1
                    continue
                if a == 2:  # the second field is the word ID
                    term = data
                    a += 1
                    continue
                if a == 3:  # the third field is the frequency
                    fre = data
                    cnt[term] = int(fre)  # store as int so most_common() ranks numerically
        # print("first line"+data)
        filedata.append((doc, cnt))  # save the last document
    # print(filedata)
    allwords = []
    for idx, val in enumerate(filedata):
        for value, count in val[1].most_common(2):
            if value not in allwords:
                allwords.append(value)
    # print(allwords)
    # makes a dictionary with the distinct word as index and a list of document IDs as value
    word_dict = defaultdict(list)
    for i in allwords:
        word_dict[i]
        for idx, val in enumerate(filedata):
            if i in val[1]:  # val[1] is the word counter of the document
                word_dict[i].append(val[0])  # val[0] is the document ID
        word_dict[i].sort()
    # print(word_dict)
    return word_dict


if __name__ == '__main__':
    main()