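Users can be tagged with labels such as age, gender, and education level, based on their search keyword data.
The overall workflow is as follows: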
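(1) Data preprocessing
- Encoding conversion
- Word segmentation of the search queries
- Part-of-speech filtering
- Data checking
(2) Feature selection
- Build a word2vec word-vector model
- Compute the average vector over all search data
(3) Modeling and prediction
- Compare different machine learning models
- Stacked models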
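First, convert the raw data to UTF-8 so that encoding problems do not surface in later steps.
The code below works on the first 10,000 rows (the "1w" subset) of the data.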
import csv
import itertools

# Path to the raw data
data_path = './data/user_tag_query.10W.TRAIN'

# The raw file is GB18030-encoded; bytes that cannot be decoded are dropped on read.
# The output is written explicitly as UTF-8 (newline='' is what the csv module expects).
with open(data_path, 'r', encoding='gb18030', errors='ignore') as f, \
     open(data_path + '-1w.csv', 'w', encoding='utf-8', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['ID', 'age', 'Gender', 'Education', 'QueryList'])
    # Only the first 10,000 lines are converted
    for line in itertools.islice(f, 10000):
        data = line.rstrip('\r\n').split('\t')
        if len(data) < 4:
            # Skip malformed lines that lack the ID and the three labels
            continue
        # Fields: ID, age, Gender, Education, then the search queries
        writedata = data[:4]
        # Text read in text mode is already Unicode, so the queries can be joined directly
        writedata.append('\t'.join(data[4:]))
        writer.writerow(writedata)
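To make the effect of errors='ignore' in the conversion above concrete, here is a minimal in-memory sketch. The sample record and the stray 0xFF byte are made up for illustration and do not come from the dataset.

```python
# Hypothetical sample record (not from the dataset): ID, three labels, two queries
record = 'id001\t1\t1\t3\t天氣預報\t火車票'
raw = record.encode('gb18030')

# Simulate a damaged file by appending a byte that is not valid GB18030
corrupted = raw + b'\xff'

# With errors='ignore' the invalid byte is dropped instead of raising UnicodeDecodeError
decoded = corrupted.decode('gb18030', errors='ignore')

# Once the text is decoded, writing it back out as UTF-8 is straightforward
utf8_bytes = decoded.encode('utf-8')
fields = decoded.split('\t')
```

The trade-off is that damaged bytes vanish silently, so a query containing them may come out slightly shorter than it was in the raw file.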
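The test set is converted in the same way. The only difference is that it has no label columns, so every field after the ID is a search query.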
import csv
import itertools

# Path to the raw test data
data_path = './data/user_tag_query.10W.TEST'

# Read as GB18030 (dropping undecodable bytes) and write as UTF-8
with open(data_path, 'r', encoding='gb18030', errors='ignore') as f, \
     open(data_path + '-1w.csv', 'w', encoding='utf-8', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['ID', 'QueryList'])
    # Only the first 10,000 lines are converted
    for line in itertools.islice(f, 10000):
        data = line.rstrip('\r\n').split('\t')
        # Fields: ID, then the search queries
        writedata = [data[0]]
        writedata.append('\t'.join(data[1:]))
        writer.writerow(writedata)
Note that the test set and the training set should be processed in exactly the same way.
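One way to keep the two conversions identical is to route both files through a single function. The sketch below is only an illustration of that idea: the function name convert_to_utf8_csv and the parameters n_meta and limit are hypothetical and do not appear in the original code.

```python
import csv
import itertools


def convert_to_utf8_csv(src_path, dst_path, header, n_meta, limit=10000):
    """Convert a GB18030 tab-separated file to a UTF-8 CSV (hypothetical helper).

    n_meta is the number of leading non-query columns:
    4 for the training set (ID plus three labels), 1 for the test set (ID only).
    Returns the number of data rows written.
    """
    written = 0
    with open(src_path, 'r', encoding='gb18030', errors='ignore') as src, \
         open(dst_path, 'w', encoding='utf-8', newline='') as dst:
        writer = csv.writer(dst)
        writer.writerow(header)
        for line in itertools.islice(src, limit):
            data = line.rstrip('\r\n').split('\t')
            if len(data) < n_meta:
                continue  # skip malformed lines with missing leading columns
            # Keep the leading columns, then merge all queries into one tab-joined field
            writer.writerow(data[:n_meta] + ['\t'.join(data[n_meta:])])
            written += 1
    return written
```

With this helper, the training set would be converted by passing the header ['ID', 'age', 'Gender', 'Education', 'QueryList'] with n_meta=4, and the test set by passing ['ID', 'QueryList'] with n_meta=1, so any later change to the cleaning logic applies to both files at once.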