Competition link: https://www.kesci.com/home/competition/5cc51043f71088002c5b8840
Official task: text click-through rate prediction (competition opens May 26)
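An important task in search is predicting the click-through rate of a doc under a given query, based on the query and the title. In this competition, teams predict the click-through rate of specified docs from desensitized data. Results are scored and ranked on online evaluation data using the specified metric, and the best score wins.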
Straight to the code (parts of it are adapted from posts shared in the discussion forum).
# Preprocess the dataset into the format fastText expects
import csv
line_count = 0
with open('/home/kesci/work/labeled_content', 'w') as f:
    with open('/home/kesci/input/bytedance/first-round/train.csv') as csv_file:
        csv_reader = csv.reader(csv_file, delimiter=',')
        for row in csv_reader:
            # Columns used: row[1] = query, row[3] = title, row[4] = label
            query = row[1]
            title = row[3]
            label = row[4]
            f.write("__label__{0} {1} {2}\n".format(label, query, title))
            line_count += 1  # count rows so the summary below has a value to print
print(f'Processed {line_count} lines.')
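To make the target format concrete, here is a minimal sketch of the line layout that fastText's supervised mode reads: a `__label__` prefix, then the text. The helper name `to_fasttext_line` and the sample token IDs are hypothetical, and replacing embedded newlines is an added precaution that the code above does not take.

```python
def to_fasttext_line(label, query, title):
    """Build one supervised-training line: '__label__<label> <query> <title>'."""
    # fastText treats every line as one example, so embedded newlines are
    # replaced with spaces (added precaution; hypothetical helper).
    text = "{0} {1}".format(query, title).replace("\n", " ").strip()
    return "__label__{0} {1}".format(label, text)


# Hypothetical desensitized token IDs, for illustration only.
print(to_fasttext_line("1", "12 7 345", "7 345 98 12"))
# prints: __label__1 12 7 345 7 345 98 12
```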
# Split into training and validation sets
!head -n 90000 labeled_content > train.txt
!tail -n 10000 labeled_content > valid.txt
# Shuffle the training set
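The post does not show the shuffle step itself, although the training call reads `/home/kesci/work/shuffled.csv`. Here is a minimal sketch, assuming that file is simply `train.txt` with its lines in random order. The function name `shuffle_lines` and the fixed seed are assumptions for illustration. It loads the whole file into memory, which is fine for 90,000 lines.

```python
import random


def shuffle_lines(src_path, dst_path, seed=42):
    """Write the lines of src_path to dst_path in random order; return the line count."""
    with open(src_path, encoding="utf-8") as src:
        lines = src.readlines()
    # Make sure the last line ends with a newline so lines cannot merge after shuffling.
    lines = [ln if ln.endswith("\n") else ln + "\n" for ln in lines]
    random.Random(seed).shuffle(lines)  # seeded so the order is reproducible
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.writelines(lines)
    return len(lines)


# Assumed paths, matching the file names used elsewhere in this post:
# shuffle_lines("/home/kesci/work/train.txt", "/home/kesci/work/shuffled.csv")
```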
# Train and check the results
# Note: newer fastText releases name the Python module `fasttext` (lowercase)
from fastText import train_supervised
from fastText import load_model
classifier = train_supervised(input='/home/kesci/work/shuffled.csv', loss='hs', wordNgrams=5,
                              bucket=5500000, lr=0.5)
def print_results(N, p, r):
    # N = number of examples, p = precision at 1, r = recall at 1
    print("N\t" + str(N))
    print("P@{}\t{:.3f}".format(1, p))
    print("R@{}\t{:.3f}".format(1, r))
print_results(*classifier.test("/home/kesci/work/valid.txt"))
classifier.save_model("/home/kesci/work/model.bin")
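A note on reading the validation numbers: with one label per example and a single prediction per example, P@1 and R@1 both reduce to plain accuracy. The sketch below reproduces that calculation without fastText. `precision_recall_at_1` is a hypothetical helper, not part of the library, and the labels are made up.

```python
def precision_recall_at_1(true_labels, predicted_labels):
    """Return (N, P@1, R@1) for single-label examples with one prediction each."""
    n = len(true_labels)
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    # One prediction and one true label per example, so both denominators equal n.
    precision = correct / n
    recall = correct / n
    return n, precision, recall


# Made-up labels for illustration.
truth = ["__label__1", "__label__0", "__label__0", "__label__1"]
preds = ["__label__1", "__label__0", "__label__1", "__label__1"]
print(precision_recall_at_1(truth, preds))  # (4, 0.75, 0.75)
```

Since the competition ranks submissions by its own metric, these numbers are best treated as a sanity check and not as an estimate of the leaderboard score.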
# Run the model on the test set and persist the results
import csv
from fastText import load_model
# Note: this file name differs from the model.bin saved during training;
# point it at the model file you actually saved.
loaded_model = load_model("/home/kesci/work/modelhswn5b55.bin")
with open('/home/kesci/work/resulthswn5b55.csv', 'w') as f:
    with open('/home/kesci/input/bytedance/first-round/test.csv') as csv_file:
        csv_reader = csv.reader(csv_file, delimiter=',')
        for row in csv_reader:
            query_id = row[0]
            query_title_id = row[2]
            # predict() returns (labels, probabilities) for the top-1 label
            prediction = loaded_model.predict(row[1] + ' ' + row[3])
            pred = prediction[1][0]
            label = prediction[0][0]
            # Always output the probability of a click (label 1)
            if label == '__label__0':
                pred = 1 - pred
            f.write("{0},{1},{2}\n".format(query_id, query_title_id, pred))
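The label flip in the prediction loop can be isolated into a small function, which makes it easy to check by hand. This is a sketch under the assumption that the problem is binary, so the probability of label 1 is one minus the probability of label 0. `positive_probability` is a hypothetical helper, and the inputs mimic the `(labels, probabilities)` pair returned by `predict`.

```python
def positive_probability(labels, probs, positive="__label__1"):
    """Convert a top-1 (labels, probs) prediction into P(positive class).

    Assumes exactly two classes, so P(positive) = 1 - P(other label).
    """
    top_label, top_prob = labels[0], float(probs[0])
    return top_prob if top_label == positive else 1.0 - top_prob


# Made-up predictions for illustration.
print(positive_probability(("__label__1",), [0.8]))  # model is 80% sure of a click
print(positive_probability(("__label__0",), [0.7]))  # 70% sure of no click, so about 0.3
```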
fastText really is handy: training produced results in about two hours. I'm sharing the code and parameters here for reference.