[Kesci] [Main Round] 2019 China Collegiate Computing Contest, Big Data Challenge (News Click-Through Rate Prediction with FastText, qAUC = 0.558)

Original post

2019-07-01 03:52

Competition link: https://www.kesci.com/home/competition/5cc51043f71088002c5b8840

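Main round task: text click-through rate prediction (opened May 26)
An important task in search is predicting the click-through rate of a doc under a query, given the query and the title. In this contest, teams predict the click-through rate of specified docs from anonymized data. Submissions are scored and ranked on an online evaluation set using the specified metric, and the best score wins.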
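The title reports the score as qAUC. As a rough illustration, and assuming qAUC means the AUC computed separately for each query and then averaged over queries, the metric could be sketched as below. The function names `auc` and `qauc` are hypothetical, and skipping queries that contain only one class is also an assumption; the official scorer may treat them differently.

```python
from collections import defaultdict

def auc(labels, scores):
    """Pairwise AUC for one query; returns None if only one class is present."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # clicked doc ranked above unclicked doc
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))

def qauc(query_ids, labels, scores):
    """Average of per-query AUC values (assumed definition of qAUC)."""
    groups = defaultdict(lambda: ([], []))
    for q, y, s in zip(query_ids, labels, scores):
        groups[q][0].append(y)
        groups[q][1].append(s)
    per_query = [auc(ys, ss) for ys, ss in groups.values()]
    per_query = [a for a in per_query if a is not None]  # assumption: skip one-class queries
    return sum(per_query) / len(per_query) if per_query else float("nan")
```

The pairwise loop is quadratic in the number of docs per query, which is acceptable for a small validation sample but would need a rank-based AUC for large groups.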

Straight to the code (parts of it are adapted from posts in the discussion forum).

# Convert the dataset into the format fastText expects
import csv
line_count = 0
with open('/home/kesci/work/labeled_content', 'w') as f:
    with open('/home/kesci/input/bytedance/first-round/train.csv') as csv_file:
        csv_reader = csv.reader(csv_file, delimiter=',')
        for row in csv_reader:
            query = row[1]
            title = row[3]
            label = row[4]
            # One example per line: "__label__<label> <query> <title>"
            f.write("__label__{0} {1} {2}\n".format(label, query, title))
            line_count += 1
        print(f'Processed {line_count} lines.')
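To make the target format concrete, here is a minimal sketch of the same conversion applied to a single row. The column order (query_id, query, query_title_id, title, label) follows the indices used in this post; the helper name `to_fasttext_line` and the sample row are made up for illustration, and the whitespace cleanup is an extra precaution that the code above does not perform.

```python
def to_fasttext_line(label, query, title):
    """Build one fastText supervised-training line from a label and two text fields."""
    # Collapse runs of whitespace (including newlines) so the example stays on one line
    text = " ".join("{} {}".format(query, title).split())
    return "__label__{} {}".format(label, text)

# Made-up row: query_id, query, query_title_id, title, label
sample_row = ["1", "12 34 56", "7", "34 56 78 90", "0"]
line = to_fasttext_line(sample_row[4], sample_row[1], sample_row[3])
print(line)
```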


# Split into training and validation sets
!head -n 90000 labeled_content > train.txt
!tail -n 10000 labeled_content > valid.txt

# Shuffle the training set (the training step reads the result from /home/kesci/work/shuffled.csv)
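The post does not show the shuffle command itself. One possible way to do it in Python, offered only as a sketch, is below. The helper name `shuffle_lines` and the fixed seed are assumptions, and the function loads the whole file into memory, so it only suits files that fit in RAM.

```python
import random

def shuffle_lines(src_path, dst_path, seed=0):
    """Write the lines of src_path to dst_path in random order; return the line count."""
    with open(src_path, encoding="utf-8") as src:
        lines = src.readlines()
    # Make sure a final line without a newline cannot merge with its new neighbour
    lines = [l if l.endswith("\n") else l + "\n" for l in lines]
    random.Random(seed).shuffle(lines)  # seeded so that reruns give the same order
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.writelines(lines)
    return len(lines)
```

Under these assumptions, a call such as `shuffle_lines("train.txt", "shuffled.csv")` in the working directory would produce the file that the training step reads.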

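# Train and check the results
# (the module is named `fastText` in this environment; newer fastText releases are imported as `fasttext`)
from fastText import train_supervised
classifier = train_supervised(
    input='/home/kesci/work/shuffled.csv',
    loss='hs',          # hierarchical softmax
    wordNgrams=5,
    bucket=5500000,
    lr=0.5,
)

def print_results(N, p, r):
    # N: number of examples, p: precision at 1, r: recall at 1
    print("N\t" + str(N))
    print("P@{}\t{:.3f}".format(1, p))
    print("R@{}\t{:.3f}".format(1, r))

print_results(*classifier.test("/home/kesci/work/valid.txt"))
# Save under the name that the prediction step loads
classifier.save_model("/home/kesci/work/modelhswn5b55.bin")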

# Predict with the model and persist the results
import csv
from fastText import load_model
loaded_model = load_model("/home/kesci/work/modelhswn5b55.bin")
with open('/home/kesci/work/resulthswn5b55.csv', 'w') as f:
    with open('/home/kesci/input/bytedance/first-round/test.csv') as csv_file:
        csv_reader = csv.reader(csv_file, delimiter=',')
        for row in csv_reader:
            query_id = row[0]
            query_title_id = row[2]
            # predict() returns (labels, probabilities); by default only the top label
            labels, probs = loaded_model.predict(row[1] + ' ' + row[3])
            pred = probs[0]
            # Turn the top-label probability into the probability of a click (label 1)
            if labels[0] == '__label__0':
                pred = 1 - pred
            f.write("{0},{1},{2}\n".format(query_id, query_title_id, pred))
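The label flip in the loop above can be isolated into a small helper, which makes it easy to check without loading a model. This is a sketch under the assumption that the task is binary with labels `__label__0` and `__label__1`, so that the two probabilities sum to one; the name `click_probability` is hypothetical.

```python
def click_probability(labels, probs):
    """Map a top-1 fastText prediction to the probability of a click (label 1)."""
    top_label, top_prob = labels[0], float(probs[0])
    if top_label == "__label__1":
        return top_prob
    if top_label == "__label__0":
        # The model is confident in "no click", so the click probability is the complement
        return 1.0 - top_prob
    raise ValueError("unexpected label: {!r}".format(top_label))

# Inputs shaped like the (labels, probabilities) pair returned by predict()
print(click_probability(("__label__1",), [0.7]))
print(click_probability(("__label__0",), [0.8]))
```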

fastText really is easy to use: training produced results in about two hours. I am sharing the code and parameters here for reference.
