Kaggle Home Depot Product Search Relevance Prediction

# Home Depot Product Search Relevance Prediction

Kaggle competition: https://www.kaggle.com/c/home-depot-product-search-relevance

Home Depot is a US home-improvement retailer. Users type a keyword into the site's search box and get back related products and services; for example, searching for "floor" returns flooring in various materials, floor-cleaning products, floor-installation services, and so on. The goal of the competition is to design a model that better matches users' search terms, so that more relevant products and services are returned.

## Imports
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor, BaggingRegressor
df_train = pd.read_csv('train.csv',encoding='ISO-8859-1')
df_test = pd.read_csv('test.csv',encoding='ISO-8859-1')
# Besides train and test, there is a separate product-description file
df_desc = pd.read_csv('product_descriptions.csv')
# Take a look at each dataset
df_train.head(3)
|   | id | product_uid | product_title | search_term | relevance |
|---|----|-------------|---------------|-------------|-----------|
| 0 | 2 | 100001 | Simpson Strong-Tie 12-Gauge Angle | angle bracket | 3.0 |
| 1 | 3 | 100001 | Simpson Strong-Tie 12-Gauge Angle | l bracket | 2.5 |
| 2 | 9 | 100002 | BEHR Premium Textured DeckOver 1-gal. #SC-141 … | deck over | 3.0 |
df_test.head(3)
|   | id | product_uid | product_title | search_term |
|---|----|-------------|---------------|-------------|
| 0 | 1 | 100001 | Simpson Strong-Tie 12-Gauge Angle | 90 degree bracket |
| 1 | 4 | 100001 | Simpson Strong-Tie 12-Gauge Angle | metal l brackets |
| 2 | 5 | 100001 | Simpson Strong-Tie 12-Gauge Angle | simpson sku able |
df_desc.head(3)
|   | product_uid | product_description |
|---|-------------|---------------------|
| 0 | 100001 | Not only do angles make joints stronger, they … |
| 1 | 100002 | BEHR Premium Textured DECKOVER is an innovativ… |
| 2 | 100003 | Classic architecture meets contemporary design… |

In train, relevance is the target we must predict on test; it ranges from 1 to 3, where 3 means most relevant and 1 least relevant. search_term is the search query, i.e., each row gives one product's relevance under a particular query. product_description holds the description text for each product_uid.

To simplify processing we concatenate train and test; since product_uid is shared with the description data, the descriptions can be merged in as well.

df_all = pd.concat((df_train, df_test), axis=0, ignore_index=True)
# Neither table's index is meaningful, so ignore_index=True; axis=0 concatenates row-wise
df_all.head(3)
|   | id | product_title | product_uid | relevance | search_term |
|---|----|---------------|-------------|-----------|-------------|
| 0 | 2 | Simpson Strong-Tie 12-Gauge Angle | 100001 | 3.0 | angle bracket |
| 1 | 3 | Simpson Strong-Tie 12-Gauge Angle | 100001 | 2.5 | l bracket |
| 2 | 9 | BEHR Premium Textured DeckOver 1-gal. #SC-141 … | 100002 | 3.0 | deck over |
df_all.shape
(240760, 5)
df_all = df_all.merge(df_desc,on='product_uid',how='left')
df_all.head(3)
|   | id | product_title | product_uid | relevance | search_term | product_description |
|---|----|---------------|-------------|-----------|-------------|----------------------|
| 0 | 2 | Simpson Strong-Tie 12-Gauge Angle | 100001 | 3.0 | angle bracket | Not only do angles make joints stronger, they … |
| 1 | 3 | Simpson Strong-Tie 12-Gauge Angle | 100001 | 2.5 | l bracket | Not only do angles make joints stronger, they … |
| 2 | 9 | BEHR Premium Textured DeckOver 1-gal. #SC-141 … | 100002 | 3.0 | deck over | BEHR Premium Textured DECKOVER is an innovativ… |
## Text preprocessing
from nltk.stem.snowball import SnowballStemmer
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
### Stemming

Because Home Depot's task is search matching, textual consistency matters: we stem every text field so that a search term has only one surface form across the corpus.
# Stopwords to remove
stop = stopwords.words('english')

# Drop tokens that contain digits
import re 
def hasnumber(input_str):
    return bool(re.search(r'\d',input_str))

# Combine the filters
def check(string):
    if string in stop:
        return False
    elif hasnumber(string):
        return False
    else:
        return True
# Clean the text
stemmer = SnowballStemmer('english')
# Stem each remaining word
def text_stemmer(s):
    return ' '.join([stemmer.stem(word) for word in s.lower().split() if check(word)])
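As a quick sanity check, applying the cleaner to one of the raw titles should reproduce the stemmed form seen in the table further below:

text_stemmer('BEHR Premium Textured DeckOver')
# 'behr premium textur deckov'  -- lowercased, stemmed, stopwords and digits dropped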
# Apply to every text column
df_all['search_term'] = df_all['search_term'].map(lambda x: text_stemmer(x))

df_all['product_title'] = df_all['product_title'].map(lambda x:text_stemmer(x))

df_all['product_description'] = df_all['product_description'].map(lambda x:text_stemmer(x))
df_all.head()
|   | id | product_title | product_uid | relevance | search_term | product_description |
|---|----|---------------|-------------|-----------|-------------|----------------------|
| 0 | 2 | simpson strong-ti angl | 100001 | 3.00 | angl bracket | angl make joint stronger, also provid consiste… |
| 1 | 3 | simpson strong-ti angl | 100001 | 2.50 | l bracket | angl make joint stronger, also provid consiste… |
| 2 | 9 | behr premium textur deckov tugboat wood concre… | 100002 | 3.00 | deck | behr premium textur deckov innov solid color c… |
| 3 | 16 | delta vero shower faucet trim kit chrome (valv… | 100005 | 2.33 | rain shower head | updat bathroom delta vero single-handl shower … |
| 4 | 17 | delta vero shower faucet trim kit chrome (valv… | 100005 | 2.67 | shower faucet | updat bathroom delta vero single-handl shower … |
### Preparing the training data
# Build the combined text (title + description) for each row
train = df_all[:df_train.shape[0]].copy()
test = df_all[df_train.shape[0]:].copy()
# note: the split point is df_train.shape[0] (test starts where train ends);
# .copy() avoids pandas' SettingWithCopyWarning when we add columns below
train['all_text'] = train['product_title'] + ' . ' + train['product_description'] + ' . '
test['all_text'] = test['product_title'] + ' . ' + test['product_description'] + ' . '
train['all_text'][0:5]
0    simpson strong-ti angl . angl make joint stron…
1    simpson strong-ti angl . angl make joint stron…
2    behr premium textur deckov tugboat wood concre…
3    delta vero shower faucet trim kit chrome (valv…
4    delta vero shower faucet trim kit chrome (valv…
Name: all_text, dtype: object

### Building the corpus

We build the corpus from all_text in train: first tokenize each document into words, then use gensim.corpora.Dictionary to associate every unique word with an integer ID. This dictionary defines the full vocabulary we will work with.
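To see what Dictionary and doc2bow produce, here is a toy example (the two token lists are invented for illustration):

from gensim.corpora.dictionary import Dictionary

toy_docs = [['angl', 'bracket'], ['angl', 'bracket', 'metal', 'bracket']]
toy_dict = Dictionary(toy_docs)
print(toy_dict.token2id)              # e.g. {'angl': 0, 'bracket': 1, 'metal': 2}
print(toy_dict.doc2bow(toy_docs[1]))  # [(0, 1), (1, 2), (2, 1)] -> (word ID, count)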
from gensim.utils import tokenize
# gensim's tokenize is used here, so test must later be tokenized the same way
from gensim.corpora.dictionary import Dictionary
dictionary = Dictionary(list(tokenize(x, errors='ignore')) for x in train['all_text'].values)
print(dictionary)
Dictionary(136703 unique tokens: ['alonehelp', 'also', 'angl', 'bent', 'coat']...)

We obtain a training vocabulary of 136,703 unique tokens. Next, every document is converted into per-word counts. Because the corpus is large, materializing the whole thing as a list would waste memory, so we write a small class whose iterator yields one document's counts at a time.
class corpus:
    def __iter__(self):
        for x in train['all_text'].values:
            yield dictionary.doc2bow(list(tokenize(x, errors='ignore')))
# dictionary.doc2bow converts a tokenized document into its (word ID, count) representation
train_corpus = corpus()
count=0
for c in train_corpus:
    print(c)
    count+=1
    if count >2:
        break
[(0, 1), (1, 1), (2, 4), (3, 1), (4, 1), (5, 1), (6, 2), (7, 1), (8, 1), (9, 1), (10, 1), … (63, 1), (64, 1)]
[(0, 1), (1, 1), (2, 4), (3, 1), (4, 1), (5, 1), (6, 2), (7, 1), (8, 1), (9, 1), (10, 1), … (63, 1), (64, 1)]
[(1, 1), (4, 3), (21, 1), (25, 1), (39, 1), (41, 2), (56, 1), (65, 1), (66, 2), (67, 1), … (157, 4), (158, 1)]

As shown above, each document becomes a list of tuples: the first element of each tuple is the word's ID in the dictionary, and the second is the number of times that word occurs in the document. (The first two documents are identical because rows 0 and 1 share the same all_text.)

### The TF-IDF model

Intuitively, the TF-IDF model maps the bag-of-words vectors into another vector space, one in which raw term frequencies are re-weighted by each word's relative rarity in the corpus:

TF(x) = (occurrences of x in the document) / (total number of words in the document)

IDF(x) = log(N / N(x)), where N is the total number of documents in the corpus and N(x) is the number of documents containing x

TF-IDF(x) = TF(x) * IDF(x)
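To make the formulas concrete, here is TF-IDF computed by hand on a toy two-document corpus (illustrative only; gensim's TfidfModel additionally normalizes the resulting vectors by default):

import math

docs = [['angl', 'bracket'], ['angl', 'metal']]
N = len(docs)

def tf_idf(x, doc):
    tf = doc.count(x) / len(doc)            # TF(x)
    n_x = sum(1 for d in docs if x in d)    # N(x): documents containing x
    return tf * math.log(N / n_x)           # TF(x) * IDF(x)

print(tf_idf('angl', docs[0]))     # 0.0 -- 'angl' is in every document, so IDF = log(1) = 0
print(tf_idf('bracket', docs[0]))  # 0.5 * log(2) ≈ 0.3466 -- rarer words get more weight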
from gensim.models.tfidfmodel import TfidfModel
tfidf_g = TfidfModel(train_corpus)
# Save the model
tfidf_g.save("./gensim_tfidf.tfidf")
Once the model is trained, we can score an arbitrary sentence:
tfidf_g[dictionary.doc2bow(list(tokenize('morning yellow flower', errors='ignore')))]
[(1056, 0.44640344500226231), (1332, 0.40452528743632266), (34490, 0.79817495332456578)]

In each returned tuple, the first element is the word's ID and the second is its TF-IDF weight.
# Wrap it in a helper
def to_tfidf(text):
    res = tfidf_g[dictionary.doc2bow(list(tokenize(text, errors='ignore')))]
    return res
### Cosine similarity

The closer the cosine is to 1, the closer the angle between the two vectors is to 0°, and hence the more similar they are.
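For reference, the quantity computed below is cos(θ) = a·b / (‖a‖ ‖b‖); a minimal numpy check on toy vectors:

import numpy as np

a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])
print(a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b)))  # 1.0 -- parallel vectors are maximally similar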
from gensim.similarities import MatrixSimilarity
def cos_sim(text1, text2):
    tf1 = to_tfidf(text1)
    tf2 = to_tfidf(text2)
    # build a one-document index over the full vocabulary, then query it with the other vector
    index = MatrixSimilarity([tf1], num_features=len(dictionary))
    sim = index[tf2]
    return float(sim[0])
We first convert both texts to TF-IDF vectors with the trained model; one vector is used to build a MatrixSimilarity index expanded to the full dictionary size, and querying it with the other vector yields their cosine similarity.
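As a sanity check, scoring row 0's query against its own title should reproduce the tfidf_cos_sim_in_title value in the feature table below:

cos_sim('angl bracket', 'simpson strong-ti angl')
# ≈ 0.287958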
train['tfidf_cos_sim_in_title'] = train.apply(lambda x: cos_sim(x['search_term'], x['product_title']), axis=1)
test['tfidf_cos_sim_in_title'] = test.apply(lambda x: cos_sim(x['search_term'], x['product_title']), axis=1)
train['tfidf_cos_sim_in_desc'] = train.apply(lambda x: cos_sim(x['search_term'], x['product_description']), axis=1)
test['tfidf_cos_sim_in_desc'] = test.apply(lambda x: cos_sim(x['search_term'], x['product_description']), axis=1)
train.head(2)
|   | id | product_title | product_uid | relevance | search_term | product_description | all_text | tfidf_cos_sim_in_title | tfidf_cos_sim_in_desc |
|---|----|---------------|-------------|-----------|-------------|----------------------|----------|------------------------|-----------------------|
| 0 | 2 | simpson strong-ti angl | 100001 | 3.0 | angl bracket | angl make joint stronger, also provid consiste… | simpson strong-ti angl . angl make joint stron… | 0.287958 | 0.188301 |
| 1 | 3 | simpson strong-ti angl | 100001 | 2.5 | l bracket | angl make joint stronger, also provid consiste… | simpson strong-ti angl . angl make joint stron… | 0.000000 | 0.000000 |
test.head(2)
|        | id | product_title | product_uid | relevance | search_term | product_description | all_text | tfidf_cos_sim_in_title | tfidf_cos_sim_in_desc |
|--------|----|---------------|-------------|-----------|-------------|----------------------|----------|------------------------|-----------------------|
| 166693 | 138080 | winix freshom model true hepa air cleaner plas… | 149579 | NaN | winix air purifi | winix freshom true hepa air cleaner plasmawav … | winix freshom model true hepa air cleaner plas… | 0.413249 | 0.146256 |
| 166694 | 138082 | ge rv outlet box amp volt ring type meter amp … | 149580 | NaN | gcfi outlet | ring-typ meter surfac mount factory-assembl fa… | ge rv outlet box amp volt ring type meter amp … | 0.530571 | 0.022509 |

We have now added two features: the TF-IDF model turns each text into a vector, and the cosine between two vectors serves as a similarity measure between the search_term and the product_title / product_description.

### The word2vec model

#### Sentence tokenization
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
# Split a document into individual sentences
tokenizer.tokenize(train['all_text'].values[0])
['simpson strong-ti angl .', 'angl make joint stronger, also provid consistent, straight corners.', 'simpson strong-ti offer wide varieti angl various size thick handl light-duti job project structur connect needed.', 'bent (skewed) match project.', 'outdoor project moistur present, use zmax zinc-coat connectors, provid extra resist corros (look "z" end model number).versatil connector various connect home repair projectsstrong angl nail screw fasten alonehelp ensur joint consist straight strongdimensions: in.', 'x in.', 'x in.mad steelgalvan extra corros resistanceinstal common nail x in.', 'strong-driv sd screw .']
# Apply to every train document
sentences = [tokenizer.tokenize(x) for x in train['all_text'].values]
type(sentences)  # a list of per-document sentence lists
list
sentences[:2]
[['simpson strong-ti angl .', 'angl make joint stronger, also provid consistent, straight corners.', 'simpson strong-ti offer wide varieti angl various size thick handl light-duti job project structur connect needed.', …], ['simpson strong-ti angl .', 'angl make joint stronger, also provid consistent, straight corners.', …]]

(rows 0 and 1 are identical because they describe the same product)

After sentence-splitting, the result is still nested: sentences holds one list of sentences per document, while what we want is one flat list of all sentences, i.e. the nested list needs to be flattened. A recipe from Stack Overflow:
sentences = [y for x in sentences for y in x]
which is equivalent to:

flattened = []
for sub in sentences:
    for val in sub:
        flattened.append(val)

but the list comprehension runs faster and avoids the repeated append calls.
len(sentences)
606641
# Split each sentence into words
words = [word_tokenize(x) for x in sentences]
#### Training the Word2Vec model
from gensim.models.word2vec import Word2Vec

# 128-dimensional vectors, context window 5, ignore words with fewer than 5 occurrences
w2c = Word2Vec(words, size=128, window=5, min_count=5, workers=4)
Every word in the corpus now has a word2vec vector:

w2c.wv['door'].shape
(128,)

Given per-word vectors, we can represent a whole sentence by the average of the vectors of its words.
vocab = w2c.wv.vocab
# Note: since gensim 1.0.0 the vocabulary lives at model.wv.vocab;
# older versions exposed it as model.vocab
type(vocab)
dict
# Turn any text into a single vector
def get_vector(text):
    # start from a zero vector of size 128
    res = np.zeros([128])
    count = 0
    for word in word_tokenize(text):
        if word in vocab:
            res += w2c.wv[word]
            count += 1
    if count == 0:
        # no in-vocabulary words: return the zero vector rather than dividing by zero
        return res
    return res / count
print(get_vector('this is a door'))
[-0.00261087 -0.56179226  0.60765644 -0.64292271 -0.56054996  0.08848376 …  0.17944744 -0.9727306 ]

### Computing similarity
from scipy import spatial
def w2c_cos_sim(text1, text2):
    try:
        w1 = get_vector(text1)
        w2 = get_vector(text2)
        # scipy defines cosine *distance* as 1 - cos, so invert it to get similarity
        sim = 1 - spatial.distance.cosine(w1, w2)
        return float(sim)
    except Exception:
        return float(0)
w2c_cos_sim('hello world','hello from the other side')
0.07032644504070107
train['w2v_cos_sim_in_title'] = train.apply(lambda x: w2c_cos_sim(x['search_term'], x['product_title']), axis=1)
train['w2v_cos_sim_in_desc'] = train.apply(lambda x: w2c_cos_sim(x['search_term'], x['product_description']), axis=1)

test['w2v_cos_sim_in_title'] = test.apply(lambda x: w2c_cos_sim(x['search_term'], x['product_title']), axis=1)
test['w2v_cos_sim_in_desc'] = test.apply(lambda x: w2c_cos_sim(x['search_term'], x['product_description']), axis=1)
train.head(2)
|   | id | product_title | product_uid | relevance | search_term | product_description | all_text | tfidf_cos_sim_in_title | tfidf_cos_sim_in_desc | w2v_cos_sim_in_title | w2v_cos_sim_in_desc |
|---|----|---------------|-------------|-----------|-------------|----------------------|----------|------------------------|-----------------------|----------------------|---------------------|
| 0 | 2 | simpson strong-ti angl | 100001 | 3.0 | angl bracket | angl make joint stronger, also provid consiste… | simpson strong-ti angl . angl make joint stron… | 0.287958 | 0.188301 | 0.531925 | 0.530175 |
| 1 | 3 | simpson strong-ti angl | 100001 | 2.5 | l bracket | angl make joint stronger, also provid consiste… | simpson strong-ti angl . angl make joint stron… | 0.000000 | 0.000000 | 0.279708 | 0.303249 |
test.head(2)
|        | id | product_title | product_uid | relevance | search_term | product_description | all_text | tfidf_cos_sim_in_title | tfidf_cos_sim_in_desc | w2v_cos_sim_in_title | w2v_cos_sim_in_desc |
|--------|----|---------------|-------------|-----------|-------------|----------------------|----------|------------------------|-----------------------|----------------------|---------------------|
| 166693 | 138080 | winix freshom model true hepa air cleaner plas… | 149579 | NaN | winix air purifi | winix freshom true hepa air cleaner plasmawav … | winix freshom model true hepa air cleaner plas… | 0.413249 | 0.146256 | 0.694583 | 0.591168 |
| 166694 | 138082 | ge rv outlet box amp volt ring type meter amp … | 149580 | NaN | gcfi outlet | ring-typ meter surfac mount factory-assembl fa… | ge rv outlet box amp volt ring type meter amp … | 0.530571 | 0.022509 | 0.676237 | 0.369881 |
# Drop the raw text columns
train = train.drop(['search_term','product_title','product_description','all_text'],axis=1)
test = test.drop(['search_term','product_title','product_description','all_text'],axis=1)
# Keep the test ids and extract the training target
ids = test['id']
y_train = train['relevance'].values

X_train = train.drop(['id','relevance'],axis=1).values
X_test = test.drop(['id','relevance'],axis=1).values

## Model

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
params = [1, 3, 5, 6, 7, 8, 9, 10]
test_scores = []
for param in params:
    clf = RandomForestRegressor(n_estimators=30, max_depth=param)
    # 10-fold CV; RMSE = sqrt of the negated neg_mean_squared_error
    test_score = np.sqrt(-cross_val_score(clf, X_train, y_train, cv=10, scoring='neg_mean_squared_error'))
    test_scores.append(np.mean(test_score))
import matplotlib.pyplot as plt
%matplotlib inline
plt.plot(params, test_scores)
plt.title("Param vs CV Error")

(Figure: cross-validated RMSE for each max_depth value)

The plot shows that max_depth = 6 works best, with an RMSE of roughly 0.49. So far we have added four features and used a random forest as the model. Next steps could include engineering new features (e.g., a simple flag for whether the search_term is contained in the title or description), trying other models such as LR, and then ensembling them.
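As a concrete next step, here is a minimal sketch of fitting the chosen configuration on the full training set and writing a submission (the competition's sample submission uses id and relevance columns; max_depth=6 follows the CV result above):

rf = RandomForestRegressor(n_estimators=30, max_depth=6)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
# clip predictions into the valid label range [1, 3] before submitting
y_pred = np.clip(y_pred, 1, 3)
pd.DataFrame({'id': ids.values, 'relevance': y_pred}).to_csv('submission.csv', index=False)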


I'm still learning; feedback is welcome!
