Kaggle Home Depot Product Search Relevance Prediction

# Home Depot Product Search Relevance Prediction

Kaggle competition: https://www.kaggle.com/c/home-depot-product-search-relevance

Home Depot is a US home-improvement retailer. Users type a keyword into the site's search box and get back related products and services; for example, searching for "floor" returns flooring in various materials, floor-cleaning products, floor-installation services, and so on. The goal of the competition is to design a model that better matches users' search terms, so that more relevant products and services are returned.

## Imports
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor, BaggingRegressor
df_train = pd.read_csv('train.csv',encoding='ISO-8859-1')
df_test = pd.read_csv('test.csv',encoding='ISO-8859-1')
# Besides train and test, there is a separate product-description file
df_desc = pd.read_csv('product_descriptions.csv')
# Take a look at each dataset
df_train.head(3)
|   | id | product_uid | product_title | search_term | relevance |
|---|----|-------------|---------------|-------------|-----------|
| 0 | 2 | 100001 | Simpson Strong-Tie 12-Gauge Angle | angle bracket | 3.0 |
| 1 | 3 | 100001 | Simpson Strong-Tie 12-Gauge Angle | l bracket | 2.5 |
| 2 | 9 | 100002 | BEHR Premium Textured DeckOver 1-gal. #SC-141 … | deck over | 3.0 |
df_test.head(3)
|   | id | product_uid | product_title | search_term |
|---|----|-------------|---------------|-------------|
| 0 | 1 | 100001 | Simpson Strong-Tie 12-Gauge Angle | 90 degree bracket |
| 1 | 4 | 100001 | Simpson Strong-Tie 12-Gauge Angle | metal l brackets |
| 2 | 5 | 100001 | Simpson Strong-Tie 12-Gauge Angle | simpson sku able |
df_desc.head(3)
|   | product_uid | product_description |
|---|-------------|---------------------|
| 0 | 100001 | Not only do angles make joints stronger, they … |
| 1 | 100002 | BEHR Premium Textured DECKOVER is an innovativ… |
| 2 | 100003 | Classic architecture meets contemporary design… |

In train, relevance is the target we must predict on test; it ranges from 1 to 3, where 3 means most relevant and 1 least relevant. search_term is the search query, i.e., each row gives one product's relevance under a particular query. product_description holds the description text for each product_uid.

To simplify processing we concatenate train and test; since product_uid is shared with the description data, the descriptions can be merged in as well.

df_all = pd.concat((df_train, df_test), axis=0, ignore_index=True)
# Neither table's index is meaningful, so ignore_index=True; axis=0 concatenates row-wise
df_all.head(3)
|   | id | product_title | product_uid | relevance | search_term |
|---|----|---------------|-------------|-----------|-------------|
| 0 | 2 | Simpson Strong-Tie 12-Gauge Angle | 100001 | 3.0 | angle bracket |
| 1 | 3 | Simpson Strong-Tie 12-Gauge Angle | 100001 | 2.5 | l bracket |
| 2 | 9 | BEHR Premium Textured DeckOver 1-gal. #SC-141 … | 100002 | 3.0 | deck over |
df_all.shape
(240760, 5)
df_all = df_all.merge(df_desc,on='product_uid',how='left')
df_all.head(3)
|   | id | product_title | product_uid | relevance | search_term | product_description |
|---|----|---------------|-------------|-----------|-------------|----------------------|
| 0 | 2 | Simpson Strong-Tie 12-Gauge Angle | 100001 | 3.0 | angle bracket | Not only do angles make joints stronger, they … |
| 1 | 3 | Simpson Strong-Tie 12-Gauge Angle | 100001 | 2.5 | l bracket | Not only do angles make joints stronger, they … |
| 2 | 9 | BEHR Premium Textured DeckOver 1-gal. #SC-141 … | 100002 | 3.0 | deck over | BEHR Premium Textured DECKOVER is an innovativ… |
## Text preprocessing
from nltk.stem.snowball import SnowballStemmer
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
### Stemming

Because Home Depot's task is search matching, textual consistency matters: we stem every text field so that a search term has only one surface form across the corpus.
# Stopwords to remove
stop = stopwords.words('english')

# Drop tokens that contain digits
import re 
def hasnumber(input_str):
    return bool(re.search(r'\d',input_str))

# Combine the filters
def check(string):
    if string in stop:
        return False
    elif hasnumber(string):
        return False
    else:
        return True
# Clean the text
stemmer = SnowballStemmer('english')
# Stem each remaining word
def text_stemmer(s):
    return ' '.join([stemmer.stem(word) for word in s.lower().split() if check(word)])
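As a quick sanity check, applying the cleaner to one of the raw titles should reproduce the stemmed form seen in the table further below:

text_stemmer('BEHR Premium Textured DeckOver')
# 'behr premium textur deckov'  -- lowercased, stemmed, stopwords and digits dropped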
# Apply to every text column
df_all['search_term'] = df_all['search_term'].map(lambda x: text_stemmer(x))

df_all['product_title'] = df_all['product_title'].map(lambda x:text_stemmer(x))

df_all['product_description'] = df_all['product_description'].map(lambda x:text_stemmer(x))
df_all.head()
|   | id | product_title | product_uid | relevance | search_term | product_description |
|---|----|---------------|-------------|-----------|-------------|----------------------|
| 0 | 2 | simpson strong-ti angl | 100001 | 3.00 | angl bracket | angl make joint stronger, also provid consiste… |
| 1 | 3 | simpson strong-ti angl | 100001 | 2.50 | l bracket | angl make joint stronger, also provid consiste… |
| 2 | 9 | behr premium textur deckov tugboat wood concre… | 100002 | 3.00 | deck | behr premium textur deckov innov solid color c… |
| 3 | 16 | delta vero shower faucet trim kit chrome (valv… | 100005 | 2.33 | rain shower head | updat bathroom delta vero single-handl shower … |
| 4 | 17 | delta vero shower faucet trim kit chrome (valv… | 100005 | 2.67 | shower faucet | updat bathroom delta vero single-handl shower … |
### Preparing the training data
# Build the combined text (title + description) for each row
train = df_all[:df_train.shape[0]].copy()
test = df_all[df_train.shape[0]:].copy()
# note: the split point is df_train.shape[0] (test starts where train ends);
# .copy() avoids pandas' SettingWithCopyWarning when we add columns below
train['all_text'] = train['product_title'] + ' . ' + train['product_description'] + ' . '
test['all_text'] = test['product_title'] + ' . ' + test['product_description'] + ' . '
train['all_text'][0:5]
0    simpson strong-ti angl . angl make joint stron…
1    simpson strong-ti angl . angl make joint stron…
2    behr premium textur deckov tugboat wood concre…
3    delta vero shower faucet trim kit chrome (valv…
4    delta vero shower faucet trim kit chrome (valv…
Name: all_text, dtype: object

### Building the corpus

We build the corpus from all_text in train: first tokenize each document into words, then use gensim.corpora.Dictionary to associate every unique word with an integer ID. This dictionary defines the full vocabulary we will work with.
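To see what Dictionary and doc2bow produce, here is a toy example (the two token lists are invented for illustration):

from gensim.corpora.dictionary import Dictionary

toy_docs = [['angl', 'bracket'], ['angl', 'bracket', 'metal', 'bracket']]
toy_dict = Dictionary(toy_docs)
print(toy_dict.token2id)              # e.g. {'angl': 0, 'bracket': 1, 'metal': 2}
print(toy_dict.doc2bow(toy_docs[1]))  # [(0, 1), (1, 2), (2, 1)] -> (word ID, count)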
from gensim.utils import tokenize
# gensim's tokenize is used here, so test must later be tokenized the same way
from gensim.corpora.dictionary import Dictionary
dictionary = Dictionary(list(tokenize(x, errors='ignore')) for x in train['all_text'].values)
print(dictionary)
Dictionary(136703 unique tokens: ['alonehelp', 'also', 'angl', 'bent', 'coat']...)

We obtain a training vocabulary of 136,703 unique tokens. Next, every document is converted into per-word counts. Because the corpus is large, materializing the whole thing as a list would waste memory, so we write a small class whose iterator yields one document's counts at a time.
class corpus:
    def __iter__(self):
        for x in train['all_text'].values:
            yield dictionary.doc2bow(list(tokenize(x, errors='ignore')))
# dictionary.doc2bow converts a tokenized document into its (word ID, count) representation
train_corpus = corpus()
count=0
for c in train_corpus:
    print(c)
    count+=1
    if count >2:
        break
[(0, 1), (1, 1), (2, 4), (3, 1), (4, 1), (5, 1), (6, 2), (7, 1), (8, 1), (9, 1), (10, 1), … (63, 1), (64, 1)]
[(0, 1), (1, 1), (2, 4), (3, 1), (4, 1), (5, 1), (6, 2), (7, 1), (8, 1), (9, 1), (10, 1), … (63, 1), (64, 1)]
[(1, 1), (4, 3), (21, 1), (25, 1), (39, 1), (41, 2), (56, 1), (65, 1), (66, 2), (67, 1), … (157, 4), (158, 1)]

As shown above, each document becomes a list of tuples: the first element of each tuple is the word's ID in the dictionary, and the second is the number of times that word occurs in the document. (The first two documents are identical because rows 0 and 1 share the same all_text.)

### The TF-IDF model

Intuitively, the TF-IDF model maps the bag-of-words vectors into another vector space, one in which raw term frequencies are re-weighted by each word's relative rarity in the corpus:

TF(x) = (occurrences of x in the document) / (total number of words in the document)

IDF(x) = log(N / N(x)), where N is the total number of documents in the corpus and N(x) is the number of documents containing x

TF-IDF(x) = TF(x) * IDF(x)
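To make the formulas concrete, here is TF-IDF computed by hand on a toy two-document corpus (illustrative only; gensim's TfidfModel additionally normalizes the resulting vectors by default):

import math

docs = [['angl', 'bracket'], ['angl', 'metal']]
N = len(docs)

def tf_idf(x, doc):
    tf = doc.count(x) / len(doc)            # TF(x)
    n_x = sum(1 for d in docs if x in d)    # N(x): documents containing x
    return tf * math.log(N / n_x)           # TF(x) * IDF(x)

print(tf_idf('angl', docs[0]))     # 0.0 -- 'angl' is in every document, so IDF = log(1) = 0
print(tf_idf('bracket', docs[0]))  # 0.5 * log(2) ≈ 0.3466 -- rarer words get more weight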
from gensim.models.tfidfmodel import TfidfModel
tfidf_g = TfidfModel(train_corpus)
# Save the model
tfidf_g.save("./gensim_tfidf.tfidf")
Once the model is trained, we can score an arbitrary sentence:
tfidf_g[dictionary.doc2bow(list(tokenize('morning yellow flower', errors='ignore')))]
[(1056, 0.44640344500226231), (1332, 0.40452528743632266), (34490, 0.79817495332456578)]

In each returned tuple, the first element is the word's ID and the second is its TF-IDF weight.
# Wrap it in a helper
def to_tfidf(text):
    res = tfidf_g[dictionary.doc2bow(list(tokenize(text, errors='ignore')))]
    return res
### Cosine similarity

The closer the cosine is to 1, the closer the angle between the two vectors is to 0°, and hence the more similar they are.
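For reference, the quantity computed below is cos(θ) = a·b / (‖a‖ ‖b‖); a minimal numpy check on toy vectors:

import numpy as np

a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])
print(a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b)))  # 1.0 -- parallel vectors are maximally similar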
from gensim.similarities import MatrixSimilarity
def cos_sim(text1, text2):
    tf1 = to_tfidf(text1)
    tf2 = to_tfidf(text2)
    # build a one-document index over the full vocabulary, then query it with the other vector
    index = MatrixSimilarity([tf1], num_features=len(dictionary))
    sim = index[tf2]
    return float(sim[0])
We first convert both texts to TF-IDF vectors with the trained model; one vector is used to build a MatrixSimilarity index expanded to the full dictionary size, and querying it with the other vector yields their cosine similarity.
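As a sanity check, scoring row 0's query against its own title should reproduce the tfidf_cos_sim_in_title value in the feature table below:

cos_sim('angl bracket', 'simpson strong-ti angl')
# ≈ 0.287958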
train['tfidf_cos_sim_in_title'] = train.apply(lambda x: cos_sim(x['search_term'], x['product_title']), axis=1)
test['tfidf_cos_sim_in_title'] = test.apply(lambda x: cos_sim(x['search_term'], x['product_title']), axis=1)
train['tfidf_cos_sim_in_desc'] = train.apply(lambda x: cos_sim(x['search_term'], x['product_description']), axis=1)
test['tfidf_cos_sim_in_desc'] = test.apply(lambda x: cos_sim(x['search_term'], x['product_description']), axis=1)
train.head(2)
|   | id | product_title | product_uid | relevance | search_term | product_description | all_text | tfidf_cos_sim_in_title | tfidf_cos_sim_in_desc |
|---|----|---------------|-------------|-----------|-------------|----------------------|----------|------------------------|-----------------------|
| 0 | 2 | simpson strong-ti angl | 100001 | 3.0 | angl bracket | angl make joint stronger, also provid consiste… | simpson strong-ti angl . angl make joint stron… | 0.287958 | 0.188301 |
| 1 | 3 | simpson strong-ti angl | 100001 | 2.5 | l bracket | angl make joint stronger, also provid consiste… | simpson strong-ti angl . angl make joint stron… | 0.000000 | 0.000000 |
test.head(2)
|        | id | product_title | product_uid | relevance | search_term | product_description | all_text | tfidf_cos_sim_in_title | tfidf_cos_sim_in_desc |
|--------|----|---------------|-------------|-----------|-------------|----------------------|----------|------------------------|-----------------------|
| 166693 | 138080 | winix freshom model true hepa air cleaner plas… | 149579 | NaN | winix air purifi | winix freshom true hepa air cleaner plasmawav … | winix freshom model true hepa air cleaner plas… | 0.413249 | 0.146256 |
| 166694 | 138082 | ge rv outlet box amp volt ring type meter amp … | 149580 | NaN | gcfi outlet | ring-typ meter surfac mount factory-assembl fa… | ge rv outlet box amp volt ring type meter amp … | 0.530571 | 0.022509 |

We have now added two features: the TF-IDF model turns each text into a vector, and the cosine between two vectors serves as a similarity measure between the search_term and the product_title / product_description.

### The word2vec model

#### Sentence tokenization
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
# Split a document into individual sentences
tokenizer.tokenize(train['all_text'].values[0])
['simpson strong-ti angl .', 'angl make joint stronger, also provid consistent, straight corners.', 'simpson strong-ti offer wide varieti angl various size thick handl light-duti job project structur connect needed.', 'bent (skewed) match project.', 'outdoor project moistur present, use zmax zinc-coat connectors, provid extra resist corros (look "z" end model number).versatil connector various connect home repair projectsstrong angl nail screw fasten alonehelp ensur joint consist straight strongdimensions: in.', 'x in.', 'x in.mad steelgalvan extra corros resistanceinstal common nail x in.', 'strong-driv sd screw .']
# Apply to every train document
sentences = [tokenizer.tokenize(x) for x in train['all_text'].values]
type(sentences)  # a list of per-document sentence lists
list
sentences[:2]
[['simpson strong-ti angl .', 'angl make joint stronger, also provid consistent, straight corners.', 'simpson strong-ti offer wide varieti angl various size thick handl light-duti job project structur connect needed.', …], ['simpson strong-ti angl .', 'angl make joint stronger, also provid consistent, straight corners.', …]]

(rows 0 and 1 are identical because they describe the same product)

After sentence-splitting, the result is still nested: sentences holds one list of sentences per document, while what we want is one flat list of all sentences, i.e. the nested list needs to be flattened. A recipe from Stack Overflow:
sentences = [y for x in sentences for y in x]
which is equivalent to:

flattened = []
for sub in sentences:
    for val in sub:
        flattened.append(val)

but the list comprehension runs faster and avoids the repeated append calls.
len(sentences)
606641
# Split each sentence into words
words = [word_tokenize(x) for x in sentences]
#### Training the Word2Vec model
from gensim.models.word2vec import Word2Vec

# 128-dimensional vectors, context window 5, ignore words with fewer than 5 occurrences
w2c = Word2Vec(words, size=128, window=5, min_count=5, workers=4)
Every word in the corpus now has a word2vec vector:

w2c.wv['door'].shape
(128,)

Given per-word vectors, we can represent a whole sentence by the average of the vectors of its words.
vocab = w2c.wv.vocab
# Note: since gensim 1.0.0 the vocabulary lives at model.wv.vocab;
# older versions exposed it as model.vocab
type(vocab)
dict
# Turn any text into a single vector
def get_vector(text):
    # start from a zero vector of size 128
    res = np.zeros([128])
    count = 0
    for word in word_tokenize(text):
        if word in vocab:
            res += w2c.wv[word]
            count += 1
    if count == 0:
        # no in-vocabulary words: return the zero vector rather than dividing by zero
        return res
    return res / count
print(get_vector('this is a door'))
[-0.00261087 -0.56179226  0.60765644 -0.64292271 -0.56054996  0.08848376 …  0.17944744 -0.9727306 ]

### Computing similarity
from scipy import spatial
def w2c_cos_sim(text1, text2):
    try:
        w1 = get_vector(text1)
        w2 = get_vector(text2)
        # scipy defines cosine *distance* as 1 - cos, so invert it to get similarity
        sim = 1 - spatial.distance.cosine(w1, w2)
        return float(sim)
    except Exception:
        return float(0)
w2c_cos_sim('hello world','hello from the other side')
0.07032644504070107
train['w2v_cos_sim_in_title'] = train.apply(lambda x: w2c_cos_sim(x['search_term'], x['product_title']), axis=1)
train['w2v_cos_sim_in_desc'] = train.apply(lambda x: w2c_cos_sim(x['search_term'], x['product_description']), axis=1)

test['w2v_cos_sim_in_title'] = test.apply(lambda x: w2c_cos_sim(x['search_term'], x['product_title']), axis=1)
test['w2v_cos_sim_in_desc'] = test.apply(lambda x: w2c_cos_sim(x['search_term'], x['product_description']), axis=1)
train.head(2)
|   | id | product_title | product_uid | relevance | search_term | product_description | all_text | tfidf_cos_sim_in_title | tfidf_cos_sim_in_desc | w2v_cos_sim_in_title | w2v_cos_sim_in_desc |
|---|----|---------------|-------------|-----------|-------------|----------------------|----------|------------------------|-----------------------|----------------------|---------------------|
| 0 | 2 | simpson strong-ti angl | 100001 | 3.0 | angl bracket | angl make joint stronger, also provid consiste… | simpson strong-ti angl . angl make joint stron… | 0.287958 | 0.188301 | 0.531925 | 0.530175 |
| 1 | 3 | simpson strong-ti angl | 100001 | 2.5 | l bracket | angl make joint stronger, also provid consiste… | simpson strong-ti angl . angl make joint stron… | 0.000000 | 0.000000 | 0.279708 | 0.303249 |
test.head(2)
|        | id | product_title | product_uid | relevance | search_term | product_description | all_text | tfidf_cos_sim_in_title | tfidf_cos_sim_in_desc | w2v_cos_sim_in_title | w2v_cos_sim_in_desc |
|--------|----|---------------|-------------|-----------|-------------|----------------------|----------|------------------------|-----------------------|----------------------|---------------------|
| 166693 | 138080 | winix freshom model true hepa air cleaner plas… | 149579 | NaN | winix air purifi | winix freshom true hepa air cleaner plasmawav … | winix freshom model true hepa air cleaner plas… | 0.413249 | 0.146256 | 0.694583 | 0.591168 |
| 166694 | 138082 | ge rv outlet box amp volt ring type meter amp … | 149580 | NaN | gcfi outlet | ring-typ meter surfac mount factory-assembl fa… | ge rv outlet box amp volt ring type meter amp … | 0.530571 | 0.022509 | 0.676237 | 0.369881 |
# Drop the raw text columns
train = train.drop(['search_term','product_title','product_description','all_text'],axis=1)
test = test.drop(['search_term','product_title','product_description','all_text'],axis=1)
# Keep the test ids and extract the training target
ids = test['id']
y_train = train['relevance'].values

X_train = train.drop(['id','relevance'],axis=1).values
X_test = test.drop(['id','relevance'],axis=1).values

## Model

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
params = [1, 3, 5, 6, 7, 8, 9, 10]
test_scores = []
for param in params:
    clf = RandomForestRegressor(n_estimators=30, max_depth=param)
    # 10-fold CV; RMSE = sqrt of the negated neg_mean_squared_error
    test_score = np.sqrt(-cross_val_score(clf, X_train, y_train, cv=10, scoring='neg_mean_squared_error'))
    test_scores.append(np.mean(test_score))
import matplotlib.pyplot as plt
%matplotlib inline
plt.plot(params, test_scores)
plt.title("Param vs CV Error")

(Figure: cross-validated RMSE for each max_depth value)

The plot shows that max_depth = 6 works best, with an RMSE of roughly 0.49. So far we have added four features and used a random forest as the model. Next steps could include engineering new features (e.g., a simple flag for whether the search_term is contained in the title or description), trying other models such as LR, and then ensembling them.
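As a concrete next step, here is a minimal sketch of fitting the chosen configuration on the full training set and writing a submission (the competition's sample submission uses id and relevance columns; max_depth=6 follows the CV result above):

rf = RandomForestRegressor(n_estimators=30, max_depth=6)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
# clip predictions into the valid label range [1, 3] before submitting
y_pred = np.clip(y_pred, 1, 3)
pd.DataFrame({'id': ids.values, 'relevance': y_pred}).to_csv('submission.csv', index=False)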


I'm still learning; feedback is welcome!
