D:\programs\anaconda\lib\site-packages\ipykernel_launcher.py:1: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.
To accept the future behavior, pass ‘sort=True’.
To retain the current behavior and silence the warning, pass sort=False
“”“Entry point for launching an IPython kernel.
from nltk.stem.snowball import SnowballStemmer
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
D:\programs\anaconda\lib\site-packages\ipykernel_launcher.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
“”“Entry point for launching an IPython kernel.
D:\programs\anaconda\lib\site-packages\ipykernel_launcher.py:2: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
train['all_text'][0:5]
0 simpson strong-ti angl . angl make joint stron…
1 simpson strong-ti angl . angl make joint stron…
2 behr premium textur deckov tugboat wood concre…
3 delta vero shower faucet trim kit chrome (valv…
4 delta vero shower faucet trim kit chrome (valv…
Name: all_text, dtype: object
###生成語料
根據train中的all_text生成語料,先使用tokenize將句子分成一個個單詞;再用gensim.corpora.Dictionary來實現對語料中的每一個單詞關聯一個唯一的ID,這個字典定義了我們要處理的所有單詞表。
from gensim.utils import tokenize
#使用的是gensim庫中的tokenize方法,所以後續對test的處理要一致from gensim.corpora.dictionary import Dictionary
dictionary = Dictionary(list(tokenize(x, errors='ignore')) for x in train['all_text'].values)
print(dictionary)
classcorpus:def__iter__(self):for x in train['all_text'].values:
yield dictionary.doc2bow(list(tokenize(x, errors='ignore')))
#dictionary.doc2bow是獲得語料的向量表達式
train_corpus = corpus()
count=0for c in train_corpus:
print(c)
count+=1if count >2:
break
D:\programs\anaconda\lib\site-packages\ipykernel_launcher.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
“”“Entry point for launching an IPython kernel.
D:\programs\anaconda\lib\site-packages\ipykernel_launcher.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
“”“Entry point for launching an IPython kernel.
D:\programs\anaconda\lib\site-packages\ipykernel_launcher.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
“”“Entry point for launching an IPython kernel.
D:\programs\anaconda\lib\site-packages\ipykernel_launcher.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
“”“Entry point for launching an IPython kernel.
D:\programs\anaconda\lib\site-packages\ipykernel_launcher.py:1: DeprecationWarning: Call to deprecated `__getitem__` (Method will be removed in 4.0.0, use self.wv.__getitem__() instead).
“”“Entry point for launching an IPython kernel.
(128,)
獲得一個單詞的w2c後,對於一個句子,我們可以取句子所有單詞的平均值作爲句子的w2c向量。
# 得到任意text的vectordefget_vector(text):# 全是0的,大小是128的array
res =np.zeros([128])
count = 0for word in word_tokenize(text):
if word in vocab:
res += w2c[word]
count += 1return res/count
w2c_cos_sim('hello world','hello from the other side')
D:\programs\anaconda\lib\site-packages\ipykernel_launcher.py:8: DeprecationWarning: Call to deprecated `__getitem__` (Method will be removed in 4.0.0, use self.wv.__getitem__() instead).
0.07032644504070107
D:\programs\anaconda\lib\site-packages\ipykernel_launcher.py:8: DeprecationWarning: Call to deprecated `__getitem__` (Method will be removed in 4.0.0, use self.wv.__getitem__() instead).
D:\programs\anaconda\lib\site-packages\ipykernel_launcher.py:10: RuntimeWarning: invalid value encountered in true_divide
# Remove the CWD from sys.path while we load stuff.
D:\programs\anaconda\lib\site-packages\ipykernel_launcher.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
“”“Entry point for launching an IPython kernel.
D:\programs\anaconda\lib\site-packages\ipykernel_launcher.py:2: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
D:\programs\anaconda\lib\site-packages\ipykernel_launcher.py:4: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
after removing the cwd from sys.path.
D:\programs\anaconda\lib\site-packages\ipykernel_launcher.py:5: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
“””