# Load the complete 20 Newsgroups corpus from scikit-learn
from sklearn.datasets import fetch_20newsgroups
news = fetch_20newsgroups(subset='all')
X, y = news.data, news.target
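As a quick check that the corpus loaded correctly, the number of documents and categories can be printed (the 'all' subset holds roughly 18,000 posts across 20 newsgroups); this snippet is only illustrative:
# Inspect the size of the loaded corpus
print(len(X))                  # number of newsgroup posts
print(len(news.target_names))  # number of categories (20)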
from bs4 import BeautifulSoup
# Import the nltk and re toolkits
import nltk, re
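If the punkt sentence tokenizer used below has not been installed yet, nltk will fail to load it; it can be fetched once with nltk's downloader (a one-off step, shown here only as a reminder):
# Download the punkt sentence tokenizer if it is not already installed (one-off)
nltk.download('punkt')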
# Define a function named news_to_sentences that strips the sentences out of a
# news article one by one and returns them as a list of tokenised sentences
def news_to_sentences(news):
    # Remove HTML markup and keep only the text
    news_text = BeautifulSoup(news, "html5lib").get_text()
    # Load the pre-trained English sentence tokenizer
    tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
    raw_sentences = tokenizer.tokenize(news_text)
    sentences = []
    for sent in raw_sentences:
        # Lower-case, strip, drop non-alphabetic characters, then split into words
        sentences.append(re.sub('[^a-zA-Z]', ' ', sent.lower().strip()).split())
    return sentences
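As a quick sanity check, the helper can first be applied to a single article; the exact tokens depend on the corpus, so the snippet below is only illustrative:
# Preview the first tokenised sentence of the first article
sample = news_to_sentences(X[0])
print(sample[0])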
# Apply the helper to every article and collect all sentences for training
sentences = []
for x in X:
    sentences += news_to_sentences(x)
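Before training it can be useful to confirm how many sentences were extracted (the exact count depends on the corpus):
print(len(sentences))  # total number of tokenised sentences fed to word2vec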
# Import word2vec from gensim.models
from gensim.models import word2vec
# Dimensionality of the word vectors
num_features = 300
# Minimum frequency a word must have to be kept in the vocabulary
min_word_count = 20
# Number of CPU cores to use for parallel training (increase on multi-core machines)
num_workers = 2
# Size of the context window used when training the word vectors
context = 5
# Downsampling threshold for very frequent words
downsampling = 1e-3
# Train the word2vec model with the settings above
model = word2vec.Word2Vec(sentences, workers=num_workers,
                          size=num_features, min_count=min_word_count,
                          window=context, sample=downsampling)
# Discard the raw training state and keep only the normalised vectors (saves memory)
model.init_sims(replace=True)
print(model.most_similar('hello'))
print(model.most_similar('email'))
print('end')
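Once training finishes, the model can be persisted and reloaded with gensim's save/load so the vectors do not have to be retrained; the file name below is only illustrative:
# Save the trained model to disk and reload it later (path is illustrative)
model.save('20news_word2vec.model')
# model = word2vec.Word2Vec.load('20news_word2vec.model')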
Calling BeautifulSoup(news).get_text() without specifying a parser produces a warning:
Warning (from warnings module):
  File "D:\Python35\lib\site-packages\bs4\__init__.py", line 181
    markup_type=markup_type))
UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html5lib"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.
The code that caused this warning is on line 1 of the file <string>. To get rid of this warning, change code that looks like this:
 BeautifulSoup(YOUR_MARKUP})
to this:
 BeautifulSoup(YOUR_MARKUP, "html5lib")
To silence the warning, add the "html5lib" parser argument, as follows:
BeautifulSoup(news, "html5lib").get_text()
The output of the similarity queries is as follows:
>>> print(model.most_similar('email'))
[('mail', 0.7399873733520508), ('contact', 0.6850252151489258), ('address', 0.6711879968643188), ('sas', 0.6611512303352356), ('replies', 0.6424497365951538), ('mailed', 0.6364169716835022), ('request', 0.6355448961257935), ('compuserve', 0.6323468685150146), ('send', 0.6153897047042847), ('internet', 0.59690260887146)]
>>> print(model.most_similar('hello'))
[('hi', 0.8492101430892944), ('netters', 0.6953952312469482), ('pl', 0.6211292147636414), ('dear', 0.5891242027282715), ('nh', 0.5402401685714722), ('scotia', 0.5400180220603943), ('tin', 0.5357101559638977), ('elm', 0.5321102142333984), ('greetings', 0.5246435403823853), ('hanover', 0.5063780546188354)]
>>>
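Besides most_similar, this version of gensim exposes other query methods directly on the trained model, such as pairwise similarity and odd-one-out; the words below are only illustrative and the results will vary between runs:
# Cosine similarity between two words in the vocabulary
print(model.similarity('email', 'mail'))
# Pick the word that does not belong with the others
print(model.doesnt_match(['email', 'mail', 'address', 'hello']))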