A New Small Example of Chinese Text Mining, with Code

Original post

robinliu2010

2020-06-28 15:49

http://bbs.pinggu.org/thread-853290-1-1.html

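Chinese word segmentation:
Because tm and openNLP support Chinese poorly, the segmenter used here is imdict-chinese-analyzer, an intelligent word segmenter based on a hierarchical hidden Markov model (HHMM), developed by Dr. Zhang Huaping of the Chinese Academy of Sciences.
Segmentation results: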

zw <- c("如果你聽到某人說他使用某軟體，然後看看效果，有些美中不足，那就叫《星光燦爛》吧！thus do not have the texts already
           stored on a hard disk, and want to save the text documents to disk")
1. With stop words removed:
zwfc(zw,zj1)
[1] "聽某人說使用軟體看看效果美中不足星光燦爛 thu text alreadi store hard disk save text document
disk time: 0.109 s"
2. Without removing stop words:
zwfc(zw,zj1)
[1] "如果你聽到某人說他使用某軟體 , 然後看看效果 , 有些美中不足 , 那就叫 , 星光燦爛 , 吧
, thu do not have the text alreadi store on a hard disk , and want to save the text document to disk time: 0.0 s"
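
The `zwfc` function itself is not shown in this post, so the following is only a minimal sketch of the stop-word step in base R, assuming the text has already been split into tokens. The token vector and the stop-word list are hypothetical and are not the lists used by `zwfc`; the stemming visible in the output above ("thu", "alreadi") is not reproduced here.

```r
# Hypothetical tokens, as a segmenter might return them.
tokens <- c("thus", "do", "not", "have", "the", "texts")

# Hypothetical stop-word list (an assumption, not the list used by zwfc).
stop_words <- c("do", "not", "have", "the")

# Keep only tokens that are not in the stop-word list, preserving order.
kept <- tokens[!tokens %in% stop_words]

# Re-join into one space-separated string, the form tm expects later on.
result <- paste(kept, collapse = " ")
print(result)
```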

The segmenter still handles personal names and place names poorly; most of them are split into single characters.

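Here is a simple example:
1. Install the tm and rJava packages, and install the Java runtime environment from the Sun website.
2. Unzip the archive below to the root of the C: drive.
3. Run the program in R.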
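
The steps above could look roughly like this in R. This is a minimal sketch, not the original program: the archive's contents are not reproduced here, so a small hypothetical set of already-segmented English stand-in strings (`segmented`) takes the place of the segmenter's output, and the `wordLengths` setting is an assumption about what suits Chinese tokens.

```r
# Assumes tm is already installed, e.g. via:
#   install.packages(c("tm", "rJava"))
library(tm)

# Hypothetical stand-in for segmenter output: one string per document,
# with tokens already separated by spaces.
segmented <- c(
  "spain beat germany in the semifinal",
  "spain spain worldcup semifinal",
  "uruguay lost the semifinal"
)

# For files on disk, DirSource("c:/text") could be used instead of
# VectorSource; the files would still need to be segmented first.
corpus <- VCorpus(VectorSource(segmented))

# wordLengths = c(1, Inf) keeps short tokens. The default lower bound of 3
# would drop one- and two-character words, which are common in Chinese.
dtm <- DocumentTermMatrix(corpus, control = list(wordLengths = c(1, Inf)))

print(dim(dtm))
```

The resulting `dtm` is the kind of object that the `findFreqTerms`, `findAssocs` and `removeSparseTerms` calls in the results operate on.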

Results:

Five files in total:
$FileList
[1] "c:/text/荷蘭隊長上演驚天遠射.txt"
[2] "c:/text/技術化轉型路上德國人受重創.txt"
[3] "c:/text/普約爾貢獻頭球絕殺.txt"
[4] "c:/text/四大天王沉淪各有難唸的經.txt"
[5] "c:/text/再戰德班德西命運迥異.txt"
-----------------------------------------
1. Find the terms that occur at least 5 times ("烏拉圭" is Uruguay, "西班牙" is Spain) ##
> findFreqTerms(dtm, 5)
[1] "烏拉圭" "西班牙"
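
To make the threshold concrete, here is a small self-contained sketch. The three documents are hypothetical English stand-ins, not the five football articles, and the cross-check with `colSums` is only there to show what `findFreqTerms` counts: the total frequency of each term over all documents.

```r
library(tm)

# Hypothetical pre-segmented documents.
docs <- c(
  "spain spain spain worldcup",
  "spain spain uruguay worldcup",
  "uruguay uruguay uruguay uruguay"
)
dtm <- DocumentTermMatrix(VCorpus(VectorSource(docs)))

# Terms whose total count across all documents is at least 5.
frequent <- findFreqTerms(dtm, 5)
print(frequent)

# Cross-check: column sums of the dense matrix give the same totals.
totals <- colSums(as.matrix(dtm))
print(totals[totals >= 5])
```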
--------------------------------------------
2. Find the terms whose correlation with "西班牙" (Spain) is at least 0.8 ###
> findAssocs(dtm, "西班牙", 0.8)
西班牙 德意志
  1.00   0.92
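
The association score is the correlation between the per-document counts of two terms. The sketch below uses three hypothetical stand-in documents, built so that "germany" always occurs exactly as often as "spain" while "uruguay" moves in the opposite direction. In recent tm versions the result is, as far as I can tell, a list keyed by the query term that leaves out the term itself, unlike the older output above where "西班牙" is listed with 1.00.

```r
library(tm)

# Hypothetical pre-segmented documents.
docs <- c(
  "uruguay uruguay",
  "spain spain germany germany",
  "spain germany uruguay"
)
dtm <- DocumentTermMatrix(VCorpus(VectorSource(docs)))

# Terms correlated with "spain" at 0.8 or above.
assoc <- findAssocs(dtm, "spain", 0.8)
print(assoc)

# The same quantity computed directly from the dense count matrix.
m <- as.matrix(dtm)
r_germany <- cor(m[, "spain"], m[, "germany"])
r_uruguay <- cor(m[, "spain"], m[, "uruguay"])
print(c(germany = r_germany, uruguay = r_uruguay))
```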
--------------------------------------------
Document-term matrix after removing sparse terms (terms absent from more than 40% of the documents):
inspect(removeSparseTerms(dtm, 0.4))
A document-term matrix (5 documents, 5 terms)
Non-/sparse entries: 22/3
Sparsity           : 12%
Maximal term length: 5
Weighting          : term frequency (tf)
     Terms
Docs 0.0 time: 半決賽 世界盃 西班牙
    1   0     1      1      2      0
    2   1     1      1      1      5
    3   1     1      1      2      4
    4   1     1      0      3      1
    5   1     1      1      1      7
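
The second argument of `removeSparseTerms` is the maximal allowed sparsity, that is, the largest share of documents in which a term may be missing. A minimal sketch with five hypothetical stand-in documents (not the original articles) shows which terms survive a threshold of 0.4:

```r
library(tm)

# Hypothetical pre-segmented documents:
# "spain" is in 5 of 5, "worldcup" in 4, "semifinal" in 2, "uruguay" in 1.
docs <- c(
  "spain worldcup",
  "spain worldcup semifinal",
  "spain worldcup",
  "spain uruguay",
  "spain worldcup semifinal"
)
dtm <- DocumentTermMatrix(VCorpus(VectorSource(docs)))

# Keep only terms that are missing from fewer than 40% of the documents.
kept <- removeSparseTerms(dtm, 0.4)
kept_terms <- Terms(kept)
print(kept_terms)
```

With five documents, a term has to appear in at least four of them to be kept, which matches the matrix above, where every remaining term has at most one zero entry.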
----------------------------------------
### Dictionary ### A dictionary is typically used to denote the terms relevant to the text mining task
A document-term matrix (5 documents, 3 terms)
Non-/sparse entries: 13/2
Sparsity           : 13%
Maximal term length: 3
Weighting          : term frequency (tf)
     Terms
Docs 半決賽 世界盃 西班牙
    1      1      2      0
    2      1      1      5
    3      1      2      4
    4      0      3      1
    5      1      1      7
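
The call that produced this matrix is not shown in the post. One way to get a matrix restricted to a fixed term list in tm is the `dictionary` control option of `DocumentTermMatrix`; the sketch below assumes that approach and uses hypothetical stand-in documents and a hypothetical dictionary.

```r
library(tm)

# Hypothetical pre-segmented documents.
docs <- c(
  "spain beat germany semifinal",
  "spain worldcup semifinal uruguay"
)
corpus <- VCorpus(VectorSource(docs))

# Hypothetical dictionary: only these terms are counted.
dict <- c("semifinal", "spain", "worldcup")

dtm_dict <- DocumentTermMatrix(corpus, control = list(dictionary = dict))
dict_terms <- Terms(dtm_dict)
print(dict_terms)
print(as.matrix(dtm_dict))
```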

This article comes from the Renda Economics Forum (人大經濟論壇), S-Plus & R board. For the original, see: http://bbs.pinggu.org/forum.php?mod=viewthread&tid=853290&page=1
