Document Filtering


Document filtering is about classifying documents based on their contents. Perhaps the most useful and well-known application of document filtering is the elimination of spam.

Because these classifiers learn to recognize whether a document belongs in one category or another, they can also be used for other, less unsavory purposes.


Filtering Spam

Early attempts to filter spam were all rule-based classifiers, where a person would design a set of rules intended to indicate whether or not a message was spam. The shortcomings of rule-based classifiers quickly became apparent: spammers learned all the rules and stopped exhibiting the obvious behaviors in order to get around the filters.

The other problem with rule-based filters is that what can be considered spam varies depending on where it’s being posted and for whom it is being written. Keywords that would strongly indicate spam for one particular user, message board, or Wiki may be quite normal for others.

Documents and Words

The classifier that you will be building needs features to use for classifying different
items. A feature is anything that you can determine as being either present or absent
in the item. When considering documents for classification, the items are the documents
and the features are the words in the document.

Determining which features to use is both very tricky and very important.
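For documents, the simplest choice of getfeatures is a function that splits the text into its individual words. A minimal sketch of such a function is below; the length limits and the exact splitting rule are just one reasonable choice, not necessarily the only one:

    import re

    def getwords(doc):
        # Split the document on runs of non-word characters
        splitter = re.compile(r'\W+')
        words = [s.lower() for s in splitter.split(doc)
                 if 2 < len(s) < 20]
        # Return unique words only: a feature is either present or absent,
        # regardless of how many times it appears
        return {w: 1 for w in words}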

Training the Classifier

The first thing you’ll need is a class to represent the classifier. This class will encapsulate
what the classifier has learned so far. The advantage of structuring the module
this way is that you can instantiate multiple classifiers for different users, groups, or
queries, and train them differently to respond to a particular group’s needs.


The classifier has three instance variables: fc, cc, and getfeatures.

1) The fc variable will store the counts for different features in different classifications. For example:
{'python': {'bad': 0, 'good': 6}, 'the': {'bad': 3, 'good': 3}}

This shows how often the words "python" and "the" have appeared in documents classified as spam (bad) and not spam (good).

2) The cc variable is a dictionary of how many times every classification has been used.

3) getfeatures is the function that will be used to extract the features from the items being classified.


The train method takes an item (a document in this case) and a classification. It uses the getfeatures function of the class to break the item into its separate features. It then calls incf to increase the counts for this classification for every feature. Finally, it increases the total count for this classification:
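In code, a minimal sketch of such a class might look like the following, assuming helper methods incf, incc, fcount, and catcount named after the variables above (the exact implementation may differ):

    class classifier:
        def __init__(self, getfeatures):
            self.fc = {}   # feature counts per category: {feature: {category: count}}
            self.cc = {}   # item counts per category: {category: count}
            self.getfeatures = getfeatures

        def incf(self, f, cat):
            # Increase the count of a feature/category pair
            self.fc.setdefault(f, {})
            self.fc[f].setdefault(cat, 0)
            self.fc[f][cat] += 1

        def incc(self, cat):
            # Increase the count of a category
            self.cc.setdefault(cat, 0)
            self.cc[cat] += 1

        def fcount(self, f, cat):
            # How many times a feature has appeared in a category
            return float(self.fc.get(f, {}).get(cat, 0))

        def catcount(self, cat):
            # How many items have been classified with this category
            return float(self.cc.get(cat, 0))

        def totalcount(self):
            # Total number of items seen so far
            return sum(self.cc.values())

        def train(self, item, cat):
            features = self.getfeatures(item)
            # Increment the count for every feature with this category
            for f in features:
                self.incf(f, cat)
            # Increment the count for this category
            self.incc(cat)

For example, cl = classifier(getwords) followed by cl.train('the quick brown fox jumps', 'good') would update both fc and cc.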


Calculating Probabilities


>>> cl.fprob('quick','good')

0.66666666666666663


You can see that the word "quick" (the feature) appears in two of the three documents classified as good (the category), which is where the 2/3 comes from.
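Continuing the classifier sketch above, fprob would simply divide the feature count by the category count:

    def fprob(self, f, cat):
        # Pr(feature | category): fraction of items in this category
        # that contain the feature
        if self.catcount(cat) == 0:
            return 0
        return self.fcount(f, cat) / self.catcount(cat)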

The fprob estimate becomes very unreliable early in training, when you have very little information about the feature in question, so it helps to decide on an assumed probability to use until more data is available. A good number to start with is 0.5.
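One way to do this, sketched below (the method name weightedprob and the weight parameter are assumptions here), is to take a weighted average of the assumed probability and the observed fprob, giving the observed value more weight the more often the feature has been seen:

    def weightedprob(self, f, cat, prf, weight=1.0, ap=0.5):
        # Probability estimated from the training data so far
        basicprob = prf(f, cat)
        # Number of times this feature has appeared across all categories
        totals = sum(self.fcount(f, c) for c in self.cc)
        # Weighted average of the assumed probability and the observed one
        return ((weight * ap) + (totals * basicprob)) / (weight + totals)

For a word that has never been seen, totals is 0 and the result is exactly 0.5; as training data accumulates, the result converges toward fprob.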


Naïve Bayes Classifier

This method is called naïve because it assumes that the probabilities being combined
are independent of each other.


This is actually a false assumption, since you’ll probably find that documents
containing the word “casino” are much more likely to contain the word
“money” than documents about Python programming are.

To use the naïve Bayesian classifier, you’ll first have to determine the probability of
an entire document being given a classification.


For example, suppose you’ve noticed that the word “Python” appears in 20 percent
of your bad documents—Pr(Python | Bad) = 0.2—and that the word “casino”
appears in 80 percent of your bad documents (Pr(Casino | Bad) = 0.8). You would
then expect the independent probability of both words appearing in a bad document—
Pr(Python & Casino | Bad)—to be 0.8 × 0.2 = 0.16. From this you can see
that calculating the entire document probability is just a matter of multiplying
together all the probabilities of the individual words in that document.
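As a rough sketch (the class name naivebayes and the reuse of the weightedprob and fprob sketches above are assumptions), the document probability, and its combination with the category probability via Bayes' theorem discussed next, could look like this:

    class naivebayes(classifier):
        def docprob(self, item, cat):
            # Pr(Document | Category): product of the individual feature probabilities
            features = self.getfeatures(item)
            p = 1.0
            for f in features:
                p *= self.weightedprob(f, cat, self.fprob)
            return p

        def prob(self, item, cat):
            # Pr(Category | Document) is proportional to
            # Pr(Document | Category) * Pr(Category)
            catprob = self.catcount(cat) / self.totalcount()
            return self.docprob(item, cat) * catprob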


I still didn't quite get what Naive Bayes classification actually is from the above, so let's look at an easier explanation:


This theorem solves a problem we run into all the time in real life: given one conditional probability, how do we get the probability with the two events swapped? In other words, knowing P(A|B), how do we find P(B|A)?


Bayes' theorem is useful because this situation comes up constantly: P(A|B) is easy to obtain directly, while P(B|A) is hard to get at directly, yet P(B|A) is the quantity we actually care about. Bayes' theorem gives us a route from P(A|B) to P(B|A). The formula is:

P(A|B) = P(AB) / P(B)

P(B|A) = P(AB) / P(A) = P(A|B) P(B) / P(A)


http://www.cnblogs.com/leoo2sk/archive/2010/09/17/naive-bayesian-classifier.html

After reading the SNS real-account detection example in that blog post, it should all make sense...

That example also shows that, when there are enough feature attributes, naive Bayes classification is fairly robust to noise in any individual attribute.


Input:

1) Feature attributes: X = {a1, a2, a3, ..., an}

2) Classes: Y = {Y1, Y2, ..., Ym}

3) Training samples


Output:

Given an individual x, determine which class Yk it belongs to.


Example: suppose we want to guess which Chinese province a man comes from. His feature values are {1.72, 0.75, yellow}.


Feature attributes: X = {height, accent, skin color}

Classes: {DB, HN, GD}

Feature attribute partitions: height z is split into three ranges (z < 1.65, 1.65 <= z < 1.75, z >= 1.75); accent h, measured as closeness to standard Mandarin, into {h < 0.3, 0.3 <= h < 0.7, h >= 0.7}; skin color into {white, yellow, black}.


Sample statistics:

1) Compute the frequency of each class in the training samples:

P(DB) = 0.32  P(HN) = 0.43  P(GD) = 0.25

2) Compute the frequency of each feature-attribute partition conditional on each class.

Frequencies of the three attribute partitions for class DB:

P(z < 1.65 | DB) = 0.1

P(1.65 <= z < 1.75 | DB) = 0.3

P(z >= 1.75 | DB) = 0.6


P(h < 0.3 | DB) = 0.15

P(0.3 <= h < 0.7 | DB) = 0.1

P(h >= 0.7 | DB) = 0.75


P(white | DB) = 0.35

P(yellow | DB) = 0.55

P(black | DB) = 0.1


Frequencies of the three attribute partitions for class HN:

P(z < 1.65 | HN) = 0.1

P(1.65 <= z < 1.75 | HN) = 0.65

P(z >= 1.75 | HN) = 0.25


P(h < 0.3 | HN) = 0.1

P(0.3 <= h < 0.7 | HN) = 0.15

P(h >= 0.7 | HN) = 0.75


P(white | HN) = 0.2

P(yellow | HN) = 0.6

P(black | HN) = 0.2

Frequencies of the three attribute partitions for class GD:

P(z < 1.65 | GD) = 0.35

P(1.65 <= z < 1.75 | GD) = 0.45

P(z >= 1.75 | GD) = 0.2


P(h < 0.3 | GD) = 0.8

P(0.3 <= h < 0.7 | GD) = 0.1

P(h >= 0.7 | GD) = 0.1


P(white | GD) = 0.1

P(yellow | GD) = 0.55

P(black | GD) = 0.35

3) Determine the sample's class.

Probability that sample x = {1.72, 0.75, yellow} belongs to DB:

P(DB) P(x|DB) = P(DB) * P(1.65 <= z < 1.75 | DB) * P(h >= 0.7 | DB) * P(yellow | DB) = 0.32 * 0.3 * 0.75 * 0.55 = 0.0396


Probability that it belongs to HN:

P(HN) P(x|HN) = P(HN) * P(1.65 <= z < 1.75 | HN) * P(h >= 0.7 | HN) * P(yellow | HN) = 0.43 * 0.65 * 0.75 * 0.6 = 0.125775


Probability that it belongs to GD:

P(GD) P(x|GD) = P(GD) * P(1.65 <= z < 1.75 | GD) * P(h >= 0.7 | GD) * P(yellow | GD) = 0.25 * 0.45 * 0.1 * 0.55 = 0.0061875


So x belongs to HN.
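To double-check the arithmetic, here is a quick Python sketch using the numbers from the tables above:

    # Class priors from the training data
    priors = {'DB': 0.32, 'HN': 0.43, 'GD': 0.25}

    # P(partition | class) for the sample x = {1.72, 0.75, yellow}:
    # the partitions 1.65 <= z < 1.75, h >= 0.7, and yellow
    likelihoods = {
        'DB': [0.3, 0.75, 0.55],
        'HN': [0.65, 0.75, 0.6],
        'GD': [0.45, 0.1, 0.55],
    }

    scores = {}
    for cat, prior in priors.items():
        score = prior
        for p in likelihoods[cat]:
            score *= p
        scores[cat] = score

    print(scores)                       # roughly {'DB': 0.0396, 'HN': 0.125775, 'GD': 0.0061875}
    print(max(scores, key=scores.get))  # HN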














