Lucene in Action notes: term vectors

Leveraging term vectors
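A term vector is a multi-dimensional vector of term frequencies built for one text field of a document, such as title or body. Each distinct term is one dimension, and the value along that dimension is the number of times the term occurs in that field.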
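
To make the idea concrete, here is a minimal sketch in plain Java that does not use Lucene at all. The class name TermVectorSketch and the whitespace tokenizer are assumptions made for illustration; a real Lucene analyzer tokenizes and normalizes text differently.

```java
import java.util.Map;
import java.util.TreeMap;

final class TermVectorSketch {
    /** Counts how often each lowercased, whitespace-separated token occurs in one field's text. */
    static Map<String, Integer> termFrequencies(String fieldText) {
        Map<String, Integer> vector = new TreeMap<>();
        for (String token : fieldText.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                // One dimension per distinct term; the value is its frequency.
                vector.merge(token, 1, Integer::sum);
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        // prints {action=1, in=1, lucene=2}
        System.out.println(termFrequencies("lucene in action lucene"));
    }
}
```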

To use term vectors, you must enable the term vector option for that field at indexing time:

Field options for term vectors
TermVector.YES – record the unique terms that occurred, and their counts, in each document, but do not store any positions or offsets information.
TermVector.WITH_POSITIONS – record the unique terms and their counts, and also the positions of each occurrence of every term, but no offsets.
TermVector.WITH_OFFSETS – record the unique terms and their counts, with the offsets (start & end character position) of each occurrence of every term, but no positions.
TermVector.WITH_POSITIONS_OFFSETS – store unique terms and their counts, along with positions and offsets.
TermVector.NO – do not store any term vector information.
If Index.NO is specified for a field, then you must also specify TermVector.NO.

After indexing, given a document id and a field name, you can read the term vector back from an IndexReader (provided term vectors were stored for that field at indexing time):
TermFreqVector termFreqVector = reader.getTermFreqVector(id, "subject");
You can iterate over this TermFreqVector to get each term and its frequency. If you chose to store offsets and positions at indexing time, those are available here as well.

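With term vectors we can build some interesting applications:
1) Books like this
The goal is to judge whether two books are similar. Each book is modeled as a document with author and subject fields, and similarity is computed from these two fields.
The author field is multi-valued, that is, a book can have several authors. The first step is to match on authors: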
String[] authors = doc.getValues("author");
BooleanQuery authorQuery = new BooleanQuery(); // #3
for (int i = 0; i < authors.length; i++) { // #3
    String author = authors[i]; // #3
    authorQuery.add(new TermQuery(new Term("author", author)), BooleanClause.Occur.SHOULD); // #3
}
authorQuery.setBoost(2.0f);
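Finally, the boost of this query is raised to give the condition a higher weight: if two books share an author, they are very likely similar.
The second step uses the term vector. The usage is simple: it just looks for books whose subject field contains the same terms as this book's subject term vector.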
TermFreqVector vector = // #4
reader.getTermFreqVector(id, "subject"); // #4
BooleanQuery subjectQuery = new BooleanQuery(); // #4
for (int j = 0; j < vector.size(); j++) { // #4
    TermQuery tq = new TermQuery(new Term("subject", vector.getTerms()[j]));
    subjectQuery.add(tq, BooleanClause.Occur.SHOULD); // #4
}

2) What category?
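This example is a bit more advanced than the previous one: classification. Again we work with the term vector of each document's subject field.
For two documents, we can compare the angle between their term vectors in the vector space; the smaller the angle, the more similar the documents.
Classification needs a training step: we must build a term vector for each category as a reference against which other documents are compared.
Here a term vector is implemented as a map of (term, frequency) pairs, so a map with n entries represents an n-dimensional vector. We need one such term vector per category, and another map links each category to its term vector. A category's term vector is built like this:
For each document in the category, take the document's term vector and add it to the category's term vector.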
private void addTermFreqToMap(Map vectorMap, TermFreqVector termFreqVector) {
    String[] terms = termFreqVector.getTerms();
    int[] freqs = termFreqVector.getTermFrequencies();
    for (int i = 0; i < terms.length; i++) {
        String term = terms[i];
        if (vectorMap.containsKey(term)) {
            Integer value = (Integer) vectorMap.get(term);
            vectorMap.put(term, new Integer(value.intValue() + freqs[i]));
        } else {
            vectorMap.put(term, new Integer(freqs[i]));
        }
    }
}
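The method first reads the parallel term and frequency arrays from the document's term vector. For each term it then looks up the category's term vector and adds the document's frequency to the existing count, or inserts the term if it is not there yet.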
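
The same accumulation can be sketched with generics and Map.merge. This is only an illustrative variant, not the book's code: the class name CategoryVectorSketch is hypothetical, and the terms and frequencies are passed as plain arrays (standing in for what getTerms() and getTermFrequencies() return) so the sketch runs without Lucene.

```java
import java.util.HashMap;
import java.util.Map;

final class CategoryVectorSketch {
    /** Adds one document's term counts (parallel arrays) into the category's running totals. */
    static void addTermFreqs(Map<String, Integer> categoryVector, String[] terms, int[] freqs) {
        for (int i = 0; i < terms.length; i++) {
            // merge() inserts the count for a new term or sums it with the existing one.
            categoryVector.merge(terms[i], freqs[i], Integer::sum);
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> technology = new HashMap<>();
        addTermFreqs(technology, new String[] {"java", "search"}, new int[] {2, 1});
        addTermFreqs(technology, new String[] {"java", "index"}, new int[] {1, 3});
        System.out.println(technology.get("java")); // prints 3
    }
}
```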

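With a term vector for each category, we can now compute the angle between a document and a category:
cos θ = (A · B) / (|A| |B|)
A · B is the dot product: multiply the two vectors dimension by dimension and sum the results.
To simplify the computation, assume a term's frequency in the document is either 0 or 1, meaning only absent or present.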
private double computeAngle(String[] words, String category) {
    // assume words are unique and only occur once
    Map vectorMap = (Map) categoryMap.get(category);
    int dotProduct = 0;
    int sumOfSquares = 0;
    for (int i = 0; i < words.length; i++) {
        String word = words[i];
        int categoryWordFreq = 0;
        if (vectorMap.containsKey(word)) {
            categoryWordFreq = ((Integer) vectorMap.get(word)).intValue();
        }
        dotProduct += categoryWordFreq; // optimized because we assume frequency in words is 1
        sumOfSquares += categoryWordFreq * categoryWordFreq;
    }
    double denominator;
    if (sumOfSquares == words.length) {
        // avoid precision issues for special case
        denominator = sumOfSquares; // sqrt x * sqrt x = x
    } else {
        denominator = Math.sqrt(sumOfSquares) *
        Math.sqrt(words.length);
    }
    double ratio = dotProduct / denominator;
    return Math.acos(ratio);
}
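This function is a fairly direct implementation of the formula above and returns the angle itself rather than its cosine. Note that sumOfSquares is accumulated only over the words of the document, so the length of the category vector is approximated rather than computed over all of its terms.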
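
For comparison, here is a sketch of the cosine formula without the 0/1 simplification. It is not the book's computeAngle: it uses the real frequencies on both sides and takes each vector's length over all of its terms. The class and method names are hypothetical, and returning 0 for an empty vector is a choice made here to avoid dividing by zero.

```java
import java.util.Map;

final class CosineSketch {
    /** Cosine of the angle between two term-frequency vectors; 0 if either vector is empty. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += (double) e.getValue() * other; // only shared terms contribute
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        if (normA == 0 || normB == 0) {
            return 0; // assumption: an empty vector is treated as unrelated
        }
        return dot / (normA * normB);
    }

    /** Euclidean length of a term-frequency vector. */
    static double norm(Map<String, Integer> v) {
        double sumOfSquares = 0;
        for (int freq : v.values()) {
            sumOfSquares += (double) freq * freq;
        }
        return Math.sqrt(sumOfSquares);
    }

    /** Angle in radians; the cosine is clamped to [-1, 1] to guard against rounding error. */
    static double angle(Map<String, Integer> a, Map<String, Integer> b) {
        return Math.acos(Math.max(-1.0, Math.min(1.0, cosine(a, b))));
    }

    public static void main(String[] args) {
        Map<String, Integer> doc = Map.of("lucene", 1, "search", 1);
        Map<String, Integer> category = Map.of("lucene", 3, "search", 4);
        System.out.println(angle(doc, category));
    }
}
```

To classify a document, one would compute this angle against every category vector and pick the category with the smallest angle.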

3) MoreLikeThis

For finding similar documents, Lucene also provides a more efficient facility, the MoreLikeThis class:

http://lucene.apache.org/java/1_9_1/api/org/apache/lucene/search/similar/MoreLikeThis.html

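With the approach above, we could compute the cosine for every pair of documents, sort by cosine, and pick the most similar ones. The main problem is the computational cost, which becomes nearly unacceptable when the number of documents is large. There are specialized techniques that optimize the cosine method and greatly reduce the computation. The cosine method is accurate, but the barrier to implementing it well is higher.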

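The principle behind MoreLikeThis is simple: from a document, extract only the interesting terms (those with a high tf×idf), then use Lucene to search for documents containing the same terms and treat them as similar. The advantage is efficiency; the drawback is lower accuracy. The class exposes many parameters that control how interesting terms are selected.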
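
The term selection step can be sketched in plain Java as follows. This is not MoreLikeThis's actual implementation: the class name, the method signature, and the idf formula log(numDocs / (docFreq + 1)) + 1 are assumptions for illustration, and the document frequencies are passed in as a map rather than read from an index.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class InterestingTermsSketch {
    /** Returns up to maxTerms terms of one document, ranked by descending tf * idf. */
    static List<String> interestingTerms(Map<String, Integer> termFreqs,
                                         Map<String, Integer> docFreqs,
                                         int numDocs, int maxTerms) {
        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            int docFreq = docFreqs.getOrDefault(e.getKey(), 0);
            // Assumed idf formula: terms found in many documents get a low weight.
            double idf = Math.log((double) numDocs / (docFreq + 1)) + 1.0;
            scores.put(e.getKey(), e.getValue() * idf);
        }
        List<String> terms = new ArrayList<>(scores.keySet());
        terms.sort(Comparator.comparingDouble((String t) -> scores.get(t)).reversed());
        return new ArrayList<>(terms.subList(0, Math.min(maxTerms, terms.size())));
    }

    public static void main(String[] args) {
        Map<String, Integer> termFreqs = Map.of("the", 3, "lucene", 2, "vector", 1);
        Map<String, Integer> docFreqs = Map.of("the", 1000, "lucene", 10, "vector", 50);
        // "the" is frequent in the document but appears everywhere, so it ranks last.
        System.out.println(interestingTerms(termFreqs, docFreqs, 1000, 2)); // prints [lucene, vector]
    }
}
```

The selected terms would then be combined into a query, much like the SHOULD clauses in the "Books like this" example.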

MoreLikeThis mlt = new MoreLikeThis(ir);
Reader target = ...
// orig source of doc you want to find similarities to
Query query = mlt.like(target);
Hits hits = is.search(query);

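Usage is simple, and this gives you the similar documents.

The class is also flexible: instead of calling like directly, you can call
retrieveInterestingTerms(Reader r)

This returns the interesting terms, and what you do with them is up to you.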
