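What is TF-IDF
TF-IDF stands for Term Frequency-Inverse Document Frequency (in Chinese, 詞頻-逆文本頻率指數).
TF is the frequency with which a word occurs: if a word appears n times in an article that contains N words in total, then TF = n/N.
The inverse document frequency (IDF) expresses the weight of a word. It is computed as IDF_i = log(D/Dw), where D is the total number of documents and Dw is the number of documents in which word i appears.
The article "TF-IDF原理及使用" (TF-IDF: principles and usage) explains this in detail.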
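As a quick illustration of the two formulas, TF = n/N and IDF = log(D/Dw), here is a minimal sketch. The class and method names are made up for this example, and the natural logarithm is an assumption, since the formula does not fix a base.

```java
// Minimal sketch of TF = n/N and IDF = log(D/Dw).
// Names are illustrative only; the log base (natural log) is an assumption.
final class TfIdfFormula {
    /** TF: occurrences of the word divided by the total number of words. */
    static double tf(int n, int totalWords) {
        return (double) n / totalWords;
    }

    /** IDF: log of (total documents / documents containing the word). */
    static double idf(int totalDocs, int docsWithWord) {
        return Math.log((double) totalDocs / docsWithWord);
    }

    /** TF-IDF weight of one word in one document. */
    static double tfIdf(int n, int totalWords, int totalDocs, int docsWithWord) {
        return tf(n, totalWords) * idf(totalDocs, docsWithWord);
    }

    public static void main(String[] args) {
        // A word that appears 3 times in a 100-word article
        // and occurs in 10 out of 1000 documents.
        System.out.println(tfIdf(3, 100, 1000, 10));
    }
}
```

A word that appears in every document gets IDF = log(1) = 0, so very common words contribute nothing to the weight.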
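What is cosine similarity
Cosine similarity uses the cosine of the angle between two vectors in a vector space to measure how much two items differ. The closer the cosine is to 1, the closer the angle is to 0 degrees, and the more similar the two vectors are.
Suppose vectors a and b have coordinates (x1, y1) and (x2, y2). Then:
cos(θ) = (x1·x2 + y1·y2) / (sqrt(x1² + y1²) · sqrt(x2² + y2²))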
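For two-dimensional vectors, the cosine of the angle is the dot product divided by the product of the two vector lengths. The sketch below shows that calculation. The class name is invented for this example, and returning 0 for a zero-length vector is my own convention, not something the definition prescribes.

```java
// Minimal sketch: cosine of the angle between (x1, y1) and (x2, y2).
// The class name is illustrative; the zero-vector handling is an assumption.
final class Cosine2D {
    static double cosine(double x1, double y1, double x2, double y2) {
        double dot = x1 * x2 + y1 * y2;
        double lengths = Math.sqrt(x1 * x1 + y1 * y1) * Math.sqrt(x2 * x2 + y2 * y2);
        // The cosine is undefined for a zero-length vector; return 0 by convention here.
        return lengths == 0 ? 0.0 : dot / lengths;
    }

    public static void main(String[] args) {
        System.out.println(cosine(1, 0, 2, 0)); // same direction
        System.out.println(cosine(1, 0, 0, 1)); // perpendicular
    }
}
```

Vectors pointing the same way give a value of 1 regardless of their lengths, which is why the measure suits documents of different sizes.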
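Applying TF-IDF and cosine similarity
Two articles explain this very clearly, so I will not repeat them here and will simply link to them.
The rest of this post walks through the code.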
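Adding the Gradle dependencies
The project uses the WebMagic crawler framework, the Java port of Jieba for word segmentation, Lucene, and a few Apache libraries.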
compile group: 'us.codecraft', name: 'webmagic-core', version: '0.7.3'
// https://mvnrepository.com/artifact/us.codecraft/webmagic-extension
compile group: 'us.codecraft', name: 'webmagic-extension', version: '0.7.3'
// https://mvnrepository.com/artifact/com.huaban/jieba-analysis
compile group: 'com.huaban', name: 'jieba-analysis', version: '1.0.2'
compile group: 'commons-io', name: 'commons-io', version: '2.6'
compile group: 'org.apache.lucene', name: 'lucene-core', version: '3.6.0'
compile group: 'org.apache.lucene', name: 'lucene-queryparser', version: '3.6.0'
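Crawling the sample corpus and segmenting it
Testing the algorithm requires a large amount of text, so I use the WebMagic crawler framework to crawl app descriptions from the Huawei app market as the sample corpus.
For how to use WebMagic, see "WebMagic爬取應用市場應用信息" (Crawling app-market app information with WebMagic).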
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;
import us.codecraft.webmagic.selector.Selectable;
/**
* @author wzj
* @create 2018-07-17 22:06
**/
public class AppStoreProcessor implements PageProcessor
{
// Part 1: crawl configuration for the site, including encoding, crawl interval, retry count, etc.
private Site site = Site.me().setRetryTimes(3).setSleepTime(1000);
public void process(Page page)
{
// Get the app name
String name = page.getHtml().xpath("//p/span[@class='title']/text()").toString();
page.putField("appName",name );
String desc = page.getHtml().xpath("//div[@id='app_strdesc']/text()").toString();
page.putField("desc",desc );
if (page.getResultItems().get("appName") == null)
{
//skip this page
page.setSkip(true);
}
// Collect the other links on the page
Selectable links = page.getHtml().links();
page.addTargetRequests(links.regex("(http://app.hicloud.com/app/C\\d+)").all());
}
public Site getSite()
{
return site;
}
public static void main(String[] args)
{
Spider.create(new AppStoreProcessor())
.addUrl("http://app.hicloud.com")
.addPipeline(new MyPipeline())
.thread(20)
.run();
}
}
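A custom Pipeline saves the crawled app data. Because the descriptions are going to be segmented, the data has to be preprocessed first, which mainly involves:
- Removing Chinese special characters and punctuation with a regex: desc.replaceAll("[\\p{P}+~$`^=|<>~`$^+=|<>¥×]", "")
- Removing carriage returns, tabs, and similar control characters with a regex: desc.replaceAll("\\t|\\r|\\n","");
- Removing spaces with a regex: desc.replaceAll(" ","");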
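The three cleaning steps can be tried in isolation before wiring them into a crawler. The sketch below chains the same three replaceAll calls in a helper whose name is made up for this example. The yen and multiplication signs are written as Unicode escapes so that the file compiles under any source encoding, and the null check is my own addition.

```java
// Minimal sketch of the preprocessing steps; DescriptionCleaner is a hypothetical helper.
final class DescriptionCleaner {
    static String clean(String desc) {
        if (desc == null) {
            return ""; // assumption: treat a missing description as empty
        }
        return desc
                // Punctuation (\p{P}) plus listed symbols; \u00A5 is the yen sign, \u00D7 the multiplication sign
                .replaceAll("[\\p{P}+~$`^=|<>~`$^+=|<>\u00A5\u00D7]", "")
                // Tabs, carriage returns, and line feeds
                .replaceAll("\\t|\\r|\\n", "")
                // Plain spaces
                .replaceAll(" ", "");
    }

    public static void main(String[] args) {
        System.out.println(clean("Hello, world!\n")); // prints "Helloworld"
    }
}
```

Symbols such as + and = are listed explicitly because \p{P} covers only the Unicode punctuation categories, not the math or currency symbol categories.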
The cleaned text is then segmented with the Java port of jieba.
import com.huaban.analysis.jieba.JiebaSegmenter;
import org.apache.commons.io.IOUtils;
import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;
import java.io.FileWriter;
import java.io.IOException;
import java.nio.file.Paths;
import java.util.List;
/**
* @author wzj
* @create 2018-07-17 22:16
**/
public class MyPipeline implements Pipeline
{
/**
* Directory where the files are saved
*/
private static final String saveDir = "D:\\cache\\";
/**
* Java port of the jieba segmenter
*/
private JiebaSegmenter segmenter = new JiebaSegmenter();
/*
* Running count of saved apps
*/
private int count = 1;
/**
* Process extracted results.
*
* @param resultItems resultItems
* @param task task
*/
public void process(ResultItems resultItems, Task task)
{
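String appName = resultItems.get("appName");
String desc = resultItems.get("desc");
// Skip pages without a description to avoid a NullPointerException below
if (desc == null)
{
return;
}
// Remove punctuation
desc = desc.replaceAll("[\\p{P}+~$`^=|<>~`$^+=|<>¥×]", "");
// Remove tabs and line breaks
desc = desc.replaceAll("\\t|\\r|\\n","");
// Remove spaces
desc = desc.replaceAll(" ","");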
List<String> vecList = segmenter.sentenceProcess(desc);
StringBuilder stringBuilder = new StringBuilder();
for (String s : vecList)
{
stringBuilder.append(s + " ");
}
// Drop the trailing space
String writeContent = stringBuilder.toString();
if (writeContent.length() > 0)
{
writeContent = writeContent.substring(0,writeContent.length() - 1);
}
String appSavePath = Paths.get(saveDir, appName + ".txt").toString();
FileWriter fileWriter = null;
try
{
fileWriter = new FileWriter(appSavePath);
fileWriter.write(writeContent);
}
catch (IOException e)
{
e.printStackTrace();
}
finally
{
IOUtils.closeQuietly(fileWriter);
}
System.out.println(String.valueOf(count++) + " " + appName);
}
}
Building a Lucene index from the crawled text
The path of the text files and the path where the index is stored must be specified.
/**
* Add all documents to the Lucene index
* @throws IOException
*/
public void indexDocs() throws IOException
{
System.out.println("Number of files : " + docNumbers);
File[] listOfFiles = Paths.get(docPath).toFile().listFiles();
NIOFSDirectory dir = new NIOFSDirectory(new File(saveIndexPath));
IndexWriter indexWriter = new IndexWriter(dir,
new IndexWriterConfig(Version.LUCENE_36, new WhitespaceAnalyzer(Version.LUCENE_36)));
for (File file : listOfFiles)
{
// Read the file content and strip numbers (along with any whitespace that follows them)
String fileContent = fileReader(file);
fileContent = fileContent.replaceAll("\\d+(?:[.,]\\d+)*\\s*", "");
String docName = file.getName();
Document doc = new Document();
doc.add(new Field("docContent", new StringReader(fileContent), Field.TermVector.YES));
doc.add(new Field("docName", new StringReader(docName), Field.TermVector.YES));
indexWriter.addDocument(doc);
}
indexWriter.close();
System.out.println("Add document successful.");
}
Implementing TF-IDF
First, compute the TF-IDF of the documents already in the index.
/**
* Compute the tf-idf values of all documents
* @return map from document name to (term -> tf-idf weight)
* @throws IOException IOException
* @throws ParseException ParseException
*/
public HashMap<String, Map<String, Float>> getAllTFIDF() throws IOException, ParseException
{
HashMap<String, Map<String, Float>> scoreMap = new HashMap<String, Map<String, Float>>();
IndexReader re = IndexReader.open(NIOFSDirectory.open(new File(saveIndexPath)), true);
for (int k = 0; k < docNumbers; k++)
{
// tf-idf weights of a single document
Map<String, Float> wordMap = new HashMap<String, Float>();
// Term vectors for the current document's content and name
TermFreqVector termsFreq = re.getTermFreqVector(k, "docContent");
TermFreqVector termsFreqDocId = re.getTermFreqVector(k, "docName");
String docName = termsFreqDocId.getTerms()[0];
int[] freq = termsFreq.getTermFrequencies();
String[] terms = termsFreq.getTerms();
int noOfTerms = terms.length;
DefaultSimilarity simi = new DefaultSimilarity();
for (int i = 0; i < noOfTerms; i++)
{
int noOfDocsContainTerm = re.docFreq(new Term("docContent", terms[i]));
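// Note: DefaultSimilarity uses tf = sqrt(freq) and idf = 1 + ln(numDocs / (docFreq + 1)),
// which differs slightly from the textbook n/N and log(D/Dw)
float tf = simi.tf(freq[i]);
float idf = simi.idf(noOfDocsContainTerm, docNumbers);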
wordMap.put(terms[i], (tf * idf));
}
scoreMap.put(docName, wordMap);
}
// Release the index reader once all documents have been processed
re.close();
return scoreMap;
}
Next, take a piece of test text to look up in the existing corpus and compute its TF-IDF in the same way as above; that code is not reproduced here.
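The post omits the code for the query text. The sketch below shows one way it could look; it is not the project's actual code. It assumes the query has already been cleaned and segmented, and that the document frequency of each term has been looked up from the index (for instance with docFreq) and passed in as a plain map. The class, method, and parameter names are hypothetical. The weighting follows what Lucene 3.x DefaultSimilarity does, to my understanding: tf = sqrt(freq) and idf = 1 + ln(numDocs / (docFreq + 1)).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: tf-idf weights for an already segmented query text.
final class QueryTfIdf {
    /**
     * @param tokens    segmented words of the query text
     * @param docFreq   term -> number of indexed documents containing it (assumed to be supplied by the caller)
     * @param totalDocs number of documents in the corpus
     */
    static Map<String, Float> weigh(Iterable<String> tokens, Map<String, Integer> docFreq, int totalDocs) {
        // Count how often each term occurs in the query
        Map<String, Integer> counts = new HashMap<>();
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
        }
        Map<String, Float> weights = new HashMap<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            int df = docFreq.getOrDefault(entry.getKey(), 0);
            // Same formulas as the assumed DefaultSimilarity behaviour
            float tf = (float) Math.sqrt(entry.getValue());
            float idf = (float) (Math.log(totalDocs / (double) (df + 1)) + 1.0);
            weights.put(entry.getKey(), tf * idf);
        }
        return weights;
    }
}
```

Using the same weighting for the query and for the indexed documents keeps the two vectors comparable when they are fed into the cosine calculation.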
Finally, cosine similarity is used to find the most similar text.
/**
* Compute cosine similarity
* @param searchTextTfIdfMap vector of the query text
* @param allTfIdfMap vectors of all documents
* @return similarity between the query text and every document
*/
private static Map<String,Double> cosineSimilarity(Map<String, Float> searchTextTfIdfMap,HashMap<String, Map<String, Float>> allTfIdfMap)
{
// key: document name; value: its similarity to the query text
Map<String,Double> similarityMap = new HashMap<String,Double>();
// Sum of squares of the query vector (its squared length)
double searchValue = 0;
for (Map.Entry<String, Float> entry : searchTextTfIdfMap.entrySet())
{
searchValue += entry.getValue() * entry.getValue();
}
for (Map.Entry<String, Map<String, Float>> docEntry : allTfIdfMap.entrySet())
{
String docName = docEntry.getKey();
Map<String, Float> docScoreMap = docEntry.getValue();
double termValue = 0;
double acrossValue = 0;
for (Map.Entry<String, Float> termEntry : docScoreMap.entrySet())
{
if (searchTextTfIdfMap.get(termEntry.getKey()) != null)
{
acrossValue += termEntry.getValue() * searchTextTfIdfMap.get(termEntry.getKey());
}
termValue += termEntry.getValue() * termEntry.getValue();
}
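// Cosine = dot product / (|doc| * |query|); the sums above are squared lengths, so take square roots
double denominator = Math.sqrt(termValue) * Math.sqrt(searchValue);
similarityMap.put(docName, denominator == 0 ? 0.0 : acrossValue / denominator);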
}
return similarityMap;
}
In testing, the results were reasonably good: the approach finds the most similar text.
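To pick the closest matches out of a map from document name to similarity, the entries can be sorted by value in descending order. This is a minimal sketch with a made-up class name; it assumes the map holds no NaN values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper: the n documents with the highest similarity.
final class SimilarityRanking {
    static List<Map.Entry<String, Double>> top(Map<String, Double> similarityMap, int n) {
        List<Map.Entry<String, Double>> entries = new ArrayList<>(similarityMap.entrySet());
        // Highest similarity first
        entries.sort(Map.Entry.<String, Double>comparingByValue().reversed());
        return entries.subList(0, Math.min(n, entries.size()));
    }
}
```

Calling top(similarityMap, 5) would return the five best candidates, with the most similar document first.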
Source code
GitHub: https://github.com/HelloKittyNII/DocSimilarityAlgorithm