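Lucene Study Notes: Building an Index

Building an Index

2.2 Understanding the Indexing Process

Text is first extracted from the raw data and used to create a Document instance. That instance contains multiple Field instances, each of which holds part of the raw data. The analysis step that follows turns the field text into a stream of tokens, and finally the tokens are added to the segment structure.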

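2.2.1 Extracting Text and Creating Documents

The details of text extraction are covered in Chapter 7, together with the Tika framework.

2.2.2 Analyzing Documents

During indexing, Lucene first analyzes the text, splitting it into a stream of tokens. For Chinese, this mainly means word segmentation and stop-word removal. The result is a large number of tokens, which are then written to the index files.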
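The analysis step can be illustrated with a small standalone sketch. It is a toy model, not Lucene's Analyzer API: the class name ToyAnalyzer and its stop-word list are made up for this example, and it splits on whitespace, which works for space-delimited text but not for Chinese, where a real word segmenter would be needed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy stand-in for an analyzer; this is NOT Lucene's Analyzer API.
final class ToyAnalyzer {
    // Hypothetical stop-word list, chosen for this example only.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "of", "is"));

    // Splits on whitespace, lowercases each piece, and drops stop words.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            String token = raw.toLowerCase(Locale.ROOT);
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [index, set, segments]
        System.out.println(tokenize("The index is a set of segments"));
    }
}
```

Only the surviving tokens would go on to be written into the index.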

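2.2.3 Adding Documents to the Index

A Lucene index directory has exactly one kind of segment structure: the index segment.

Index segments: every Lucene index consists of one or more segments. Each segment is a standalone index that holds a subset of all the indexed documents. A new segment is created whenever the writer flushes its buffered added documents and pending deletions to the directory. At search time each segment is visited separately, and the results are combined before being returned.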
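The segment behaviour described above can be sketched with a toy in-memory model. Everything here is hypothetical (the class ToySegmentedIndex and its methods are not Lucene classes): each flush turns the buffered documents into a new, independent term-to-documents map, and a search visits every segment and concatenates the hits.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Toy model of a segmented index; not Lucene's implementation or file format.
final class ToySegmentedIndex {
    // Each flushed segment is an independent map from term to document ids.
    private final List<Map<String, List<Integer>>> segments = new ArrayList<>();
    private Map<String, List<Integer>> buffer = new HashMap<>();
    private int nextDocId = 0;

    // Buffers a document in memory; it is not searchable until flush().
    int addDocument(String text) {
        int docId = nextDocId++;
        Set<String> terms = new LinkedHashSet<>(
                Arrays.asList(text.toLowerCase(Locale.ROOT).trim().split("\\s+")));
        for (String term : terms) {
            if (!term.isEmpty()) {
                buffer.computeIfAbsent(term, k -> new ArrayList<>()).add(docId);
            }
        }
        return docId;
    }

    // Turns the buffered documents into a new segment.
    void flush() {
        if (!buffer.isEmpty()) {
            segments.add(buffer);
            buffer = new HashMap<>();
        }
    }

    int segmentCount() {
        return segments.size();
    }

    // Visits every segment separately and combines the hits.
    List<Integer> search(String term) {
        List<Integer> hits = new ArrayList<>();
        for (Map<String, List<Integer>> segment : segments) {
            List<Integer> postings = segment.get(term.toLowerCase(Locale.ROOT));
            if (postings != null) {
                hits.addAll(postings);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        ToySegmentedIndex index = new ToySegmentedIndex();
        index.addDocument("lucene index");
        index.flush();
        index.addDocument("lucene segment");
        index.flush();
        System.out.println(index.segmentCount() + " segments, hits for 'lucene': "
                + index.search("lucene"));
    }
}
```

In this model a document that has been added but not yet flushed is invisible to search, which matches the idea that segments only come into existence at flush time.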

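Each segment consists of multiple files named _X.<ext>, where X is the segment name and <ext> is an extension that identifies which part of the index the file holds. Together these separate files make up the different parts of the index (term vectors, stored fields, the inverted index, and so on). With the compound file format (Lucene's default, which can be changed through the IndexWriter.setUseCompoundFile method), all of these index files are combined into a single file, _X.cfs. This reduces the number of files that are open during searching.

There is also one special file, the segments file, named segments_<N>, which references all live segments. Lucene opens this file first and then opens the files it points to. The number <N> is incremented by one every time a change is committed to the index.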
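The relationship between commits and the segments_<N> file can be sketched as follows. This is a toy model with a hypothetical class name (ToyCommitLog), not Lucene code. It assumes the number is written in base 36, which is how Lucene is understood to encode it; treat that detail as an assumption rather than something stated in these notes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Toy model of the segments_<N> file; not Lucene's implementation.
final class ToyCommitLog {
    private long generation = 0;
    private List<String> liveSegments = Collections.emptyList();

    // Records a commit: bumps the generation by one, remembers which segments
    // are live, and returns the name of the segments file for this commit.
    String commit(List<String> segmentNames) {
        generation++;
        liveSegments = Collections.unmodifiableList(new ArrayList<>(segmentNames));
        // Assumption: the generation is encoded in base 36.
        return "segments_" + Long.toString(generation, Character.MAX_RADIX);
    }

    // The segments referenced by the most recent commit.
    List<String> liveSegments() {
        return liveSegments;
    }

    public static void main(String[] args) {
        ToyCommitLog log = new ToyCommitLog();
        System.out.println(log.commit(Arrays.asList("_0")));        // segments_1
        System.out.println(log.commit(Arrays.asList("_0", "_1")));  // segments_2
        System.out.println(log.liveSegments());                     // [_0, _1]
    }
}
```

A reader in this model would open the newest segments file and then only the segments it lists.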

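Over time an index accumulates many segments, especially when the program opens and closes the writer frequently. IndexWriter periodically selects some of these segments and merges them into a single new segment.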
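A merge can be pictured as combining the per-segment postings into one structure. The sketch below is a toy model under that assumption; ToySegmentMerger is a made-up name, and it ignores how Lucene chooses which segments to merge and how the files are actually written.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model of segment merging; not Lucene's merge policy or on-disk format.
final class ToySegmentMerger {
    // Each segment maps a term to the ids of the documents that contain it.
    static Map<String, List<Integer>> merge(List<Map<String, List<Integer>>> segments) {
        Map<String, List<Integer>> merged = new TreeMap<>();
        for (Map<String, List<Integer>> segment : segments) {
            for (Map.Entry<String, List<Integer>> entry : segment.entrySet()) {
                merged.computeIfAbsent(entry.getKey(), k -> new ArrayList<>())
                        .addAll(entry.getValue());
            }
        }
        // Keep each postings list in ascending document-id order.
        for (List<Integer> postings : merged.values()) {
            Collections.sort(postings);
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> first = new HashMap<>();
        first.put("lucene", Arrays.asList(0, 1));
        Map<String, List<Integer>> second = new HashMap<>();
        second.put("lucene", Arrays.asList(2));
        second.put("segment", Arrays.asList(2));
        // prints {lucene=[0, 1, 2], segment=[2]}
        System.out.println(merge(Arrays.asList(first, second)));
    }
}
```

After the merge, a search needs to consult one structure instead of two, which is the point of merging.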

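2.3 Basic Index Operations

2.3.1 Adding Documents to an Index

There are two methods for adding documents:

addDocument(Document)-----adds the document using the default analyzer, which is specified when the IndexWriter is created and is used for tokenization.

addDocument(Document, Analyzer)-----adds the document using the specified analyzer for tokenization.

The complete indexing code is as follows: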

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;

import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class LuceneIndex {
	public static void main(String[] args) throws Exception {
		// Directory where the Lucene index is stored
		File indexDir = new File("F:\\ntcr_index");
		// Directory that contains the files to index
		File dataDir = new File("F:\\NTCR_ChangeCodeToUTF");
		long start = System.currentTimeMillis();
		int numIndexed = index(indexDir, dataDir);
		long end = System.currentTimeMillis();
		System.out.println("Indexed " + numIndexed + " files in " + (end - start) + " ms.");
	}

	// Opens the index and starts the directory traversal.
	// Returns the number of documents added by this run.
	public static int index(File indexDir, File dataDir) throws IOException {
		if (!dataDir.exists() || !dataDir.isDirectory()) {
			throw new IOException(dataDir + " does not exist or is not a directory.");
		}
		/*
		 * IndexWriter usage for Lucene >= 3.2.0.
		 * The same Version constant is used for the analyzer and the config.
		 */
		WhitespaceAnalyzer analyzer = new WhitespaceAnalyzer(Version.LUCENE_34);

		Directory directory = FSDirectory.open(indexDir);
		IndexWriterConfig indexConfig = new IndexWriterConfig(
				Version.LUCENE_34, analyzer);
		indexConfig.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
		IndexWriter writer = new IndexWriter(directory, indexConfig);
		try {
			// With CREATE_OR_APPEND the index may already hold documents,
			// so count only what this run adds.
			int before = writer.numDocs();
			indexDirectory(writer, dataDir);
			int numIndexed = writer.numDocs() - before;
			System.out.println("Optimizing......................");
			System.out.println("Please wait...................");
			writer.optimize();
			return numIndexed;
		} finally {
			// Always release the write lock, even if indexing fails
			writer.close();
		}
	}

	// Recursive method: calls itself for each subdirectory and indexes .txt files
	private static void indexDirectory(IndexWriter writer, File dir)
			throws IOException {
		File[] files = dir.listFiles();
		if (files == null) {
			return; // not a directory, or an I/O error occurred
		}
		for (int i = 0; i < files.length; i++) {
			File f = files[i];
			if (f.isDirectory()) {
				indexDirectory(writer, f);
			} else if (f.getName().endsWith(".txt")) {
				indexFile(writer, f);
			}
		}
	}

	// Indexes a single file with Lucene
	private static void indexFile(IndexWriter writer, File f) throws IOException {
		if (f.isHidden() || !f.exists() || !f.canRead()) {
			return;
		}
		System.out.println("Indexing " + f.getCanonicalPath());

		// Note: only the first line of each file is read and indexed.
		String firstLine;
		BufferedReader reader = new BufferedReader(new FileReader(f));
		try {
			firstLine = reader.readLine();
		} finally {
			reader.close();
		}
		if (firstLine == null) {
			firstLine = ""; // empty file: Field does not accept a null value
		}

		// All three fields are stored, analyzed, and keep term vectors
		Document doc = new Document();
		doc.add(new Field("FilePath", f.getCanonicalPath(), Field.Store.YES,
				Field.Index.ANALYZED, Field.TermVector.YES));
		doc.add(new Field("FileName", f.getName(), Field.Store.YES,
				Field.Index.ANALYZED, Field.TermVector.YES));
		doc.add(new Field("textField", firstLine, Field.Store.YES,
				Field.Index.ANALYZED, Field.TermVector.YES));
		// Add the document to the Lucene index
		writer.addDocument(doc);
	}
}

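2.13.1 Deleting Documents with IndexReader

1) IndexReader can delete a document by its document number.

2) IndexReader can delete documents by Term, just as IndexWriter can, but IndexReader also returns the number of documents that were deleted.

3) If the program searches with the same reader, IndexReader's deletions take effect immediately. Deletions made through IndexWriter are not visible until the program opens a new reader.

4) IndexWriter can delete documents by Query, but IndexReader cannot.

5) IndexReader offers a method that is sometimes very useful, undeleteAll, which reverses all pending deletions in the index. It can only undelete documents whose segments have not yet been merged, because IndexWriter merely marks deleted documents as deleted; the actual removal happens when the segment containing the document is merged.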
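The mark-then-merge behaviour behind point 5 can be sketched with a toy segment. The class ToyDeletableSegment and its methods are hypothetical and only mimic the idea: delete sets a bit, undeleteAll clears all bits, and merge rewrites the segment without the marked documents, after which nothing is left to undelete.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Toy model of deletion marks in a segment; not Lucene's IndexReader.
final class ToyDeletableSegment {
    private List<String> docs = new ArrayList<>();
    private BitSet deleted = new BitSet();

    // Adds a document and returns its document number.
    int add(String doc) {
        docs.add(doc);
        return docs.size() - 1;
    }

    // Marks a document as deleted; its data stays in the segment.
    void delete(int docId) {
        if (docId < 0 || docId >= docs.size()) {
            throw new IndexOutOfBoundsException("no such document: " + docId);
        }
        deleted.set(docId);
    }

    // Clears every pending deletion mark.
    void undeleteAll() {
        deleted.clear();
    }

    // Number of live (not deleted) documents.
    int numDocs() {
        return docs.size() - deleted.cardinality();
    }

    // Number of document slots, including those marked as deleted.
    int maxDoc() {
        return docs.size();
    }

    // Rewrites the segment without the marked documents; they are gone for good.
    void merge() {
        List<String> live = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            if (!deleted.get(i)) {
                live.add(docs.get(i));
            }
        }
        docs = live;
        deleted = new BitSet();
    }

    public static void main(String[] args) {
        ToyDeletableSegment segment = new ToyDeletableSegment();
        segment.add("a");
        segment.add("b");
        segment.add("c");
        segment.delete(1);
        segment.undeleteAll();
        System.out.println(segment.numDocs()); // 3: the deletion was reversed
        segment.delete(1);
        segment.merge();
        segment.undeleteAll();
        System.out.println(segment.numDocs()); // 2: merged away, cannot be undeleted
    }
}
```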
