Reading notes on "Lucene in Action"

P27: The core indexing classes to understand: IndexWriter, Directory, Analyzer, Document, Field


P29: The core searching classes to understand: IndexSearcher, Term, Query, TermQuery, TopDocs


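P34: When you retrieve search results, however, you only get back the stored fields. For example, a field that was indexed but not stored will not appear in the retrieved document. This is a frequent source of confusion.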
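The indexed-versus-stored distinction can be sketched with a toy in-memory index. This is not Lucene's API: `StoredFieldDemo`, `addField`, `search`, and `document` are hypothetical names, and whitespace splitting stands in for a real analyzer. It only illustrates why a field can match a query and still be missing from the returned document.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy model (not Lucene's API) of indexed versus stored fields. */
final class StoredFieldDemo {
    // "field:term" -> ids of documents containing it (built from indexed fields only)
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    // document id -> field values kept for retrieval (stored fields only)
    private final List<Map<String, String>> stored = new ArrayList<>();

    int newDocument() {
        stored.add(new HashMap<String, String>());
        return stored.size() - 1;
    }

    void addField(int docId, String name, String value, boolean index, boolean store) {
        if (index) {
            // Whitespace splitting is an assumption standing in for analysis.
            for (String term : value.toLowerCase().split("\\s+")) {
                postings.computeIfAbsent(name + ":" + term, k -> new TreeSet<>()).add(docId);
            }
        }
        if (store) {
            stored.get(docId).put(name, value);
        }
    }

    Set<Integer> search(String field, String term) {
        Set<Integer> hits = postings.get(field + ":" + term.toLowerCase());
        return hits == null ? Collections.<Integer>emptySet() : hits;
    }

    /** Returns only what was stored; indexed-only fields are absent. */
    Map<String, String> document(int docId) {
        return stored.get(docId);
    }
}
```

A document whose body is added with `index = true, store = false` is found by `search("body", ...)`, yet `document(id).get("body")` returns null.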

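P42: Unfortunately, although updating a document in place is a frequently requested feature, Lucene cannot do it. Instead, it deletes the entire previous document and then adds a new one. This means the new document must contain all of the fields, including the ones that did not change from the original.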
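A minimal sketch of that delete-then-add behaviour, using a toy map-backed store rather than Lucene itself. `ToyIndex`, `add`, `update`, and `get` are hypothetical names chosen for illustration; the point is only that a field left out of the replacement document is gone afterwards.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model (not Lucene's API): documents keyed by an id. */
final class ToyIndex {
    private final Map<String, Map<String, String>> docs = new HashMap<>();

    void add(String id, Map<String, String> fields) {
        docs.put(id, new HashMap<>(fields));
    }

    /** "Update" here means: delete the whole old document, then add the new one. */
    void update(String id, Map<String, String> newFields) {
        docs.remove(id);     // the old document is dropped entirely
        add(id, newFields);  // only the fields supplied now survive
    }

    Map<String, String> get(String id) {
        return docs.get(id);
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        Map<String, String> original = new HashMap<>();
        original.put("title", "Old title");
        original.put("body", "Unchanged body text");
        index.add("1", original);

        Map<String, String> replacement = new HashMap<>();
        replacement.put("title", "New title"); // body was not re-supplied
        index.update("1", replacement);

        System.out.println(index.get("1")); // the body field is no longer present
    }
}
```

To keep the body, the caller has to copy it into the replacement document before calling `update`.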

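P45: Store.NO: the value is not stored. This option is often used together with Index.ANALYZED for large text fields whose content does not need to be retrieved, such as web page bodies or other long text.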

P47: Field summary table, including how each kind of field is created and example usages
Index                  Store  TermVector              Example usage
NOT_ANALYZED_NO_NORMS  YES    NO                      Identifiers (filenames, primary keys), telephone and Social Security
                                                      numbers, URLs, personal names, dates, and textual fields for sorting
ANALYZED               YES    WITH_POSITIONS_OFFSETS  Document title, document abstract
ANALYZED               NO     WITH_POSITIONS_OFFSETS  Document body
NO                     YES    NO                      Document type, database primary key
                                                      (if not used for searching)
NOT_ANALYZED           NO     NO                      Hidden keywords


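P51: Note: if you turn norms off partway through indexing, you must rebuild the entire index. If even a single document has the field indexed with norms, segment merging will make every document consume one byte for that field, even the documents that had norms turned off. This happens because Lucene does not store norms sparsely.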

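P56: While an index is being optimized, Lucene needs extra disk space, up to three times the original index size. After optimization completes, the index is usually smaller than it was before.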

P58: MMapDirectory: a Directory that accesses files through memory-mapped I/O. It is a good choice on a 64-bit JRE, or on a 32-bit JRE when the index is relatively small.

However, on machines with enough memory, most operating systems use free RAM as an I/O cache. This means that, after warming up, FSDirectory will be nearly as fast as RAMDirectory.
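The memory-mapped access mentioned for MMapDirectory can be illustrated with the JDK's own `FileChannel.map`. This sketch uses only `java.nio`, not Lucene; `MmapReadDemo`, `readViaMmap`, and `roundTrip` are hypothetical names. One relevant JDK detail: a single mapping cannot exceed `Integer.MAX_VALUE` bytes, and every mapping consumes virtual address space, which is scarce in a 32-bit process.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Reads a file through a read-only memory mapping (plain JDK, not Lucene). */
final class MmapReadDemo {
    /** Maps the whole file read-only and copies its bytes out of the mapping. */
    static byte[] readViaMmap(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer buffer =
                    channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            byte[] out = new byte[buffer.remaining()];
            buffer.get(out);
            return out;
        }
    }

    /** Writes text to a temporary file, reads it back via the mapping, then cleans up. */
    static String roundTrip(String text) {
        try {
            Path tmp = Files.createTempFile("mmap-demo", ".bin");
            try {
                Files.write(tmp, text.getBytes(StandardCharsets.UTF_8));
                return new String(readViaMmap(tmp), StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("hello index"));
    }
}
```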


P67
Lucene uses a simple approach to record deleted documents in the index: the document is marked as deleted in a bit array, which is a quick operation, but the data corresponding to that document still consumes disk space in the index. This technique is necessary because in an inverted index, a given document’s terms are scattered all over the place, and it’d be impractical to try to reclaim that space when the document is deleted. It’s not until segments are merged, either by normal merging over time or by an explicit call to optimize, that these bytes are reclaimed.
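The mark-then-reclaim scheme described above can be modelled with a `java.util.BitSet`. This is a toy segment, not Lucene's on-disk format: `ToySegment` and `merge` are hypothetical, and `maxDoc`/`numDocs` are named after the Lucene reader methods only by analogy. Deleting flips one bit; the document's data stays in place until a merge copies the live documents into a fresh segment.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Toy segment (not Lucene's format): deletions are bits, space is reclaimed on merge. */
final class ToySegment {
    private final List<String> docs = new ArrayList<>();
    private final BitSet deleted = new BitSet();

    int add(String doc) {
        docs.add(doc);
        return docs.size() - 1;
    }

    /** Marks the document as deleted; its data is left untouched. */
    void delete(int docId) {
        deleted.set(docId);
    }

    /** Number of document slots still occupying space, deleted or not. */
    int maxDoc() {
        return docs.size();
    }

    /** Number of live (not deleted) documents. */
    int numDocs() {
        return docs.size() - deleted.cardinality();
    }

    /** Merging copies only live documents, which is when the space is reclaimed. */
    static ToySegment merge(ToySegment... segments) {
        ToySegment merged = new ToySegment();
        for (ToySegment segment : segments) {
            for (int i = 0; i < segment.docs.size(); i++) {
                if (!segment.deleted.get(i)) {
                    merged.add(segment.docs.get(i));
                }
            }
        }
        return merged;
    }
}
```

After a delete, `maxDoc()` is unchanged while `numDocs()` drops by one; only the segment returned by `merge` has a smaller `maxDoc()`.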


P76
Lucene’s primary searching API
IndexSearcher           Gateway to searching an index. All searches come through an IndexSearcher instance using any of the several overloaded search methods.
Query (and subclasses)  Concrete subclasses encapsulate logic for a particular query type. Instances of Query are passed to an IndexSearcher’s search method.
QueryParser             Processes a human-entered (and readable) expression into a concrete Query object.
TopDocs                 Holds the top scoring documents, returned by IndexSearcher.search.
ScoreDoc                Provides access to each search result in TopDocs.


P147
The only built-in analyzer capable of doing anything useful with Asian text is the StandardAnalyzer, which recognizes some ranges of the Unicode space as CJK characters and tokenizes them individually.
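To make "tokenizes them individually" concrete, here is a toy tokenizer that emits each CJK character as its own token and groups other letters and digits into lowercase words. It is not StandardAnalyzer's grammar: `CjkAwareTokenizer` is a hypothetical class, and the choice of scripts treated as CJK (Han, Hiragana, Katakana, Hangul) is an assumption of this sketch, not the exact ranges Lucene recognizes.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy tokenizer (not StandardAnalyzer): one token per CJK character. */
final class CjkAwareTokenizer {
    /** Assumption of this sketch: these four scripts count as CJK. */
    static boolean isCjk(int codePoint) {
        Character.UnicodeScript script = Character.UnicodeScript.of(codePoint);
        return script == Character.UnicodeScript.HAN
                || script == Character.UnicodeScript.HIRAGANA
                || script == Character.UnicodeScript.KATAKANA
                || script == Character.UnicodeScript.HANGUL;
    }

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (isCjk(cp)) {
                flush(word, tokens);                           // end any pending word
                tokens.add(new String(Character.toChars(cp))); // the character is a token by itself
            } else if (Character.isLetterOrDigit(cp)) {
                word.appendCodePoint(Character.toLowerCase(cp));
            } else {
                flush(word, tokens);                           // whitespace or punctuation ends a word
            }
        }
        flush(word, tokens);
        return tokens;
    }

    private static void flush(StringBuilder word, List<String> tokens) {
        if (word.length() > 0) {
            tokens.add(word.toString());
            word.setLength(0);
        }
    }
}
```

With this scheme a two-character Chinese word becomes two single-character tokens, so a search for the word only works if the query side is tokenized the same way.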


P346
A well-tuned Lucene application is like a well-maintained car: it will operate for years without problems, requiring only a small, informed investment on your part. You can take pride in that!


P347
Ask yourself, honestly (use a mirror, if necessary): would your time be better spent improving the user interface or tuning relevance? You can always improve performance by simply rolling out more or faster hardware, so always consider that option first.

P348
Simple performance-tuning steps:
Upgrade to the latest release of Lucene.
Upgrade to the latest release of Java; then try tuning the JVM’s performance settings.
Run your JVM with the -server switch.
Use a local file system for your index.
Run a Java profiler, or collect your own rough timing using System.nanoTime, to verify your performance problem is in fact Lucene and not elsewhere in your application stack.
Do not reopen IndexWriter or IndexReader/IndexSearcher any more frequently than required.
Use multiple threads.
Use faster hardware.
Put as much physical memory as you can in your computers, and configure Lucene to take advantage of all of it.
Budget enough memory, CPU, and file descriptors for your peak usage.
Turn off any fields or features that your application isn’t using. Be ruthless!
Group multiple text fields into a single text field.
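For the "rough timing using System.nanoTime" step in the list above, a small helper like the following may be enough to tell whether the time is going into the search call or somewhere else in the stack. `RoughTimer` and `meanNanos` are hypothetical names, and the timed task in `main` is a stand-in for whatever call is under suspicion. The first runs include JIT warm-up, so treat the numbers as indicative only.

```java
import java.util.concurrent.TimeUnit;

/** Rough wall-clock timing with System.nanoTime (a sketch, not a benchmark harness). */
final class RoughTimer {
    /** Runs the task the given number of times and returns the mean elapsed nanoseconds. */
    static long meanNanos(Runnable task, int runs) {
        if (runs <= 0) {
            throw new IllegalArgumentException("runs must be positive");
        }
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            task.run();
        }
        return (System.nanoTime() - start) / runs;
    }

    public static void main(String[] args) {
        // Stand-in workload; replace with the search or indexing call being investigated.
        Runnable suspect = () -> {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 10_000; i++) {
                sb.append(i);
            }
            if (sb.length() == 0) {
                throw new AssertionError("unreachable; uses the result of the loop");
            }
        };
        long nanos = meanNanos(suspect, 100);
        System.out.println("mean: " + TimeUnit.NANOSECONDS.toMicros(nanos) + " microseconds per run");
    }
}
```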

P356
If you’re not on Windows, use NIOFSDirectory, which has better concurrency, instead of FSDirectory. If you’re running with a 64-bit JVM, try MMapDirectory as well.

P356
Be sure you’re using enough threads to fully utilize the computer’s hardware. Increase the thread count until throughput no longer improves, but don’t add so many threads that latency gets worse. There’s a sweet spot—find it!
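One way to look for that sweet spot is to run the same batch of work with different pool sizes and compare throughput. The sketch below uses only `java.util.concurrent`; `ThreadSweep` and `throughput` are hypothetical names, and the CPU-bound loop in `main` is a placeholder for real search or indexing requests. It measures throughput only; latency would need per-task timing on top.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Measures tasks per second for a given thread count (a sketch, not a benchmark harness). */
final class ThreadSweep {
    /** Runs `tasks` copies of `work` on a fixed pool of `threads` threads. */
    static double throughput(int threads, int tasks, Runnable work) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        long start = System.nanoTime();
        try {
            List<Future<?>> futures = new ArrayList<>();
            for (int i = 0; i < tasks; i++) {
                futures.add(pool.submit(work));
            }
            for (Future<?> future : futures) {
                future.get(); // wait for every task to finish
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        return tasks / seconds;
    }

    public static void main(String[] args) {
        // Placeholder workload; substitute a real query or indexing call.
        Runnable work = () -> {
            long sum = 0;
            for (int i = 0; i < 2_000_000; i++) {
                sum += i % 7;
            }
            if (sum < 0) {
                throw new AssertionError("unreachable; uses the result of the loop");
            }
        };
        int max = 2 * Runtime.getRuntime().availableProcessors();
        for (int threads = 1; threads <= max; threads *= 2) {
            System.out.printf("%d threads: %.0f tasks/s%n", threads, throughput(threads, 200, work));
        }
    }
}
```

The thread count where the printed rate stops rising is a reasonable starting point for further tuning.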

P357
Therefore, it’s critical to use threads for indexing and searching. Otherwise, you’re simply not fully utilizing the computer. It’s like buying a Corvette and driving it no faster than 20 mph!

P358
Lucene is thread-safe: sharing IndexSearcher, IndexReader, IndexWriter, and so forth across multiple threads is perfectly fine.

P365
Figure 11.3 shows the disk usage over time while indexing all documents from Wikipedia, finishing with an optimize call. The final disk usage was 14.2 GB, but the peak disk usage was 32.4 GB, which was reached while several large concurrent merges were running.


P367
A good rule of thumb is to measure the total size of your index. Let’s call that X. Then make sure that, at all times, you have two times X (2X) of free disk space on the file system where the index is stored.
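The rule of thumb is simple arithmetic and can be checked with `java.io.File`. In this sketch `DiskHeadroom` and its methods are hypothetical names, and `directorySize` assumes the index files sit directly inside one directory, without recursing into subdirectories.

```java
import java.io.File;

/** Checks the "2X free space" rule of thumb for an index directory (a sketch). */
final class DiskHeadroom {
    /** Free space the rule asks for: two times the index size X. */
    static long requiredFreeBytes(long indexBytes) {
        return Math.multiplyExact(2L, indexBytes);
    }

    static boolean hasHeadroom(long indexBytes, long freeBytes) {
        return freeBytes >= requiredFreeBytes(indexBytes);
    }

    /** Sums the sizes of the regular files directly inside dir (0 if it cannot be listed). */
    static long directorySize(File dir) {
        File[] files = dir.listFiles();
        long total = 0;
        if (files != null) {
            for (File file : files) {
                if (file.isFile()) {
                    total += file.length();
                }
            }
        }
        return total;
    }

    public static void main(String[] args) {
        File indexDir = new File(args.length > 0 ? args[0] : ".");
        long x = directorySize(indexDir);
        long free = indexDir.getUsableSpace();
        System.out.println("index bytes=" + x + ", free bytes=" + free
                + ", enough headroom=" + hasHeadroom(x, free));
    }
}
```

For scale, the P365 figures give a peak of 32.4 GB against a final size of 14.2 GB, roughly 2.3 times the final index.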

P373
In a production server environment, you should set both of these sizes [the JVM's initial and maximum heap sizes, -Xms and -Xmx] to the same value, so the JVM doesn’t spend time growing and shrinking the heap.

P373
On Unix, run vmstat 1 to print virtual memory statistics, once per second.
Using top on Unix, check the Mem: line.
For example, on Linux there is a kernel parameter called swappiness; setting it to 0 forces the OS to never swap out RAM for I/O cache.


P374
During indexing, one big usage of RAM is the buffer used by IndexWriter, which you can control with setRAMBufferSizeMB.