The version I am using is Lucene-Suggest-4.7.jar.
While building an auto-complete module similar to Baidu's search suggestions, I ran into two problems: appending to an existing index, and updating the weight of indexed entries. This post focuses on solving those two problems. You have probably already found the basics of Lucene's Suggest package online, but here is a quick recap.
Building an index with the suggest package is quite different from building one with Lucene's IndexWriter. Here we need roughly three classes: an entity class, an iterator over the entities, and a class that performs the actual operations. The entity class needs little explanation; the code is as follows:
public class Suggester implements Serializable {

    private static final long serialVersionUID = 1L;

    String term;
    int times;

    /**
     * @param term  the suggestion term
     * @param times the term frequency
     */
    public Suggester(String term, int times) {
        this.term = term;
        this.times = times;
    }

    public Suggester() {
        super();
    }

    /**
     * @return the term
     */
    public String getTerm() {
        return term;
    }

    /**
     * @param term the term to set
     */
    public void setTerm(String term) {
        this.term = term;
    }

    /**
     * @return the times
     */
    public int getTimes() {
        return times;
    }

    /**
     * @param times the times to set
     */
    public void setTimes(int times) {
        this.times = times;
    }

    /* (non-Javadoc)
     * @see java.lang.Object#toString()
     */
    @Override
    public String toString() {
        return term + " " + times;
    }

    /* (non-Javadoc)
     * @see java.lang.Object#hashCode()
     */
    @Override
    public int hashCode() {
        final int prime = 31;
        int result = 1;
        result = prime * result + ((term == null) ? 0 : term.hashCode());
        return result;
    }

    /*
     * Compares by term only.
     * @see java.lang.Object#equals(java.lang.Object)
     */
    @Override
    public boolean equals(Object obj) {
        if (this == obj)
            return true;
        if (obj == null)
            return false;
        if (getClass() != obj.getClass())
            return false;
        Suggester other = (Suggester) obj;
        if (term == null) {
            if (other.term != null)
                return false;
        } else if (!term.equals(other.term))
            return false;
        return true;
    }
}
The operations class just calls methods on the package; to see why the iterator class is needed, a look at the source makes it clear:
This is the index-building method of AnalyzingInfixSuggester, and its parameter is an InputIterator, i.e. an iterator over our entity class. Here is that iterator class:
public class SuggesterIterator implements InputIterator {

    /** Iterator over the underlying collection. */
    private final Iterator<Suggester> suggesterIterator;

    /** The Suggester currently being visited. */
    private Suggester currentSuggester;

    /**
     * Constructor.
     * @param suggesterIterator iterator over the suggestions to index
     */
    public SuggesterIterator(Iterator<Suggester> suggesterIterator) {
        this.suggesterIterator = suggesterIterator;
    }

    /*
     * Advances to the next entry.
     * @see org.apache.lucene.util.BytesRefIterator#next()
     */
    @Override
    public BytesRef next() throws IOException {
        if (suggesterIterator.hasNext()) {
            currentSuggester = suggesterIterator.next();
            String term = currentSuggester.getTerm();
            try {
                return new BytesRef(term.getBytes("UTF-8"));
            } catch (UnsupportedEncodingException e) {
                e.printStackTrace();
            }
        }
        // Return null on error or when iteration is finished.
        return null;
    }

    /*
     * Whether payload data is present.
     * @see org.apache.lucene.search.suggest.InputIterator#hasPayloads()
     */
    @Override
    public boolean hasPayloads() {
        return true;
    }

    /*
     * The payload stores any extra data you need to retrieve later;
     * here we store the term frequency.
     * @see org.apache.lucene.search.suggest.InputIterator#payload()
     */
    @Override
    public BytesRef payload() {
        /* If hasPayloads() returned false, this code would never be called. */
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            DataOutputStream dos = new DataOutputStream(bos);
            dos.writeInt(currentSuggester.getTimes());
            dos.close();
            return new BytesRef(bos.toByteArray());
        } catch (IOException e) {
            e.printStackTrace();
        }
        return null;
    }

    /*
     * The weight used for ranking.
     * @see org.apache.lucene.search.suggest.InputIterator#weight()
     */
    @Override
    public long weight() {
        // Use the term frequency as the weight.
        return currentSuggester.getTimes();
    }

    /*
     * @see org.apache.lucene.util.BytesRefIterator#getComparator()
     */
    @Override
    public Comparator<BytesRef> getComparator() {
        return null;
    }
}
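Since the payload is just an int written with DataOutputStream on the index side and read back with DataInputStream at lookup time, the round trip can be checked without Lucene at all. A minimal stdlib-only sketch (the class name PayloadRoundTrip and its helper methods are mine, not part of the suggest API); note that the decoder takes an offset and length, because a Lucene BytesRef may reference a slice of a shared buffer:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class PayloadRoundTrip {

    // Index side: encode the frequency exactly as SuggesterIterator.payload() does.
    static byte[] encode(int times) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            DataOutputStream dos = new DataOutputStream(bos);
            dos.writeInt(times);
            dos.close();
            return bos.toByteArray();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    // Lookup side: decode, honoring the offset/length window of a BytesRef-like slice.
    static int decode(byte[] bytes, int offset, int length) {
        try {
            DataInputStream dis = new DataInputStream(
                    new ByteArrayInputStream(bytes, offset, length));
            int times = dis.readInt();
            dis.close();
            return times;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        byte[] payload = encode(42);
        System.out.println(decode(payload, 0, payload.length)); // prints 42
    }
}
```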
With these in place we can call the suggest package's build method to create the index:
/**
 * Builds the suggestion index.
 * @param list the data set to index
 * @param indexPath path to the index directory
 * @return build time in seconds
 */
public double create(List<Suggester> list, String indexPath) {
    // Elapsed time in milliseconds.
    long time = 0L;
    // The suggester that manages index creation.
    AnalyzingInfixSuggester suggester = null;
    try {
        suggester = new AnalyzingInfixSuggester(Version.LUCENE_47, new File(indexPath), analyzer);
        logger.debug("Starting to build the auto-complete index");
        long begin = System.currentTimeMillis();
        // Build the index.
        suggester.build(new SuggesterIterator(list.iterator()));
        long end = System.currentTimeMillis();
        time = end - begin;
        logger.debug("Auto-complete index built. Elapsed: " + time + "ms");
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        // Close the suggester (guard against a failed constructor).
        if (suggester != null) {
            suggester.close();
        }
    }
    return time / 1000.0;
}
The main test code:
List<Suggester> list = new ArrayList<Suggester>();
list.add(new Suggester("張三", 1));
list.add(new Suggester("李四", 2));
double time = suggestService.create(list, "file/autoComplete/project/template/index");
System.out.println(time + " s");
After running the code above, the index holds two Documents, 張三 and 李四, and the resulting index structure looks like this:
See the difference from IndexWriter? Note that the index above was built with a whitespace analyzer. If the exact layout of the index files interests you, dig into it yourself.
The lookup part
Not much commentary needed; the code below speaks for itself:
/**
 * Auto-complete lookup.
 * @param region the query string
 * @param indexPath path to the index directory
 * @return the matching suggestions
 */
public List<Suggester> lookup(String region, String indexPath) {
    // Result list to return.
    List<Suggester> reList = new ArrayList<Suggester>();
    // Index directory.
    File indexFile = new File(indexPath);
    // The suggester that manages the index.
    AnalyzingInfixSuggester suggester = null;
    // Raw lookup results.
    List<LookupResult> results = null;
    try {
        suggester = new AnalyzingInfixSuggester(Version.LUCENE_47, indexFile, analyzer);
        /*
         * Run the lookup:
         * region - the query keyword
         * TOPS - maximum number of results to return
         * allTermsRequired - MUST (true) vs SHOULD (false) between terms
         * doHighlight - whether to highlight matches
         */
        results = suggester.lookup(region, TOPS, true, true);
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        if (suggester != null) {
            suggester.close();
        }
    }
    if (results == null) {
        return reList;
    }
    /*
     * Walk the results.
     */
    System.out.println("Query: " + region);
    for (LookupResult result : results) {
        String str = (String) result.highlightKey;
        Integer time = null;
        try {
            // Decode the term frequency stored in the payload.
            // Use the BytesRef's offset and length: its bytes array
            // may be a shared buffer, not an exact-size copy.
            BytesRef bytesRef = result.payload;
            DataInputStream dis = new DataInputStream(
                    new ByteArrayInputStream(bytesRef.bytes, bytesRef.offset, bytesRef.length));
            time = dis.readInt();
            dis.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
        reList.add(new Suggester(str, time));
    }
    /*
     * Drop the entry that is identical to the query itself.
     */
    for (int i = 0; i < reList.size(); i++) {
        Suggester sug = reList.get(i);
        // Strip the highlight tags before comparing.
        if (sug.getTerm().replaceAll("<[^>]*>", "").equals(region)) {
            reList.remove(sug);
            break;
        }
    }
    return reList;
}
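The final loop above strips the highlighter's markup (e.g. <b>…</b>) before comparing a suggestion to the raw query. That regex is easy to exercise on its own; HighlightStrip is just an illustrative wrapper of mine, not part of the service:

```java
public class HighlightStrip {

    // Remove any HTML-style tag the suggester's highlighter inserted.
    static String strip(String highlighted) {
        return highlighted.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        System.out.println(strip("<b>張</b>三")); // prints 張三
    }
}
```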
Once the index is built and lookups work, the questions arise: what if I want to append new entries to the index? What if I want to update existing ones? After plenty of searching you will find that the Suggest package offers no such methods... so the rest of this post focuses on solving these two problems.
Reading the source, you can see that build internally uses an IndexWriter as well, configured via a getIndexWriterConfig method.
Inside getIndexWriterConfig the open mode is hard-coded to OpenMode.CREATE, so the index can only ever be created from scratch, never opened for appending.
The fix is to extend AnalyzingInfixSuggester and override getIndexWriterConfig, i.e. roll our own AnalyzingInfixSuggester.
The code:
public class MyAnalyzingInfixSuggester extends AnalyzingInfixSuggester {

    /** Index open mode (create or append). */
    private final OpenMode mode;

    ......

    /*
     * Overloaded constructor; initializes the relevant fields.
     * @param matchVersion Lucene version
     * @param indexPath index directory
     * @param analyzer the analyzer
     * @param mode index open mode (create or append)
     * @throws IOException
     */
    public MyAnalyzingInfixSuggester(Version matchVersion, File indexPath, Analyzer analyzer, OpenMode mode) throws IOException {
        // Call the parent constructor.
        super(matchVersion, indexPath, analyzer, analyzer, DEFAULT_MIN_PREFIX_CHARS);
        this.mode = mode;
        ......
    }

    /*
     * Overrides the method that supplies the IndexWriterConfig,
     * making the open mode configurable (create or append).
     * @see org.apache.lucene.search.suggest.analyzing.AnalyzingInfixSuggester#getIndexWriterConfig(org.apache.lucene.util.Version, org.apache.lucene.analysis.Analyzer)
     */
    @Override
    protected IndexWriterConfig getIndexWriterConfig(Version matchVersion, Analyzer indexAnalyzer) {
        IndexWriterConfig iwc = new IndexWriterConfig(matchVersion, indexAnalyzer);
        iwc.setCodec(new Lucene46Codec());
        if (indexAnalyzer instanceof AnalyzerWrapper) {
            // A wrapped gram analyzer means we are writing the tmp index,
            // which is always created fresh.
            iwc.setOpenMode(OpenMode.CREATE);
        } else {
            iwc.setOpenMode(mode);
        }
        return iwc;
    }

    ......
}
This way we can pass the desired open mode when constructing MyAnalyzingInfixSuggester, which makes appending possible. But this alone is not enough. Suggest has its own ranking scheme: it sorts the entries by weight once, at build time, and at lookup time simply returns documents in stored order. So if the index already contains 張三 and 李四 and you APPEND 王五, a search for 王 will show you 李四 instead. Frustrating, right? The fix is to drop the build-time sort and sort at query time instead.
Below is the build-time sorting code from the original source. Override build in MyAnalyzingInfixSuggester
and simply delete that code.
Then override lookup: delete the code shown below and add a sort instead (the comments in the source explain this as well):
After these changes everything falls into place, and the APPEND problem is fully solved.
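The effect of the query-time sort can be illustrated without Lucene: order the candidates by descending weight (here, term frequency), just as the SortField("weight", SortField.Type.LONG, true) passed to searcher.search() does. WeightSort below is a hypothetical stand-in, not part of the suggester:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class WeightSort {

    // term -> weight; returns the terms ordered by descending weight,
    // mirroring new Sort(new SortField("weight", SortField.Type.LONG, true)).
    static List<String> byWeightDesc(final Map<String, Long> weights) {
        List<String> terms = new ArrayList<String>(weights.keySet());
        Collections.sort(terms, new Comparator<String>() {
            public int compare(String a, String b) {
                return weights.get(b).compareTo(weights.get(a));
            }
        });
        return terms;
    }

    public static void main(String[] args) {
        Map<String, Long> w = new LinkedHashMap<String, Long>();
        w.put("張三", 1L);
        w.put("李四", 2L);
        w.put("王五", 3L); // appended last, but has the highest weight
        System.out.println(byWeightDesc(w)); // prints [王五, 李四, 張三]
    }
}
```

With the build-time sort removed, this query-time ordering is what keeps a freshly appended, high-weight entry from being shadowed by older documents.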
The second problem, updating an entry, is much simpler: call IndexWriter's delete method to remove the matching Document, then wrap the updated object in a list and pass it to create (in APPEND mode) to rebuild it:
Directory fsDir = FSDirectory.open(new File(indexPath));
IndexWriter indexWriter = new IndexWriter(fsDir, new IndexWriterConfig(ManageIndexService.LUCENE_VERSION, analyzer));
// Delete the matching entry.
indexWriter.deleteDocuments(new Term(MyAnalyzingInfixSuggester.TEXT_FIELD_NAME, sug.getTerm()));
// Expunge the deletes for good.
indexWriter.forceMergeDeletes();
// Commit and close the IndexWriter.
indexWriter.commit();
indexWriter.close();
logger.debug("Old entry deleted: " + sug.getTerm());
List<Suggester> list = new ArrayList<Suggester>();
list.add(sug);
// Append the updated entry to the index.
this.create(list, indexPath, OpenMode.APPEND);
The complete MyAnalyzingInfixSuggester code follows.
import java.io.File;
import java.io.IOException;
import java.io.StringReader;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import org.apache.log4j.Logger;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.AnalyzerWrapper;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ngram.EdgeNGramTokenFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.codecs.lucene46.Lucene46Codec;
import org.apache.lucene.document.BinaryDocValuesField;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.index.AtomicReader;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.index.MultiDocValues;
import org.apache.lucene.index.SlowCompositeReaderWrapper;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.suggest.InputIterator;
import org.apache.lucene.search.suggest.analyzing.AnalyzingInfixSuggester;
import org.apache.lucene.store.Directory;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.IOUtils;
import org.apache.lucene.util.Version;
public class MyAnalyzingInfixSuggester extends AnalyzingInfixSuggester {

    /** Logger. */
    private final Logger logger = Logger.getLogger(MyAnalyzingInfixSuggester.class);

    /** Field name used for the indexed text. */
    public static final String TEXT_FIELD_NAME = "text";

    /** Default minimum number of leading characters before
     *  PrefixQuery is used (4). */
    public static final int DEFAULT_MIN_PREFIX_CHARS = 4;

    private final File indexPath;
    final int minPrefixChars;
    final Version matchVersion;
    private final Directory dir;

    /** Index open mode (create or append). */
    private final OpenMode mode;

    /*
     * Overloaded constructor; initializes the relevant fields.
     * @param matchVersion Lucene version
     * @param indexPath index directory
     * @param analyzer the analyzer
     * @param mode index open mode (create or append)
     * @throws IOException
     */
    public MyAnalyzingInfixSuggester(Version matchVersion, File indexPath, Analyzer analyzer, OpenMode mode) throws IOException {
        // Call the parent constructor.
        super(matchVersion, indexPath, analyzer, analyzer, DEFAULT_MIN_PREFIX_CHARS);
        this.mode = mode;
        this.indexPath = indexPath;
        this.minPrefixChars = DEFAULT_MIN_PREFIX_CHARS;
        this.matchVersion = matchVersion;
        dir = getDirectory(indexPath);
    }

    /*
     * Overrides the method that supplies the IndexWriterConfig,
     * making the open mode configurable (create or append).
     * @see org.apache.lucene.search.suggest.analyzing.AnalyzingInfixSuggester#getIndexWriterConfig(org.apache.lucene.util.Version, org.apache.lucene.analysis.Analyzer)
     */
    @Override
    protected IndexWriterConfig getIndexWriterConfig(Version matchVersion, Analyzer indexAnalyzer) {
        IndexWriterConfig iwc = new IndexWriterConfig(matchVersion, indexAnalyzer);
        iwc.setCodec(new Lucene46Codec());
        if (indexAnalyzer instanceof AnalyzerWrapper) {
            // A wrapped gram analyzer means we are writing the tmp index,
            // which is always created fresh.
            iwc.setOpenMode(OpenMode.CREATE);
        } else {
            iwc.setOpenMode(mode);
        }
        return iwc;
    }

    /*
     * Overrides build, dropping the sort-at-build-time step.
     * @see org.apache.lucene.search.suggest.analyzing.AnalyzingInfixSuggester#build(org.apache.lucene.search.suggest.InputIterator)
     */
    @Override
    public void build(InputIterator iter) throws IOException {
        if (searcher != null) {
            searcher.getIndexReader().close();
            searcher = null;
        }
        Directory dirTmp = getDirectory(new File(indexPath.toString() + ".tmp"));
        IndexWriter w = null;
        IndexWriter w2 = null;
        AtomicReader r = null;
        boolean success = false;
        try {
            Analyzer gramAnalyzer = new AnalyzerWrapper(Analyzer.PER_FIELD_REUSE_STRATEGY) {
                @Override
                protected Analyzer getWrappedAnalyzer(String fieldName) {
                    return indexAnalyzer;
                }

                @Override
                protected TokenStreamComponents wrapComponents(String fieldName, TokenStreamComponents components) {
                    if (fieldName.equals("textgrams") && minPrefixChars > 0) {
                        return new TokenStreamComponents(components.getTokenizer(),
                                new EdgeNGramTokenFilter(matchVersion, components.getTokenStream(), 1, minPrefixChars));
                    } else {
                        return components;
                    }
                }
            };
            w = new IndexWriter(dirTmp, getIndexWriterConfig(matchVersion, gramAnalyzer));
            BytesRef text;
            Document doc = new Document();
            FieldType ft = getTextFieldType();
            Field textField = new Field(TEXT_FIELD_NAME, "", ft);
            doc.add(textField);
            Field textGramField = new Field("textgrams", "", ft);
            doc.add(textGramField);
            Field textDVField = new BinaryDocValuesField(TEXT_FIELD_NAME, new BytesRef());
            doc.add(textDVField);
            Field weightField = new NumericDocValuesField("weight", 0);
            doc.add(weightField);
            Field payloadField;
            if (iter.hasPayloads()) {
                payloadField = new BinaryDocValuesField("payloads", new BytesRef());
                doc.add(payloadField);
            } else {
                payloadField = null;
            }
            long t0 = System.nanoTime();
            while ((text = iter.next()) != null) {
                String textString = text.utf8ToString();
                textField.setStringValue(textString);
                textGramField.setStringValue(textString);
                textDVField.setBytesValue(text);
                weightField.setLongValue(iter.weight());
                if (iter.hasPayloads()) {
                    payloadField.setBytesValue(iter.payload());
                }
                w.addDocument(doc);
            }
            logger.debug("initial indexing time: " + ((System.nanoTime() - t0) / 1000000) + " msec");
            r = SlowCompositeReaderWrapper.wrap(DirectoryReader.open(w, false));
            w.rollback();
            w2 = new IndexWriter(dir, getIndexWriterConfig(matchVersion, indexAnalyzer));
            w2.addIndexes(new IndexReader[] { r });
            r.close();
            searcher = new IndexSearcher(DirectoryReader.open(w2, false));
            w2.close();
            payloadsDV = MultiDocValues.getBinaryValues(searcher.getIndexReader(), "payloads");
            weightsDV = MultiDocValues.getNumericValues(searcher.getIndexReader(), "weight");
            textDV = MultiDocValues.getBinaryValues(searcher.getIndexReader(), TEXT_FIELD_NAME);
            assert textDV != null;
            success = true;
        } finally {
            if (success) {
                IOUtils.close(w, w2, r, dirTmp);
            } else {
                IOUtils.closeWhileHandlingException(w, w2, r, dirTmp);
            }
        }
    }

    /*
     * Overrides lookup, sorting the hits by weight at query time.
     * @see org.apache.lucene.search.suggest.analyzing.AnalyzingInfixSuggester#lookup(java.lang.CharSequence, int, boolean, boolean)
     */
    @Override
    public List<LookupResult> lookup(CharSequence key, int num, boolean allTermsRequired, boolean doHighlight) {
        if (searcher == null) {
            throw new IllegalStateException("suggester was not built");
        }
        final BooleanClause.Occur occur;
        if (allTermsRequired) {
            occur = BooleanClause.Occur.MUST;
        } else {
            occur = BooleanClause.Occur.SHOULD;
        }
        TokenStream ts = null;
        try {
            ts = queryAnalyzer.tokenStream("", new StringReader(key.toString()));
            ts.reset();
            final CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
            final OffsetAttribute offsetAtt = ts.addAttribute(OffsetAttribute.class);
            String lastToken = null;
            BooleanQuery query = new BooleanQuery();
            int maxEndOffset = -1;
            final Set<String> matchedTokens = new HashSet<String>();
            while (ts.incrementToken()) {
                if (lastToken != null) {
                    matchedTokens.add(lastToken);
                    query.add(new TermQuery(new Term(TEXT_FIELD_NAME, lastToken)), occur);
                }
                lastToken = termAtt.toString();
                if (lastToken != null) {
                    maxEndOffset = Math.max(maxEndOffset, offsetAtt.endOffset());
                }
            }
            ts.end();
            String prefixToken = null;
            if (lastToken != null) {
                Query lastQuery;
                if (maxEndOffset == offsetAtt.endOffset()) {
                    // Use PrefixQuery (or the ngram equivalent) when
                    // there was no trailing discarded chars in the
                    // string (e.g. whitespace), so that if query does
                    // not end with a space we show prefix matches for
                    // that token:
                    lastQuery = getLastTokenQuery(lastToken);
                    prefixToken = lastToken;
                } else {
                    // Use TermQuery for an exact match if there were
                    // trailing discarded chars (e.g. whitespace), so
                    // that if query ends with a space we only show
                    // exact matches for that term:
                    matchedTokens.add(lastToken);
                    lastQuery = new TermQuery(new Term(TEXT_FIELD_NAME, lastToken));
                }
                if (lastQuery != null) {
                    query.add(lastQuery, occur);
                }
            }
            ts.close();
            Query finalQuery = finishQuery(query, allTermsRequired);
            // The new sort: order hits by weight, descending, at query time.
            Sort sort = new Sort(new SortField("weight", SortField.Type.LONG, true));
            TopDocs hits = searcher.search(finalQuery, num, sort);
            List<LookupResult> results = createResults(hits, num, key, doHighlight, matchedTokens, prefixToken);
            return results;
        } catch (IOException ioe) {
            throw new RuntimeException(ioe);
        } finally {
            IOUtils.closeWhileHandlingException(ts);
        }
    }
}
Next time, if I find the time, I may write about span queries, near-synonyms and the like; I already have a complete demo, but since that material is easy to find online there is no hurry. For any questions about this article or Lucene in general, feel free to add QQ 692790242 to discuss.
All rights reserved. Please credit the source when reposting! By MRC