A solution for appending to and updating the index of Lucene's autocomplete Suggest module

The version I am using is lucene-suggest-4.7.jar.
While building a Baidu-style autocomplete feature I ran into two problems: appending entries to an existing suggest index, and updating the weight of indexed entries. This post solves both. The basic usage of Lucene's Suggest package is easy to find online, so here is only a quick recap:
Building an index with the Suggest package differs substantially from building one with Lucene's IndexWriter. Roughly three classes are involved: an entity class, an iterator over the entities, and an operations class. The entity class needs little explanation; the code is as follows:

public class Suggester implements Serializable {
    private static final long serialVersionUID = 1L;
    String term;
    int times;
    /**
     * @param term  the suggestion term
     * @param times  the term frequency
     */
    public Suggester(String term, int times) {
        this.term = term;
        this.times = times;
    }
    public Suggester() {
        super();
    }
    /**
     * @return the term
     */
    public String getTerm() {
        return term;
    }
    /**
     * @param term the term to set
     */
    public void setTerm(String term) {
        this.term = term;
    }
    /**
     * @return the times
     */
    public int getTimes() {
        return times;
    }
    /**
     * @param times the times to set
     */
    public void setTimes(int times) {
        this.times = times;
    }
    /* (non-Javadoc)
     * @see java.lang.Object#toString()
     */
    @Override
    public String toString() {
        return term + " " + times;
    }
    /* (non-Javadoc)
     * @see java.lang.Object#hashCode()
     */
    @Override
    public int hashCode() {
        final int prime = 31;
        int result = 1;
        result = prime * result + ((term == null) ? 0 : term.hashCode());
        return result;
    }
    /*
     * Compares by term only
     * @see java.lang.Object#equals(java.lang.Object)
     */
    @Override
    public boolean equals(Object obj) {
        if (this == obj)
            return true;
        if (obj == null)
            return false;
        if (getClass() != obj.getClass())
            return false;
        Suggester other = (Suggester) obj;
        if (term == null) {
            if (other.term != null)
                return false;
        } else if (!term.equals(other.term))
            return false;
        return true;
    }
}
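Since equals() and hashCode() look only at term, collections of Suggester objects deduplicate by term regardless of frequency, which matters when assembling the input list. A minimal, self-contained check — the class name TermEqualityDemo is illustrative, and the entity is trimmed to the relevant members:

```java
import java.util.HashSet;
import java.util.Set;

public class TermEqualityDemo {
    // Trimmed sketch of the post's Suggester: equality and hashing are by term only.
    static class Suggester {
        final String term;
        final int times;
        Suggester(String term, int times) { this.term = term; this.times = times; }
        @Override public int hashCode() {
            return 31 + (term == null ? 0 : term.hashCode());
        }
        @Override public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof Suggester)) return false;
            Suggester other = (Suggester) o;
            return term == null ? other.term == null : term.equals(other.term);
        }
    }

    public static void main(String[] args) {
        Set<Suggester> set = new HashSet<Suggester>();
        set.add(new Suggester("張三", 1));
        set.add(new Suggester("張三", 99)); // same term, different frequency
        // The second add is rejected because the terms are equal:
        System.out.println(set.size()); // prints 1
    }
}
```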
The operations class simply calls into the Suggest API. To see why the iterator class is needed, look at the index-building method of AnalyzingInfixSuggester (the original post showed the source here as a screenshot): its signature is `public void build(InputIterator iter) throws IOException`, so it expects an InputIterator — which is exactly what the entity iterator class implements. Here it is:


public class SuggesterIterator implements InputIterator {
    /** Iterator over the collection */
    private final Iterator<Suggester> suggesterIterator;
    /** The Suggester currently being visited */
    private Suggester currentSuggester;
    /**
     * Constructor
     * @param suggesterIterator iterator over the entities to index
     */
    public SuggesterIterator(Iterator<Suggester> suggesterIterator) {
        this.suggesterIterator = suggesterIterator;
    }
    /*
     * Advance to the next entry
     * @see org.apache.lucene.util.BytesRefIterator#next()
     */
    @Override
    public BytesRef next() throws IOException {
        if (suggesterIterator.hasNext()) {
            currentSuggester = suggesterIterator.next();
            // BytesRef(CharSequence) encodes the term as UTF-8 directly,
            // avoiding the getBytes("UTF8") round-trip and its checked exception.
            return new BytesRef(currentSuggester.getTerm());
        }
        // Return null once the iteration is exhausted
        return null;
    }
    /*
     * Whether payload data is attached to each entry
     * @see org.apache.lucene.search.suggest.InputIterator#hasPayloads()
     */
    @Override
    public boolean hasPayloads() {
        return true;
    }
    /*
     * The payload carries arbitrary data to be read back at lookup time;
     * here it stores the term frequency.
     * @see org.apache.lucene.search.suggest.InputIterator#payload()
     */
    @Override
    public BytesRef payload() {
        /** If hasPayloads() returns false, this code is never called */
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            DataOutputStream dos = new DataOutputStream(bos);
            dos.writeInt(currentSuggester.getTimes());
            dos.close();
            return new BytesRef(bos.toByteArray());
        } catch (IOException e) {
            e.printStackTrace();
        }
        return null;
    }
    /*
     * The weight used for ranking suggestions
     * @see org.apache.lucene.search.suggest.InputIterator#weight()
     */
    @Override
    public long weight() {
        // Use the term frequency as the weight
        return currentSuggester.getTimes();
    }
    /*  
     * @see org.apache.lucene.util.BytesRefIterator#getComparator()
     */
    @Override
    public Comparator<BytesRef> getComparator() {
        return null;
    }
}
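The payload round-trip used above (writeInt at build time, readInt back at lookup time) can be exercised on its own. A self-contained sketch — the class name PayloadCodec is mine, not from the post:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class PayloadCodec {
    // Encode a term frequency the same way SuggesterIterator.payload() does.
    static byte[] encode(int times) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            DataOutputStream dos = new DataOutputStream(bos);
            dos.writeInt(times); // 4 bytes, big-endian
            dos.close();
            return bos.toByteArray();
        } catch (IOException e) {
            throw new RuntimeException(e); // cannot happen with in-memory streams
        }
    }

    // Decode, mirroring the payload handling on the lookup() side.
    static int decode(byte[] payload) {
        try {
            DataInputStream dis = new DataInputStream(new ByteArrayInputStream(payload));
            int times = dis.readInt();
            dis.close();
            return times;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // A frequency survives the round trip unchanged:
        System.out.println(decode(encode(42))); // prints 42
    }
}
```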



 With these in place, the index can be built by calling the Suggest package's build method:
   /**
     * Build the index.
     * @param list the data set to index
     * @param indexPath the index directory
     * @return build time in seconds
     */
    public double create(List<Suggester> list, String indexPath) {
        // Elapsed time in milliseconds
        long time = 0L;
        // The suggester that manages index creation
        AnalyzingInfixSuggester suggester = null;
        try {
            suggester = new AnalyzingInfixSuggester(Version.LUCENE_47, new File(indexPath), analyzer);
            logger.debug("Building the autocomplete index...");
            long begin = System.currentTimeMillis();
            // Build the index
            suggester.build(new SuggesterIterator(list.iterator()));
            time = System.currentTimeMillis() - begin;
            logger.debug("Autocomplete index built in " + time + "ms");
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            // Close the suggester; guard against a failed constructor
            if (suggester != null) {
                suggester.close();
            }
        }
        return time / 1000.0;
    }


 The main test code (note that create returns seconds, not milliseconds):
        List<Suggester> list = new ArrayList<Suggester>();
        list.add(new Suggester("張三", 1));
        list.add(new Suggester("李四", 2));
        double time = suggestService.create(list, "file/autoComplete/project/template/index");
        System.out.println(time + " s");

After running this code the index contains two Documents, 張三 and 李四. (The original post showed a screenshot of the resulting index files here.) Note the difference from an IndexWriter-built index, and note that the index above was built with a whitespace analyzer. If the on-disk file structure interests you, dig into it yourself.


The query side needs little commentary; the code follows:
    /**
     * Autocomplete lookup.
     * @param region the query string
     * @param indexPath the index directory
     * @return the result list
     */
    public List<Suggester> lookup(String region, String indexPath) {
        // Result list
        List<Suggester> reList = new ArrayList<Suggester>();
        // Index directory
        File indexFile = new File(indexPath);
        // The suggester that manages the index
        AnalyzingInfixSuggester suggester = null;
        // Raw lookup results
        List<LookupResult> results = null;
        try {
            suggester = new AnalyzingInfixSuggester(Version.LUCENE_47, indexFile, analyzer);
            /*
             * Lookup parameters:
             *     region - the query keyword
             *     TOPS - maximum number of results to return
             *     allTermsRequired - MUST vs. SHOULD semantics for the query terms
             *     doHighlight - whether to highlight matches
             */
            results = suggester.lookup(region, TOPS, true, true);
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            // Guard against a failed constructor
            if (suggester != null) {
                suggester.close();
            }
        }
        if (results == null) {
            return reList;
        }
        /*
         * Walk the results
         */
        System.out.println("Query: " + region);
        for (LookupResult result : results) {
            String str = (String) result.highlightKey;
            Integer time = null;
            try {
                // Decode the term frequency stored in the payload.
                // Respect the BytesRef offset/length instead of assuming
                // the backing array starts at index 0.
                BytesRef bytesRef = result.payload;
                DataInputStream dis = new DataInputStream(
                        new ByteArrayInputStream(bytesRef.bytes, bytesRef.offset, bytesRef.length));
                time = dis.readInt();
                dis.close();
            } catch (Exception e) {
                e.printStackTrace();
            }
            reList.add(new Suggester(str, time));
        }
        /*
         * Drop the entry that is identical to the query itself
         */
        for (int i = 0; i < reList.size(); i++) {
            Suggester sug = reList.get(i);
            // Strip highlight tags before comparing
            if (sug.getTerm().replaceAll("<[^>]*>", "").equals(region)) {
                reList.remove(i);
                break;
            }
        }
        return reList;
    }
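The self-hit filter at the end depends on stripping highlight markup before the comparison: with doHighlight set to true, AnalyzingInfixSuggester wraps matched fragments in <b>...</b> by default, so the returned term comes back decorated. The stripping step in isolation — HighlightStripper is an illustrative name:

```java
public class HighlightStripper {
    // Remove any markup tags (e.g. the <b>...</b> highlighting) so the raw
    // term can be compared against the user's query string.
    static String strip(String highlighted) {
        return highlighted.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        System.out.println(strip("<b>張三</b>")); // prints 張三
        System.out.println(strip("plain text"));  // unchanged: prints plain text
    }
}
```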

With the index built and queries working, the real questions arrive: what if I want to append new entries to the index? What if I want to update (modify) existing entries? After plenty of searching you will find that Suggest provides no such methods. The rest of this post focuses on solving these two problems.

Reading the source shows that build itself uses an IndexWriter internally, and that there is a getIndexWriterConfig method (the original post showed the relevant source as a screenshot).

In getIndexWriterConfig the index open mode is hard-coded to OpenMode.CREATE, so the index can only ever be created from scratch, never appended to.
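For reference, since the screenshot is gone: as best I recall the Lucene 4.7 source, the method looks approximately like this — treat it as a sketch, not an exact quotation:

```java
// Approximate reconstruction of AnalyzingInfixSuggester.getIndexWriterConfig
// in Lucene 4.7 -- note the hard-coded OpenMode.CREATE:
protected IndexWriterConfig getIndexWriterConfig(Version matchVersion, Analyzer indexAnalyzer) {
    IndexWriterConfig iwc = new IndexWriterConfig(matchVersion, indexAnalyzer);
    iwc.setCodec(new Lucene46Codec());
    iwc.setOpenMode(IndexWriterConfig.OpenMode.CREATE); // always rebuild from scratch
    return iwc;
}
```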



The solution is to extend AnalyzingInfixSuggester and override getIndexWriterConfig, making our own MyAnalyzingInfixSuggester.
The code is as follows:

public class MyAnalyzingInfixSuggester extends AnalyzingInfixSuggester {

    /** Index open mode (create or append) */
    private final OpenMode mode;

......

    /*
     * Overloaded constructor; initializes the relevant fields
     * @param matchVersion  Lucene version
     * @param indexPath index directory
     * @param analyzer analyzer
     * @param mode index open mode (create or append)
     * @throws IOException 
     */
    public MyAnalyzingInfixSuggester(Version matchVersion, File indexPath, Analyzer analyzer, OpenMode mode) throws IOException {
        // Delegate to the superclass constructor
        super(matchVersion, indexPath, analyzer, analyzer, DEFAULT_MIN_PREFIX_CHARS);
        this.mode = mode;
.....
    }


 /*
     * Override getIndexWriterConfig so the open mode (create or append) is configurable
     * @see org.apache.lucene.search.suggest.analyzing.AnalyzingInfixSuggester#getIndexWriterConfig(org.apache.lucene.util.Version, org.apache.lucene.analysis.Analyzer)
     */
    @Override
    protected IndexWriterConfig getIndexWriterConfig(Version matchVersion, Analyzer indexAnalyzer) {
        IndexWriterConfig iwc = new IndexWriterConfig(matchVersion, indexAnalyzer);
        iwc.setCodec(new Lucene46Codec());
        if (indexAnalyzer instanceof AnalyzerWrapper) {
            // The AnalyzerWrapper is only passed in for the temporary (.tmp) index, which is always created fresh
            iwc.setOpenMode(OpenMode.CREATE);
        } else {
            iwc.setOpenMode(mode);
        }
        return iwc;
    }
......
}

Now an open mode of our choosing can be passed when constructing MyAnalyzingInfixSuggester, which makes appending possible. But this change alone is not enough. Suggest has its own ranking scheme: entries are sorted by weight when the index is built, and lookups rely on that build-time order. So if the index already contains 張三 and 李四 and you APPEND 王五, a search for 王 will return 李四 instead. Frustrating, right? The fix is to remove the build-time sorting step and sort at query time instead:
Below is the build-time sorting code from the source (shown as a screenshot in the original post); override build in MyAnalyzingInfixSuggester and simply delete it.


Also override lookup: remove the original result-ordering code and add an explicit sort on the weight field. (The comments in the Lucene source explain the original behaviour; the original post showed the relevant lines as screenshots.)



With these changes in place, everything works: the APPEND problem is fully solved.
The second problem, updating the index, is now easy: call IndexWriter's deleteDocuments to remove the matching Document, then wrap the updated objects in a list and pass it to create to rebuild in APPEND mode:
            Directory fsDir = FSDirectory.open(new File(indexPath));
            IndexWriter indexWriter = new IndexWriter(fsDir, new IndexWriterConfig(ManageIndexService.LUCENE_VERSION, analyzer));
            // Delete the matching entry
            indexWriter.deleteDocuments(new Term(MyAnalyzingInfixSuggester.TEXT_FIELD_NAME, sug.getTerm()));
            // Physically purge the deleted documents
            indexWriter.forceMergeDeletes();
            // Commit and close the IndexWriter
            indexWriter.commit();
            indexWriter.close();
            logger.debug("Old entry deleted: " + sug.getTerm());

            List<Suggester> list = new ArrayList<Suggester>();
            list.add(sug);
            // Re-index the updated entry in APPEND mode
            this.create(list, indexPath, OpenMode.APPEND);

The complete MyAnalyzingInfixSuggester code follows.


import java.io.File;
import java.io.IOException;
import java.io.StringReader;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.log4j.Logger;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.AnalyzerWrapper;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ngram.EdgeNGramTokenFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.codecs.lucene46.Lucene46Codec;
import org.apache.lucene.document.BinaryDocValuesField;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.index.AtomicReader;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.index.MultiDocValues;
import org.apache.lucene.index.SlowCompositeReaderWrapper;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.suggest.InputIterator;
import org.apache.lucene.search.suggest.analyzing.AnalyzingInfixSuggester;
import org.apache.lucene.store.Directory;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.IOUtils;
import org.apache.lucene.util.Version;


public class MyAnalyzingInfixSuggester extends AnalyzingInfixSuggester {
    /** Logger */
    private final Logger logger = Logger.getLogger(MyAnalyzingInfixSuggester.class);

    /** Field name used for the indexed text. */
    public static final String TEXT_FIELD_NAME = "text";

    /** Default minimum number of leading characters before
     *  PrefixQuery is used (4). */
    public static final int DEFAULT_MIN_PREFIX_CHARS = 4;
    private final File indexPath;
    final int minPrefixChars;
    final Version matchVersion;
    private final Directory dir;
    /** Index open mode (create or append) */
    private final OpenMode mode;

    /*
     * Overloaded constructor; initializes the relevant fields
     * @param matchVersion  Lucene version
     * @param indexPath index directory
     * @param analyzer analyzer
     * @param mode index open mode (create or append)
     * @throws IOException 
     */
    public MyAnalyzingInfixSuggester(Version matchVersion, File indexPath, Analyzer analyzer, OpenMode mode) throws IOException {
        // Delegate to the superclass constructor
        super(matchVersion, indexPath, analyzer, analyzer, DEFAULT_MIN_PREFIX_CHARS);
        this.mode = mode;
        this.indexPath = indexPath;
        this.minPrefixChars = DEFAULT_MIN_PREFIX_CHARS;
        this.matchVersion = matchVersion;
        dir = getDirectory(indexPath);
    }

    /*
     * Override getIndexWriterConfig so the open mode (create or append) is configurable
     * @see org.apache.lucene.search.suggest.analyzing.AnalyzingInfixSuggester#getIndexWriterConfig(org.apache.lucene.util.Version, org.apache.lucene.analysis.Analyzer)
     */
    @Override
    protected IndexWriterConfig getIndexWriterConfig(Version matchVersion, Analyzer indexAnalyzer) {
        IndexWriterConfig iwc = new IndexWriterConfig(matchVersion, indexAnalyzer);
        iwc.setCodec(new Lucene46Codec());
        if (indexAnalyzer instanceof AnalyzerWrapper) {
            // The AnalyzerWrapper is only passed in for the temporary (.tmp) index, which is always created fresh
            iwc.setOpenMode(OpenMode.CREATE);
        } else {
            iwc.setOpenMode(mode);
        }
        return iwc;
    }

    /*
     * Override build, dropping the build-time sort of the input
     * @see org.apache.lucene.search.suggest.analyzing.AnalyzingInfixSuggester#build(org.apache.lucene.search.suggest.InputIterator)
     */
    @Override
    public void build(InputIterator iter) throws IOException {
        if (searcher != null) {
            searcher.getIndexReader().close();
            searcher = null;
        }
        Directory dirTmp = getDirectory(new File(indexPath.toString() + ".tmp"));
        IndexWriter w = null;
        IndexWriter w2 = null;
        AtomicReader r = null;
        boolean success = false;
        try {
            Analyzer gramAnalyzer = new AnalyzerWrapper(Analyzer.PER_FIELD_REUSE_STRATEGY) {
                @Override
                protected Analyzer getWrappedAnalyzer(String fieldName) {
                    return indexAnalyzer;
                }

                @Override
                protected TokenStreamComponents wrapComponents(String fieldName, TokenStreamComponents components) {
                    if (fieldName.equals("textgrams") && minPrefixChars > 0) {
                        return new TokenStreamComponents(components.getTokenizer(), new EdgeNGramTokenFilter(matchVersion, components.getTokenStream(), 1, minPrefixChars));
                    } else {
                        return components;
                    }
                }
            };
            w = new IndexWriter(dirTmp, getIndexWriterConfig(matchVersion, gramAnalyzer));
            BytesRef text;
            Document doc = new Document();
            FieldType ft = getTextFieldType();
            Field textField = new Field(TEXT_FIELD_NAME, "", ft);
            doc.add(textField);

            Field textGramField = new Field("textgrams", "", ft);
            doc.add(textGramField);

            Field textDVField = new BinaryDocValuesField(TEXT_FIELD_NAME, new BytesRef());
            doc.add(textDVField);

            Field weightField = new NumericDocValuesField("weight", 0);
            doc.add(weightField);

            Field payloadField;
            if (iter.hasPayloads()) {
                payloadField = new BinaryDocValuesField("payloads", new BytesRef());
                doc.add(payloadField);
            } else {
                payloadField = null;
            }
            long t0 = System.nanoTime();
            while ((text = iter.next()) != null) {
                String textString = text.utf8ToString();
                textField.setStringValue(textString);
                textGramField.setStringValue(textString);
                textDVField.setBytesValue(text);
                weightField.setLongValue(iter.weight());
                if (iter.hasPayloads()) {
                    payloadField.setBytesValue(iter.payload());
                }
                w.addDocument(doc);
            }
            logger.debug("initial indexing time: " + ((System.nanoTime() - t0) / 1000000) + " msec");

            r = SlowCompositeReaderWrapper.wrap(DirectoryReader.open(w, false));
            w.rollback();

            w2 = new IndexWriter(dir, getIndexWriterConfig(matchVersion, indexAnalyzer));
            w2.addIndexes(new IndexReader[] { r });
            r.close();

            searcher = new IndexSearcher(DirectoryReader.open(w2, false));
            w2.close();

            payloadsDV = MultiDocValues.getBinaryValues(searcher.getIndexReader(), "payloads");
            weightsDV = MultiDocValues.getNumericValues(searcher.getIndexReader(), "weight");
            textDV = MultiDocValues.getBinaryValues(searcher.getIndexReader(), TEXT_FIELD_NAME);
            assert textDV != null;
            success = true;
        } finally {
            if (success) {
                IOUtils.close(w, w2, r, dirTmp);
            } else {
                IOUtils.closeWhileHandlingException(w, w2, r, dirTmp);
            }
        }
    }

    /*
     * Override lookup, sorting results by weight at query time instead
     * @see org.apache.lucene.search.suggest.analyzing.AnalyzingInfixSuggester#lookup(java.lang.CharSequence, int, boolean, boolean)
     */
    @Override
    public List<LookupResult> lookup(CharSequence key, int num, boolean allTermsRequired, boolean doHighlight) {

        if (searcher == null) {
            throw new IllegalStateException("suggester was not built");
        }

        final BooleanClause.Occur occur;
        if (allTermsRequired) {
            occur = BooleanClause.Occur.MUST;
        } else {
            occur = BooleanClause.Occur.SHOULD;
        }

        TokenStream ts = null;
        try {
            ts = queryAnalyzer.tokenStream("", new StringReader(key.toString()));
            ts.reset();
            final CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
            final OffsetAttribute offsetAtt = ts.addAttribute(OffsetAttribute.class);
            String lastToken = null;
            BooleanQuery query = new BooleanQuery();
            int maxEndOffset = -1;
            final Set<String> matchedTokens = new HashSet<String>();
            while (ts.incrementToken()) {
                if (lastToken != null) {
                    matchedTokens.add(lastToken);
                    query.add(new TermQuery(new Term(TEXT_FIELD_NAME, lastToken)), occur);
                }
                lastToken = termAtt.toString();
                if (lastToken != null) {
                    maxEndOffset = Math.max(maxEndOffset, offsetAtt.endOffset());
                }
            }
            ts.end();

            String prefixToken = null;
            if (lastToken != null) {
                Query lastQuery;
                if (maxEndOffset == offsetAtt.endOffset()) {
                    // Use PrefixQuery (or the ngram equivalent) when
                    // there was no trailing discarded chars in the
                    // string (e.g. whitespace), so that if query does
                    // not end with a space we show prefix matches for
                    // that token:
                    lastQuery = getLastTokenQuery(lastToken);
                    prefixToken = lastToken;
                } else {
                    // Use TermQuery for an exact match if there were
                    // trailing discarded chars (e.g. whitespace), so
                    // that if query ends with a space we only show
                    // exact matches for that term:
                    matchedTokens.add(lastToken);
                    lastQuery = new TermQuery(new Term(TEXT_FIELD_NAME, lastToken));
                }
                if (lastQuery != null) {
                    query.add(lastQuery, occur);
                }
            }
            ts.close();

            Query finalQuery = finishQuery(query, allTermsRequired);

            // Sort results by weight, descending
            Sort sort = new Sort(new SortField("weight", SortField.Type.LONG, true));
            TopDocs hits = searcher.search(finalQuery, num, sort);

            List<LookupResult> results = createResults(hits, num, key, doHighlight, matchedTokens, prefixToken);
            return results;
        } catch (IOException ioe) {
            throw new RuntimeException(ioe);
        } finally {
            IOUtils.closeWhileHandlingException(ts);
        }
    }

}

Next time, if I find the time, I may write about span queries, near-synonyms and the like; I already have a complete demo, but since that material is a search away online, there is no hurry. For questions about this article or about Lucene in general, add QQ 692790242 and let's discuss. Welcome!


All rights reserved. Please credit the source when reposting! By MRC
