【Neo4j】Pitfall Report: Chinese Full-Text Indexing in Neo4j

I am running the current latest Neo4j, 3.1.0, and have hit pitfalls everywhere. This post explains how to get a Chinese full-text index working in Neo4j 3.1.0, using IKAnalyzer as the tokenizer.


1. A reference article first:

https://segmentfault.com/a/1190000005665612

It gives a rough outline of indexing with IKAnalyzer, but it is not very clear. Crucially, the article assumes an embedded Neo4j, i.e. Neo4j running inside your Java application (https://neo4j.com/docs/java-reference/current/#tutorials-java-embedded) — keep that in mind, because otherwise a custom Analyzer cannot be used at all. On top of that, the method described there no longer works: Neo4j 3.1.0 ships with Lucene 5.5, so the official IKAnalyzer is incompatible with it.
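The snippets later in this post reference a shared `Neo4j.graphDb` handle. A minimal bootstrap for such a holder class in an embedded setup might look like the sketch below — the class layout, store path, and shutdown hook are my assumptions for illustration, not taken from the referenced article:

```java
import java.io.File;

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;

public class Neo4j {
    // shared embedded database handle that the later snippets assume
    public static GraphDatabaseService graphDb;

    public static void start() {
        // store path is an illustrative assumption
        graphDb = new GraphDatabaseFactory()
                .newEmbeddedDatabase(new File("data/graph.db"));
        // shut the database down cleanly when the JVM exits
        Runtime.getRuntime().addShutdownHook(new Thread(() -> graphDb.shutdown()));
    }
}
```

Because the database is embedded in your own JVM, any class on your application classpath — including a custom Analyzer — is visible to the index framework, which is exactly why this setup is required.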


2. The fix

Switch to IKAnalyzer2012FF_u1.jar, which can be downloaded from Google Code (https://code.google.com/archive/p/ik-analyzer/downloads). This is a community-patched build that fixes IKAnalyzer's incompatibility with Lucene versions above 3.5. Even with this jar, however, there is still a problem; the error reads:

Caused by: java.lang.AbstractMethodError: org.apache.lucene.analysis.Analyzer.createComponents(Ljava/lang/String;)Lorg/apache/lucene/analysis/Analyzer$TokenStreamComponents;

That is, IKAnalyzer's Analyzer class is still incompatible with this Lucene version: Lucene 5.x changed the abstract method from createComponents(String fieldName, Reader reader) to createComponents(String fieldName), so the jar's old override no longer implements the abstract method, and the JVM throws AbstractMethodError when it is called.
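Whatever fix you apply, the jar itself has to be on your application's classpath. If the project is built with Maven, one way is to install the downloaded jar into the local repository — the groupId/artifactId/version below are arbitrary choices, not official coordinates:

```shell
# Install the downloaded jar into the local Maven repository.
# The coordinates are arbitrary; pick whatever fits your build.
mvn install:install-file \
  -Dfile=IKAnalyzer2012FF_u1.jar \
  -DgroupId=org.wltea \
  -DartifactId=ik-analyzer \
  -Dversion=2012FF_u1 \
  -Dpackaging=jar
```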

The solution: add the following two adapter classes.

package com.uc.wa.function;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;

public class IKAnalyzer5x extends Analyzer{

	private boolean useSmart;
	
	public boolean useSmart() {
		return useSmart;
	}

	public void setUseSmart(boolean useSmart) {
		this.useSmart = useSmart;
	}

	public IKAnalyzer5x(){
		this(false);
	}
	
	public IKAnalyzer5x(boolean useSmart){
		super();
		this.useSmart = useSmart;
	}

	
	/**
	 * Pre-Lucene-5.x signature, kept for reference:
	 *
	 * protected TokenStreamComponents createComponents(String fieldName, final Reader in) {
	 *     Tokenizer _IKTokenizer = new IKTokenizer(in , this.useSmart());
	 *     return new TokenStreamComponents(_IKTokenizer);
	 * }
	 */
	
	
    /**
     * Override the Lucene 5.x version of createComponents:
     * the new Analyzer contract builds the tokenizer components
     * from the field name alone, without a Reader parameter.
     */
	@Override
	protected TokenStreamComponents createComponents(String fieldName) {
		Tokenizer _IKTokenizer = new IKTokenizer5x(this.useSmart());
		return new TokenStreamComponents(_IKTokenizer);
	}
}

package com.uc.wa.function;

import java.io.IOException;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;

public class IKTokenizer5x extends Tokenizer{
    
    // the IK segmenter implementation
    private IKSegmenter _IKImplement;
     
    // term text attribute
    private final CharTermAttribute termAtt;
    // term offset attribute
    private final OffsetAttribute offsetAtt;
    // term type attribute (see the type constants in org.wltea.analyzer.core.Lexeme)
    private final TypeAttribute typeAtt;
    // end position of the last term, used by end()
    private int endPosition;
     
     
    /**
     * Pre-Lucene-5.x constructor, kept for reference:
     *
     * public IKTokenizer(Reader in , boolean useSmart){
     *     super(in);
     *     offsetAtt = addAttribute(OffsetAttribute.class);
     *     termAtt = addAttribute(CharTermAttribute.class);
     *     typeAtt = addAttribute(TypeAttribute.class);
     *     _IKImplement = new IKSegmenter(input , useSmart);
     * }
     */
     
    /**
     * Constructor adapted to the Lucene 5.x Tokenizer class.
     * The Reader is no longer passed in here; Lucene supplies it
     * through setReader()/reset() via the inherited "input" field.
     * @param useSmart whether to use IK's smart segmentation mode
     */
    public IKTokenizer5x(boolean useSmart){
        super();
        offsetAtt = addAttribute(OffsetAttribute.class);
        termAtt = addAttribute(CharTermAttribute.class);
        typeAtt = addAttribute(TypeAttribute.class);
        _IKImplement = new IKSegmenter(input , useSmart);
    }
 
    /* (non-Javadoc)
     * @see org.apache.lucene.analysis.TokenStream#incrementToken()
     */
    @Override
    public boolean incrementToken() throws IOException {
        // clear all attributes left over from the previous token
        clearAttributes();
        Lexeme nextLexeme = _IKImplement.next();
        if(nextLexeme != null){
            // copy the Lexeme into the Lucene attributes:
            // set the term text
            termAtt.append(nextLexeme.getLexemeText());
            // set the term length
            termAtt.setLength(nextLexeme.getLength());
            // set the term offsets
            offsetAtt.setOffset(nextLexeme.getBeginPosition(), nextLexeme.getEndPosition());
            // record the end position of this term for end()
            endPosition = nextLexeme.getEndPosition();
            // record the term type
            typeAtt.setType(nextLexeme.getLexemeTypeString());          
            // return true to signal that more terms may follow
            return true;
        }
        // return false to signal that the token stream is exhausted
        return false;
    }
     
    /*
     * (non-Javadoc)
     * @see org.apache.lucene.analysis.Tokenizer#reset(java.io.Reader)
     */
    @Override
    public void reset() throws IOException {
        super.reset();
        _IKImplement.reset(input);
    }   
     
    @Override
    public final void end() throws IOException {
        // Lucene 5.x requires subclasses to call super.end() first
        super.end();
        // set the final offset to the end of the last term
        int finalOffset = correctOffset(this.endPosition);
        offsetAtt.setOffset(finalOffset, finalOffset);
    }
}

This resolves the incompatibility between IKAnalyzer2012FF_u1.jar and Lucene 5. In your own code, simply use IKAnalyzer5x wherever you would have used IKAnalyzer.
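A quick way to sanity-check the adapted analyzer outside Neo4j is to tokenize a string with it directly, using the standard Lucene 5.x TokenStream workflow. This is a sketch; the class name and sample sentence are my own choices:

```java
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AnalyzerSmokeTest {
    public static void main(String[] args) throws IOException {
        // smart mode on, so "南昌市" should stay one token
        Analyzer analyzer = new IKAnalyzer5x(true);
        try (TokenStream ts = analyzer.tokenStream("field", new StringReader("南昌市人民政府"))) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();                         // mandatory before the first incrementToken()
            while (ts.incrementToken()) {
                System.out.println(term.toString());
            }
            ts.end();                           // mandatory after the last incrementToken()
        }
        analyzer.close();
    }
}
```

If this prints whole words rather than single characters, the adapter is wired up correctly and can be handed to Neo4j.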


3. Finally

An example of building and querying a Chinese full-text index in Neo4j:

	/**
	 * Create a full-text index for a single node.
	 * 
	 * @param id the node id
	 * @param propKeys the property keys to index
	 */
	public static void createFullTextIndex(long id, List<String> propKeys) {
		log.info("method[createFullTextIndex] begin.propKeys<"+propKeys+">");
		Index<Node> entityIndex = null;
		
		try (Transaction tx = Neo4j.graphDb.beginTx()) {
			entityIndex = Neo4j.graphDb.index().forNodes("NodeFullTextIndex",
					MapUtil.stringMap(IndexManager.PROVIDER, "lucene", "analyzer", IKAnalyzer5x.class.getName()));
			
			Node node = Neo4j.graphDb.getNodeById(id);
			log.info("method[createFullTextIndex] get node id<"+node.getId()+"> name<"
					+node.getProperty("knowledge_name")+">");
			/* fetch the node's property details */
			Set<Map.Entry<String, Object>> properties = node.getProperties(propKeys.toArray(new String[0]))
					.entrySet();
			for (Map.Entry<String, Object> property : properties) {
				log.info("method[createFullTextIndex] index prop<"+property.getKey()+":"+property.getValue()+">");
				entityIndex.add(node, property.getKey(), property.getValue());
			}
			tx.success();
		}
	}

	/**
	 * Query using the full-text index.
	 * 
	 * @param fields the fields to search
	 * @param query the query string
	 * @return matching nodes as property maps, with a "score" entry added
	 * @throws IOException 
	 */
	public static List<Map<String, Object>> selectByFullTextIndex(String[] fields, String query) throws IOException {
        List<Map<String, Object>> ret = Lists.newArrayList();
		try (Transaction tx = Neo4j.graphDb.beginTx()) {
			IndexManager index = Neo4j.graphDb.index();
			/* query via the analyzer-backed index */
			Index<Node> addressNodeFullTextIndex = index.forNodes("NodeFullTextIndex",
					MapUtil.stringMap(IndexManager.PROVIDER, "lucene", "analyzer", IKAnalyzer5x.class.getName()));
			Query q = IKQueryParser.parseMultiField(fields, query);
			
			IndexHits<Node> foundNodes = addressNodeFullTextIndex.query(q);

	        for(Node n : foundNodes){
	        	Map<String, Object> m = n.getAllProperties();
	        	if(!Float.isNaN(foundNodes.currentScore())){
		        	m.put("score", foundNodes.currentScore());
	        	}
				log.info("method[selectByIndex] score<"+foundNodes.currentScore()+">");
	        	ret.add(m);
	        }
			tx.success();
		} catch (IOException e) {
			log.error("method[selectByIndex] fields<"+Joiner.on(",").join(fields)+"> query<"+query+">", e);
			throw e;
		}
		return ret;
	}

Note that I used IKQueryParser here, which builds the Query automatically from the query words and the target fields. This sidesteps yet another pitfall: querying with a raw Lucene query string is broken. For example, the query string "address:南昌市" matches every address containing the single character 市, which is clearly wrong. Switching to IKQueryParser fixes this. IKQueryParser is a utility bundled with the original IKAnalyzer, but it was stripped out of IKAnalyzer2012FF_u1.jar, so I re-introduced the original IKAnalyzer jar as well; the project ends up with both jars on the classpath.
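The difference between the two query styles can be sketched side by side. The helper method below is illustrative only; the field name and query text follow the example from this paragraph:

```java
import org.apache.lucene.search.Query;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.index.Index;
import org.neo4j.graphdb.index.IndexHits;
import org.wltea.analyzer.query.IKQueryParser;

public class QueryStyleDemo {
    // Contrast the two query styles against the same legacy index.
    static void demo(Index<Node> index) throws java.io.IOException {
        // Problematic: a raw query string is analyzed term by term, so
        // "address:南昌市" matches every address containing the character 市.
        IndexHits<Node> tooBroad = index.query("address:南昌市");
        tooBroad.close();

        // Better: IKQueryParser segments "南昌市" into whole words first
        // and builds the Query object from those tokens.
        Query q = IKQueryParser.parseMultiField(new String[]{"address"}, "南昌市");
        IndexHits<Node> precise = index.query(q);
        precise.close();
    }
}
```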


That covers most of the pitfalls.



    

