Implementing Low-Level Query Parsing in Solr (QParser)

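Solr builds on Lucene to provide query parsing and a search server, and it integrates extensions as plugins, so adding a custom query-parsing scheme is straightforward. Solr also ships with several built-in QParsers. Applications without special requirements can use these components unmodified: knowing the basic parameters they accept (Local Params) is enough to build powerful search features.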

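Solr's design goal is to hide as much of Lucene's underlying complexity as possible and to expose full-text search through configuration instead. "Low level" in the title means using Lucene query syntax directly inside Solr to construct exactly the query you need, for example: +(title:solr) +(+(title:lucene content:hadoop) (title:search)). This requires familiarity with Lucene query syntax. In practice, the QParsers bundled with Solr may not be enough. Suppose you index data with dictionary-based tokenization, and some keywords in the dictionary are tied to the interaction design (for instance, when a user searches for a keyword, the UI recommends related keywords). The front end then has to combine certain keywords in some way and submit them to the back end for parsing and searching. The back end has a dedicated query-parsing component (called a QParser in Solr, and extensible) that turns the request into the "language" Lucene understands, runs it against the index, and returns the results.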

Here is a simple example:

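When a user searches for "北京" (Beijing), I need to supply a group of synonyms: "北平", "首都", "京城", and "京都". There is also a group of keywords related to "北京": "首都博物館" (Capital Museum), "故宮", "天壇", and "八達嶺長城", where "首博" is a synonym of "首都博物館". The goal is this: when the user searches for "北京", run a synonym-expanded search (Solr can do this directly with a synonym Analyzer); when the user then clicks one of the related keywords, expand that keyword as well. For example, clicking "首都博物館" should produce the following expanded query, in a form Lucene can parse: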

+((title:北京 content:北京) (title:北平 content:北平) (title:首都 content:首都) (title:京城 content:京城) (title:京都 content:京都)) +((title:首都博物館 content:首都博物館) (title:首博 content:首博))
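
The shape of the expanded query above can be reproduced with plain string assembly. The following is a minimal sketch, not code from Solr or Lucene: the class `ExpandedQueryBuilder` and its methods are hypothetical names, and it assumes every synonym group is required (MUST) while keywords and fields inside a group are optional (SHOULD).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of Solr or Lucene): assembles the expanded
// Lucene query string from synonym groups and field names.
class ExpandedQueryBuilder {

    // One group, e.g. [北京, 北平] over [title, content], becomes
    // ((title:北京 content:北京) (title:北平 content:北平))
    static String orGroup(List<String> synonyms, List<String> fields) {
        List<String> perKeyword = new ArrayList<String>();
        for (String keyword : synonyms) {
            List<String> perField = new ArrayList<String>();
            for (String field : fields) {
                perField.add(field + ":" + keyword);
            }
            perKeyword.add("(" + String.join(" ", perField) + ")");
        }
        return "(" + String.join(" ", perKeyword) + ")";
    }

    // Every group is required, so each one is prefixed with "+" (MUST).
    static String build(List<List<String>> groups, List<String> fields) {
        List<String> required = new ArrayList<String>();
        for (List<String> group : groups) {
            required.add("+" + orGroup(group, fields));
        }
        return String.join(" ", required);
    }

    public static void main(String[] args) {
        List<List<String>> groups = Arrays.asList(
                Arrays.asList("北京", "北平", "首都", "京城", "京都"),
                Arrays.asList("首都博物館", "首博"));
        System.out.println(build(groups, Arrays.asList("title", "content")));
    }
}
```

A string built this way still has to be parsed by a query parser on the server. The rest of the article builds the equivalent Query objects directly instead.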
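In fact, this would be simpler with Lucene alone: you would construct the required Query from the Terms in the tokenizer dictionary (which exist in the index) and run the search. In Solr, however, the logic for constructing and parsing queries lives in a QParser. Building on QParserPlugin lets you reuse Solr's base and add-on components, and because custom components are configured in solrconfig.xml, the setup stays flexible.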

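Solr does provide a QParserPlugin for this purpose, LuceneQParserPlugin, whose core parsing is implemented in LuceneQParser. It is a relatively low-level component, and you only need to configure a matching requestHandler in solrconfig.xml, for example: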

  <queryParser name="lucene" class="org.apache.solr.search.LuceneQParserPlugin"/>
  <requestHandler name="/lucene" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="defType">lucene</str>
      <str name="bf">recip(ms(NOW,publishDate),3.16e-13,1,1)</str>
      <str name="qf">title^1.50 content</str>
      
      <bool name="hl">true</bool>
      <str name="hl.fl">title content</str>
      <int name="hl.fragsize">100</int>
      <int name="hl.snippets">3</int>
      
      <str name="fl">*,score</str>
      <str name="qt">standard</str>
      <str name="wt">standard</str>
      <str name="version">2.2</str>
      <str name="echoParams">explicit</str>
      <str name="indent">true</str>
      <str name="debugQuery">on</str>
      <str name="explainOther">on</str>
    </lst>
  </requestHandler>
Start the Solr search server (for example, deployed in a Tomcat container) and submit the Lucene query string from above directly:
http://192.168.0.181:8080/solr/core3/lucene/?q=+((title:北京 content:北京) (title:北平 content:北平) (title:首都 content:首都) (title:京城 content:京城) (title:京都 content:京都)) +((title:首都博物館 content:首都博物館) (title:首博 content:首博))&start=0&rows=10
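The keywords in the query are then combined with OR, which is not what we intended. One likely cause is that a literal "+" in a URL query string is decoded as a space before it reaches the parser, so the MUST operators are lost; percent-encoding each "+" as %2B avoids this. Alternatively, LuceneQParser can be modified so that "+" is parsed as MUST, which gives the intended search.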
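
When a query like this is sent as a GET parameter, it helps to encode it first, because servlet containers decode an unencoded "+" in a query string as a space. The following is a minimal sketch using only the JDK; the class name `QueryUrlEncoder` is hypothetical, and the host and core in the URL are the ones used in this article.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper: percent-encodes the q parameter so that "+" survives
// as %2B instead of being decoded as a space on the server side.
class QueryUrlEncoder {

    static String encodeParam(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a required charset on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String q = "+(title:solr) +(title:lucene)";
        String url = "http://192.168.0.181:8080/solr/core3/lucene/?q="
                + encodeParam(q) + "&start=0&rows=10";
        System.out.println(url);
    }
}
```

`URLEncoder` writes a space as "+" and a literal plus sign as "%2B", so after decoding the server sees the original query string again, including its MUST operators.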

Another approach, described below, is to extend Solr's QParserPlugin directly.

First, agree on a common interface with the front end:

北京OR北平OR首都OR京城OR京都AND首都博物館OR首博  <=>  +((title:北京 content:北京) (title:北平 content:北平) (title:首都 content:首都) (title:京城 content:京城) (title:京都 content:京都)) +((title:首都博物館 content:首都博物館) (title:首博 content:首博))
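
The left-hand side of this mapping can be understood as "split on AND, then split each part on OR". The following is a library-free sketch of that idea only; `AndOrSyntax` is a hypothetical name, and the real parser later in this article builds Lucene Query objects rather than lists.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the compact front-end syntax:
// AND separates required groups, OR separates alternatives inside a group.
class AndOrSyntax {

    static List<List<String>> parse(String input) {
        List<List<String>> groups = new ArrayList<List<String>>();
        for (String andPart : input.split("\\s*AND\\s*")) {
            List<String> keywords = new ArrayList<String>();
            for (String keyword : andPart.split("\\s*OR\\s*")) {
                String trimmed = keyword.trim();
                if (!trimmed.isEmpty()) {
                    keywords.add(trimmed);
                }
            }
            if (!keywords.isEmpty()) {
                groups.add(keywords);
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        // Two required groups: five synonyms of 北京, then two names for the museum.
        System.out.println(parse("北京OR北平OR首都OR京城OR京都AND首都博物館OR首博"));
    }
}
```

Because the operators are matched as bare substrings, a keyword that itself contains "AND" or "OR" in upper case would be split in the wrong place. That is acceptable for the Chinese keywords used here, but it is worth keeping in mind before reusing the syntax for Latin-script keywords.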

We parse this syntax in a custom QParser. The code is as follows:

package org.shirdrn.solr.search;

import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.Map.Entry;

import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.DisjunctionMaxQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.solr.common.params.CommonParams;
import org.apache.solr.common.params.DefaultSolrParams;
import org.apache.solr.common.params.DisMaxParams;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.search.DisMaxQParser;
import org.apache.solr.util.SolrPluginUtils;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * Customized solr QParser of the plugin
 * 
 * @author shirdrn 2011/11/03
 */
public class SimpleQParser extends DisMaxQParser {
	private static final Logger LOG = LoggerFactory.getLogger(SimpleQParser.class);
	// using low level Term query? For internal search usage.
	private boolean useLowLevelTermQuery = false;
	private float tiebreaker = 0f;
	// per-request boosts; kept as instance fields so concurrent requests do not share state
	private float mainBoost = 1.0f;
	private float frontBoost = 1.0f;
	private float rearBoost = 1.0f;
	private String userQuery = "";

	public SimpleQParser(String qstr, SolrParams localParams,
			SolrParams params, SolrQueryRequest req) {
		super(qstr, localParams, params, req);
	}
	
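	@Override
	public Query parse() throws ParseException {
		SolrParams solrParams = localParams == null ? params : new DefaultSolrParams(localParams, params);
		queryFields = SolrPluginUtils.parseFieldBoosts(solrParams.getParams(DisMaxParams.QF));
		if (0 == queryFields.size()) {
			queryFields.put(req.getSchema().getDefaultSearchFieldName(), 1.0f);
		}

		// Assumption: the boosts come from the mainBoost/frontBoost/rearBoost
		// parameters set in solrconfig.xml; the original listing never read them.
		mainBoost = solrParams.getFloat("mainBoost", 1.0f);
		frontBoost = solrParams.getFloat("frontBoost", 1.0f);
		rearBoost = solrParams.getFloat("rearBoost", 1.0f);

		/* the main query we will execute.  we disable the coord because
		 * this query is an artificial construct
		 */
		BooleanQuery query = new BooleanQuery(true);
		addMainQuery(query, solrParams);
		// for AND/OR input, rebuild the query from low-level TermQuerys
		// (rewrite of the q parameter, needed for highlighting)
		if (useLowLevelTermQuery) {
			query = new BooleanQuery(true);
			rewriteAndOrQuery(userQuery, query, solrParams);
		}
		addBoostQuery(query, solrParams);
		addBoostFunctions(query, solrParams);
		return query;
	}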

	protected void addMainQuery(BooleanQuery query, SolrParams solrParams)throws ParseException {
		tiebreaker = solrParams.getFloat(DisMaxParams.TIE, 0.0f);
		// get the comma separated list of fields used for payload

		/*
		 * a parser for dealing with user input, which will convert things to
		 * DisjunctionMaxQueries
		 */
		SolrPluginUtils.DisjunctionMaxQueryParser up = getParser(queryFields,DisMaxParams.QS, solrParams, tiebreaker);

		/* * * Main User Query * * */
	    parsedUserQuery = null;
	    userQuery = getString();
	    altUserQuery = null;
	    if (userQuery == null || userQuery.trim().length() < 1) {
	      // If no query is specified, we may have an alternate
	      altUserQuery = getAlternateUserQuery(solrParams);
	      query.add(altUserQuery, BooleanClause.Occur.MUST);
	    } else {
	      // There is a valid query string
	      userQuery = SolrPluginUtils.partialEscape(SolrPluginUtils.stripUnbalancedQuotes(userQuery)).toString();
	      userQuery = SolrPluginUtils.stripIllegalOperators(userQuery).toString();

	      // use low level Term for constructing TermQuery or BooleanQuery.
	      // warning: for internal AND, OR query, in order to integrate with Solr for obtaining highlight
	      String luceneQueryText = userQuery;
	      String q = solrParams.get(CommonParams.Q);
			if(q!=null && (q.indexOf("AND")!=-1 || q.indexOf("OR")!=-1)) {
	    	  addBasicAndOrQuery(luceneQueryText, query, solrParams);
	    	  luceneQueryText = query.toString();
	    	  useLowLevelTermQuery = true;
	      }
	      
	      LOG.debug("userQuery=" + luceneQueryText);
			parsedUserQuery = getUserQuery(luceneQueryText, up, solrParams);
			
			BooleanQuery rewritedQuery = rewriteQueries(parsedUserQuery);
			query.add(rewritedQuery, BooleanClause.Occur.MUST);
		}
	}
	
	protected void rewriteAndOrQuery(String userQuery, BooleanQuery query, SolrParams solrParams)throws ParseException {
		addBasicAndOrQuery(userQuery, query, solrParams);
	}
	
	/**
	 * Parse mixing MUST and SHOULD query defined by us, 
	 * e.g. 首都OR北京OR北平AND首博OR首都博物館
	 * @param userQuery
	 * @param query
	 * @param solrParams
	 * @throws ParseException
	 */
	protected void addBasicAndOrQuery(String userQuery, BooleanQuery query, SolrParams solrParams)throws ParseException {
		userQuery = SolrPluginUtils.partialEscape(SolrPluginUtils.stripUnbalancedQuotes(userQuery)).toString();
		userQuery = SolrPluginUtils.stripIllegalOperators(userQuery).toString();
		LOG.debug("userQuery=" + userQuery);
		BooleanQuery parsedUserQuery = new BooleanQuery(true);
		String[] a = userQuery.split("\\s*AND\\s*");
		String q = "";
		if(a.length==0) {
			// nothing left after splitting: treat the whole input as a single term
			createTermQuery(parsedUserQuery, userQuery);
		} else if(a.length>=3) {
			// three or more AND parts are handled only when there is no OR,
			// e.g. 首都AND北京AND北平; mixed input with OR adds no clause here
			if(userQuery.indexOf("OR")==-1) {
				BooleanQuery andBooleanQuery = parseAndQuery(a);
				parsedUserQuery.add(andBooleanQuery, BooleanClause.Occur.MUST);
			}
		} else {
			// one or two AND parts; each part is an OR group, e.g. 北京OR北平AND首博
			q = a[0].trim();
			if(q.length()>0) {
				parsedUserQuery.add(parseOrQuery(q, frontBoost), BooleanClause.Occur.MUST);
			}
			if(a.length==2) {
				q = a[1].trim();
				if(q.length()>0) {
					parsedUserQuery.add(parseOrQuery(q, rearBoost), BooleanClause.Occur.MUST);
				}
			}
		}
		parsedUserQuery.setBoost(mainBoost);
		BooleanQuery rewritedQuery = rewriteQueries(parsedUserQuery);
		query.add(rewritedQuery, BooleanClause.Occur.MUST);
	}
	
	/**
	 * Parse SHOULD query, e.g. 北京OR北平OR首都
	 * @param ors
	 * @param boost
	 * @return
	 */
	private BooleanQuery parseOrQuery(String ors, Float boost) {
		BooleanQuery bq = new BooleanQuery(true);
		for(String or : ors.split("\\s*OR\\s*")) {
			if(!or.isEmpty()) {
				createTermQuery(bq, or.trim());
			}
		}
		bq.setBoost(boost);
		return bq;
	}

	/**
	 * Create TermQuery for some term text, query fields.
	 * @param bq
	 * @param qsr
	 */
	private void createTermQuery(BooleanQuery bq, String qsr) {
		for(String field : queryFields.keySet()) {
			TermQuery tq = new TermQuery(new Term(field, qsr));
			if(queryFields.get(field)!=null) {
				tq.setBoost(queryFields.get(field));
			}
			bq.add(tq, BooleanClause.Occur.SHOULD);
		}
	}
	
	/**
	 * Parse MUST query, e.g. 首都AND北京AND北平
	 * @param ands
	 * @return
	 */
	private BooleanQuery parseAndQuery(String[] ands) {
		BooleanQuery andBooleanQuery = new BooleanQuery(true);
		for(String and : ands) {
			if(!and.isEmpty()) {
				BooleanQuery bq = new BooleanQuery(true);
				createTermQuery(bq, and);
				andBooleanQuery.add(bq, BooleanClause.Occur.MUST);
			}
		}
		return andBooleanQuery;
	}
	
	/**
	 * Rewrite a query, especially a {@link BooleanQuery}, whose
	 * subclauses maybe include {@link BooleanQuery}s, {@link DisjunctionMaxQuery}s,
	 * {@link TermQuery}s, {@link PhraseQuery}s, {@link PayloadQuery}s, etc.
	 * @param input
	 * @return
	 */
	private BooleanQuery rewriteQueries(Query input) { 
		BooleanQuery output = new BooleanQuery(true);
		if(input instanceof BooleanQuery) {
			BooleanQuery bq = (BooleanQuery) input;
			for(BooleanClause clause : bq.clauses()) {
				if(clause.getQuery() instanceof DisjunctionMaxQuery) {
					BooleanClause.Occur occur = clause.getOccur();
					output.add(rewriteDisjunctionMaxQueries((DisjunctionMaxQuery) clause.getQuery()), occur); // BooleanClause.Occur.SHOULD
				} else {
					output.add(clause.getQuery(), clause.getOccur());
				}
			}
		} else if(input instanceof DisjunctionMaxQuery) {
			output.add(rewriteDisjunctionMaxQueries((DisjunctionMaxQuery) input), BooleanClause.Occur.SHOULD); // BooleanClause.Occur.SHOULD
		}
		output.setBoost(input.getBoost()); // boost main clause
		return output;
	}
	
	/**
	 * Rewrite the {@link DisjunctionMaxQuery}, because of default parsing
	 * query string to {@link PhraseQuery}s which are not what we want.
	 * @param input
	 * @return
	 */
	private BooleanQuery rewriteDisjunctionMaxQueries(DisjunctionMaxQuery input) { 
		// input e.g. (content:"吉林 長白山 內蒙古 九寨溝" | title:"吉林 長白山 內蒙古 九寨溝"^1.5)~1.0
		Map<String, BooleanQuery> m = new HashMap<String, BooleanQuery>();
		Iterator<Query> iter = input.iterator();
		while (iter.hasNext()) {
			Query query = iter.next();
			if(query instanceof PhraseQuery) {
				PhraseQuery pq = (PhraseQuery) query; // e.g. content:"吉林 長白山 內蒙古 九寨溝"
				for(Term term : pq.getTerms()) {
					BooleanQuery fieldsQuery = m.get(term.text());
					if(fieldsQuery==null) {
						fieldsQuery = new BooleanQuery(true);
						m.put(term.text(), fieldsQuery);
					}
					fieldsQuery.setBoost(pq.getBoost());
					fieldsQuery.add(new TermQuery(term), BooleanClause.Occur.SHOULD);
				}				
			} else if(query instanceof TermQuery) {
				TermQuery termQuery = (TermQuery) query;
				BooleanQuery fieldsQuery = m.get(termQuery.getTerm().text());
				if(fieldsQuery==null) {
					fieldsQuery = new BooleanQuery(true);
					m.put(termQuery.getTerm().text(), fieldsQuery);
				}
				fieldsQuery.setBoost(termQuery.getBoost());
				fieldsQuery.add(termQuery, BooleanClause.Occur.SHOULD);
			}
		}
		
		Iterator<Entry<String, BooleanQuery>> it = m.entrySet().iterator();
		BooleanQuery mustBooleanQuery = new BooleanQuery(true);
		while(it.hasNext()) {
			Entry<String, BooleanQuery> entry = it.next();
			BooleanQuery shouldBooleanQuery = new BooleanQuery(true);
			createTermQuery(shouldBooleanQuery, entry.getKey());
			mustBooleanQuery.add(shouldBooleanQuery, BooleanClause.Occur.MUST);
		}
		return mustBooleanQuery;
	}

}

Next, the QParser plugin only has to return the SimpleQParser implemented above, which is very simple:

package org.shirdrn.solr.search;

import org.apache.solr.common.params.SolrParams;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.search.QParser;
import org.apache.solr.search.QParserPlugin;

/**
 * 
 * Simple query parser plugin.
 * e.g. search "Tokyo AND food"
 * 
 * @author shirdrn
 * @date   2011-11-03
 */
public class SimpleQParserPlugin extends QParserPlugin {
	@SuppressWarnings("rawtypes")
	@Override
	public void init(NamedList args) {
	}

	@Override
	public QParser createParser(String qstr, SolrParams localParams,
			SolrParams params, SolrQueryRequest req) {
		return new  SimpleQParser(qstr, localParams,params, req);
	}	
}
Finally, configure the corresponding requestHandler in Solr's solrconfig.xml. An example configuration fragment:

  <queryParser name="simple" class="org.shirdrn.solr.search.SimpleQParserPlugin" />
  <requestHandler name="/simple" class="solr.SearchHandler">
        <lst name="defaults">
                <str name="defType">simple</str>
                <str name="qf">title^1.5 content</str>

                <str name="bf">recip(ms(NOW,publishDate),3.16e-13,1,1)^1.68</str>

                <str name="mainBoost">1.555</str>
                <str name="frontBoost">1.333</str>
                <str name="rearBoost">1.222</str>

                <str name="fl">*,score</str>
                <str name="qt">standard</str>
                <str name="wt">standard</str>
                <str name="version">2.2</str>
                <str name="echoParams">explicit</str>
                <bool name="hl">true</bool>
                <str name="hl.fl">title content</str>
                <int name="hl.snippets">3</int>

                <str name="indent">true</str>
                <str name="debugQuery">on</str>
                <str name="explainOther">on</str>
        </lst>
  </requestHandler>
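
The bf default in this configuration uses Solr's recip function, recip(x, m, a, b) = a / (m * x + b), with x being the document age in milliseconds. The following standalone sketch only evaluates that formula to show how gently it decays with m = 3.16e-13; the class `RecipBoost` and the sample ages are illustrative assumptions, not part of the original setup.

```java
// Hypothetical illustration: evaluates recip(x, m, a, b) = a / (m * x + b)
// for the values used in the bf parameter (m = 3.16e-13, a = 1, b = 1).
class RecipBoost {

    static double recip(double x, double m, double a, double b) {
        return a / (m * x + b);
    }

    public static void main(String[] args) {
        double msPerDay = 24 * 60 * 60 * 1000.0;
        double[] agesInDays = {0, 365, 3650};
        for (double days : agesInDays) {
            double boost = recip(days * msPerDay, 3.16e-13, 1, 1);
            System.out.printf("age=%.0f days, boost=%.4f%n", days, boost);
        }
    }
}
```

With these constants a document published today gets a function value of 1.0, and a one-year-old document still gets about 0.99, so recency acts as a mild tie-breaker rather than a dominant factor.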
Now start the Solr search server and issue a search such as:

http://192.168.0.181:8080/solr/core/simple/?q=北京OR北平OR首都OR京城OR京都AND首都博物館OR首博&start=0&rows=10

This achieves our goal. The search response, in XML format, looks like this:

<result name="response" numFound="710" start="0" maxScore="2.5198267">
... ...
<lst name="debug">
<str name="rawquerystring">北京OR北平OR首都OR京城OR京都AND首都博物館OR首博</str>
<str name="querystring">北京OR北平OR首都OR京城OR京都AND首都博物館OR首博</str>
<str name="parsedquery">
+((+((content:北京 title:北京^1.5 content:北平 title:北平^1.5 content:首都 title:首都^1.5 content:京城 title:京城^1.5 content:京都 title:京都^1.5)^1.333) +((content:首都博物館 title:首都博物館^1.5 content:首博 title:首博^1.5)^1.222))^1.555) FunctionQuery(1.0/(3.16E-13*float(ms(const(1320330543420),date(publishDate)))+1.0))
</str>
<str name="parsedquery_toString">
+((+((content:北京 title:北京^1.5 content:北平 title:北平^1.5 content:首都 title:首都^1.5 content:京城 title:京城^1.5 content:京都 title:京都^1.5)^1.333) +((content:首都博物館 title:首都博物館^1.5 content:首博 title:首博^1.5)^1.222))^1.555) 1.0/(3.16E-13*float(ms(const(1320330543420),date(publishDate)))+1.0)
</str>
In addition, when necessary, you can extend SolrDispatchFilter to control HTTP request parameters precisely and support more flexible search requests.
