Solr Advanced, Part 4: Creating an Index from Files

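Index data does not always come from structured sources such as databases, XML, JSON, or CSV. Very often it comes from unstructured files such as PDF, Word, HTML, or MP3. Solr has good support for building an index from this kind of unstructured data, and it relies on Apache Tika to do so.

Let's look at how to create an index from a PDF file in Solr 4.10.

First, configure file indexing

Create a new core to hold the file index. For the detailed steps, see: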

http://blog.csdn.net/u011439289/article/details/41699009

Import the jars

Create an extract folder under the working directory to hold Solr's extension jars.

\solr_tomcat\solr\pdf_core\extract

Copy solr-cell-4.10.2.jar from \solr-4.10.2\dist into the extract folder, then copy

all the jars under \solr-4.10.2\contrib\extraction\lib into the extract folder as well.
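The two copy steps can also be scripted. The following is only a minimal sketch using the JDK's file API; the class name ExtractJarCopier and its copyJars method are hypothetical helpers, not part of Solr.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper (not part of Solr): copies jars into a core's extract folder. */
final class ExtractJarCopier {

    /**
     * Copies every file in sourceDir whose name matches the glob (non-recursive)
     * into extractDir, creating extractDir if needed. Returns the number copied.
     */
    static int copyJars(Path sourceDir, String glob, Path extractDir) throws IOException {
        Files.createDirectories(extractDir);
        int copied = 0;
        try (DirectoryStream<Path> jars = Files.newDirectoryStream(sourceDir, glob)) {
            for (Path jar : jars) {
                Path target = extractDir.resolve(jar.getFileName().toString());
                Files.copy(jar, target, StandardCopyOption.REPLACE_EXISTING);
                copied++;
            }
        }
        return copied;
    }
}
```

With the layout used in this article, one would call copyJars once with the dist folder and the glob "solr-cell-*.jar", and once with the contrib\extraction\lib folder and the glob "*.jar", both targeting \solr_tomcat\solr\pdf_core\extract.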

Configure solrconfig.xml

Add the request handler configuration:


<requestHandler name="/extract" class="solr.extraction.ExtractingRequestHandler" >  
       <lst name="defaults">  
        <str name="fmap.content">text</str>  
        <str name="lowernames">true</str>  
        <str name="uprefix">attr_</str>  
        <str name="captureAttr">true</str>  
       </lst>  
</requestHandler>

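Specify the location of the dependency jars:

<lib dir="extract" regex=".*\.jar" />

Note that this relative path is resolved against the core's home directory, not against the folder that contains the config file. For example, my config file is in \solr_tomcat\solr\pdf_core\conf and my jars are in \solr_tomcat\solr\pdf_core\extract, so the relative path is extract, not ../extract.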
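To make the path rule concrete, here is a small sketch that resolves a dir value against the core's home directory and applies the regex to a file name. It only imitates the behaviour described above with plain JDK calls; LibDirResolver is a hypothetical name, and the whole-name regex match is an assumption about how Solr applies the regex attribute.

```java
import java.nio.file.Path;
import java.util.regex.Pattern;

/** Hypothetical illustration of how a <lib dir="..." regex="..."/> entry is interpreted. */
final class LibDirResolver {

    /** Resolves the dir attribute against the core's home directory (not the conf folder). */
    static Path resolveLibDir(Path coreHome, String dir) {
        return coreHome.resolve(dir).normalize();
    }

    /** Assumed semantics: the regex must match the whole file name. */
    static boolean matches(String regex, String fileName) {
        return Pattern.matches(regex, fileName);
    }
}
```

Resolving "extract" against solr_tomcat/solr/pdf_core gives solr_tomcat/solr/pdf_core/extract, while "../extract" would point at solr_tomcat/solr/extract, which is the wrong folder.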

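Configure schema.xml: define the types used by the indexed fields, that is, the field types.

The text_general type uses two txt files (stopwords.txt and synonyms.txt). Both ship with the example core in the distribution, under \solr_tomcat\solr\collection1\conf. Copy these two files into the conf directory of the new core, next to schema.xml.

Note: if you created the new core by copying an existing one, some of these types and fields are already defined in the original config file. Make sure each one is defined only once!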

<fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>  
  <fieldtype name="string"  class="solr.StrField" sortMissingLast="true" omitNorms="true"/>  
  <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">  
     <analyzer type="index">  
       <tokenizer class="solr.StandardTokenizerFactory"/>  
       <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />  
       <filter class="solr.LowerCaseFilterFactory"/>  
     </analyzer>  
     <analyzer type="query">  
       <tokenizer class="solr.StandardTokenizerFactory"/>  
       <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />  
       <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>  
       <filter class="solr.LowerCaseFilterFactory"/>  
     </analyzer>  
   </fieldType>

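Configure the index fields, that is, the field elements

One of them is a dynamic field, attr_*. What is it for? When Solr parses a file, the file carries many properties of its own, and which properties exist is not known in advance. Solr extracts all of them and uses attr_ as a prefix in front of the file's own property name, and the combination becomes the field name.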

<field name="id"        type="string"       indexed="true"  stored="true"  multiValued="false" required="true"/>  
 <field name="text"      type="text_general" indexed="true"  stored="true"/>  
 <field name="_version_" type="long"         indexed="true"  stored="true"/>  
   
 <dynamicField name="attr_*" type="text_general" indexed="true" stored="true" multiValued="true"/>
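The naming rule for these attr_* fields can be sketched in a few lines. This is a simplified illustration of the combined effect of lowernames=true and uprefix=attr_, not Solr's actual code; UnknownFieldMapper and its methods are hypothetical, and "declared" here means only fields listed explicitly in schema.xml.

```java
import java.util.Set;

/** Hypothetical, simplified model of lowernames + uprefix handling for extracted metadata. */
final class UnknownFieldMapper {

    /** lowernames=true: lowercase the name and turn every non-alphanumeric character into '_'. */
    static String lowerName(String name) {
        StringBuilder sb = new StringBuilder(name.length());
        for (int i = 0; i < name.length(); i++) {
            char ch = name.charAt(i);
            sb.append(Character.isLetterOrDigit(ch) ? Character.toLowerCase(ch) : '_');
        }
        return sb.toString();
    }

    /** uprefix: names that are not declared in the schema get the prefix prepended. */
    static String toSolrField(String metadataName, Set<String> declaredFields, String uprefix) {
        String name = lowerName(metadataName);
        return declaredFields.contains(name) ? name : uprefix + name;
    }
}
```

Under these assumptions a metadata entry named Content-Type ends up in attr_content_type, which the attr_* dynamic field accepts, while a value for id goes to the declared id field unchanged.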

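At this point the Solr server-side configuration is complete.

 

Test class CreateIndexFromPDF.java

The required jars are listed in the earlier article "Solr Advanced, Part 1: Adding an Index from Java Code and Adding IKAnalyzer Tokenization Support".

In SolrJ 4.10, the addFile method of ContentStreamUpdateRequest takes an additional contentType parameter that states the content type of the file. For the available values, see: ContentType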
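Since addFile needs a content type, a small lookup by file extension can save hard-coding it for every file. The sketch below is only an illustration: ContentTypes is a hypothetical helper, and the table covers just the formats mentioned at the start of this article.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper: picks a MIME type for addFile(...) from the file extension. */
final class ContentTypes {

    private static final Map<String, String> BY_EXTENSION = new HashMap<>();
    static {
        BY_EXTENSION.put("pdf", "application/pdf");
        BY_EXTENSION.put("doc", "application/msword");
        BY_EXTENSION.put("html", "text/html");
        BY_EXTENSION.put("mp3", "audio/mpeg");
    }

    /** Returns the MIME type for the extension, or a generic binary type if it is unknown. */
    static String forFileName(String fileName) {
        int dot = fileName.lastIndexOf('.');
        String ext = dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return BY_EXTENSION.getOrDefault(ext, "application/octet-stream");
    }
}
```

The result can be passed straight to addFile, for example up.addFile(new File(fileName), ContentTypes.forFileName(fileName)).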

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
import org.apache.solr.client.solrj.response.QueryResponse;

import java.io.File;
import java.io.IOException;
/**
 * Created by Lhx on 14-12-4.
 */
public class CreateIndexFromPDF {

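    public static void indexFilesSolr(String fileName, String solrId) throws IOException, SolrServerException {
        // The URL must include the core name, not just the Solr root.
        String urlString = "http://localhost:8080/solr/pdf_core";
        SolrServer solr = new HttpSolrServer(urlString);
        // Send the file to the /extract handler configured in solrconfig.xml.
        ContentStreamUpdateRequest up = new ContentStreamUpdateRequest("/extract");
        String contentType = "application/pdf";
        up.addFile(new File(fileName), contentType);
        // literal.id supplies the value of the required id field.
        up.setParam("literal.id", solrId);
        // Request parameters take precedence over the handler defaults in solrconfig.xml.
        up.setParam("uprefix","attr_");
        up.setParam("fmap.content","attr_content");
        // Commit right away so the document is visible to the query below.
        up.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);

        solr.request(up);

        // Query all documents and print the raw response.
        QueryResponse rsp = solr.query(new SolrQuery("*:*"));
        System.out.println(rsp);
    }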

    public static void main(String[] args) {
        String fileName = "F:\\Sencha_Touch_2.0用戶指南(中文版).pdf";
        String solrId = "Sencha_Touch_2.0用戶指南(中文版).pdf";
        try {
            indexFilesSolr(fileName,solrId);
        } catch (IOException e) {
            e.printStackTrace();
        } catch (SolrServerException e) {
            e.printStackTrace();
        }
    }
}

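Running the code above uploads our PDF file to the Solr server, which parses it and creates the index.

The solr.query call at the end runs a query to show the result of parsing and indexing. After parsing, the PDF has become plain text content, and the console output also shows a lot of other information about the document.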
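It can help to see roughly what the request looks like on the wire. The sketch below builds the handler URL with the same parameters the test class sets. It is an approximation only: ExtractRequestUrl is a hypothetical helper, and commit=true merely stands in for the setAction call, since the exact parameters SolrJ sends may differ.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.Map;

/** Hypothetical helper: renders handler parameters as a URL for inspection or logging. */
final class ExtractRequestUrl {

    /** Appends the handler path and the URL-encoded parameters to the core URL. */
    static String build(String coreUrl, String handler, Map<String, String> params) {
        StringBuilder sb = new StringBuilder(coreUrl).append(handler);
        char separator = '?';
        for (Map.Entry<String, String> p : params.entrySet()) {
            sb.append(separator).append(enc(p.getKey())).append('=').append(enc(p.getValue()));
            separator = '&';
        }
        return sb.toString();
    }

    private static String enc(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on the JVM, so this should never happen.
            throw new IllegalStateException(e);
        }
    }
}
```

Passing a LinkedHashMap keeps the parameters in insertion order, which makes the resulting URL easier to compare against the Java code.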

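Once Solr has parsed the PDF and created the index, we can also check the result in the Solr admin UI.

Select "Query" and simply click the "Execute Query" button.

Postscript:

After restarting Tomcat, Solr reported a duplicate field definition error. I had already run into this error in earlier exercises, so I quickly found the duplicated definitions (id, long, and so on) in schema.xml and deleted them.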
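Duplicates like these can be spotted before restarting Tomcat. The following is a minimal sketch that parses a schema with the JDK's XML parser and reports field and field type names that are declared more than once; SchemaDuplicateFinder is a hypothetical helper, and it compares tag names case-insensitively because this schema uses both fieldType and fieldtype.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical helper: finds field and fieldType names declared more than once. */
final class SchemaDuplicateFinder {

    /** Returns entries such as "field:id" or "fieldtype:long" for every repeated declaration. */
    static Set<String> findDuplicates(String schemaXml) {
        try {
            NodeList all = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(schemaXml.getBytes(StandardCharsets.UTF_8)))
                    .getElementsByTagName("*");
            Set<String> seen = new HashSet<>();
            Set<String> duplicates = new TreeSet<>();
            for (int i = 0; i < all.getLength(); i++) {
                Element el = (Element) all.item(i);
                String kind = el.getTagName().toLowerCase(Locale.ROOT);
                if (!kind.equals("field") && !kind.equals("fieldtype")) {
                    continue; // dynamicField, uniqueKey, analyzer, ... are ignored here
                }
                String key = kind + ":" + el.getAttribute("name");
                if (!seen.add(key)) {
                    duplicates.add(key);
                }
            }
            return duplicates;
        } catch (Exception e) {
            throw new IllegalArgumentException("schema is not well-formed XML", e);
        }
    }
}
```

Feeding it the contents of conf\schema.xml and printing the returned set shows which definitions need to be removed.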

On the next Tomcat start, it reported that some jar could not be loaded. I later found that in

<lib dir="extract" regex=".*\.jar" />

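the directory given in dir was wrong, which caused the Tomcat error.

After that Tomcat started without errors, but running the code from the Java console failed with an error.

It turned out that I had written the wrong urlString. I had written: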

http://localhost:8080/solr

which does not say which core the file should be uploaded to. After fixing it, the PDF document could be submitted.
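One way to avoid this mistake is to build the URL from the Solr root and the core name instead of typing it by hand. A minimal sketch, where SolrUrls and coreUrl are hypothetical names:

```java
/** Hypothetical helper: builds the per-core URL that HttpSolrServer should be given. */
final class SolrUrls {

    /** Joins the Solr root URL and the core name, tolerating a trailing slash on the root. */
    static String coreUrl(String baseUrl, String coreName) {
        if (coreName == null || coreName.trim().isEmpty()) {
            throw new IllegalArgumentException("core name is required");
        }
        String base = baseUrl.endsWith("/") ? baseUrl.substring(0, baseUrl.length() - 1) : baseUrl;
        return base + "/" + coreName.trim();
    }
}
```

For this article's setup, SolrUrls.coreUrl("http://localhost:8080/solr", "pdf_core") yields the urlString used in the test class, and a missing core name fails fast instead of producing a request to the Solr root.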

 

Appendix:

solrconfig.xml

<?xml version="1.0" encoding="UTF-8" ?>
<!--
 Licensed to the Apache Software Foundation (ASF) under one or more
 contributor license agreements.  See the NOTICE file distributed with
 this work for additional information regarding copyright ownership.
 The ASF licenses this file to You under the Apache License, Version 2.0
 (the "License"); you may not use this file except in compliance with
 the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
-->

<!--
 This is a stripped down config file used for a simple example...  
 It is *not* a good example to work from. 
-->
<config>
    <luceneMatchVersion>4.10.2</luceneMatchVersion>
    <!--  The DirectoryFactory to use for indexes.
          solr.StandardDirectoryFactory, the default, is filesystem based.
          solr.RAMDirectoryFactory is memory based, not persistent, and doesn't work with replication. -->
    <directoryFactory name="DirectoryFactory" class="${solr.directoryFactory:solr.StandardDirectoryFactory}"/>

    <dataDir>${solr.core0.data.dir:}</dataDir>

    <!-- To enable dynamic schema REST APIs, use the following for <schemaFactory>:
    
         <schemaFactory class="ManagedIndexSchemaFactory">
           <bool name="mutable">true</bool>
           <str name="managedSchemaResourceName">managed-schema</str>
         </schemaFactory>
         
         When ManagedIndexSchemaFactory is specified, Solr will load the schema from
         the resource named in 'managedSchemaResourceName', rather than from schema.xml.
         Note that the managed schema resource CANNOT be named schema.xml.  If the managed
         schema does not exist, Solr will create it after reading schema.xml, then rename
         'schema.xml' to 'schema.xml.bak'. 
         
         Do NOT hand edit the managed schema - external modifications will be ignored and
         overwritten as a result of schema modification REST API calls.
  
         When ManagedIndexSchemaFactory is specified with mutable = true, schema
         modification REST API calls will be allowed; otherwise, error responses will be
         sent back for these requests. 
    -->
    <schemaFactory class="ClassicIndexSchemaFactory"/>

    <updateHandler class="solr.DirectUpdateHandler2">
        <updateLog>
            <str name="dir">${solr.core0.data.dir:}</str>
        </updateLog>
    </updateHandler>

    <!-- realtime get handler, guaranteed to return the latest stored fields 
      of any document, without the need to commit or open a new searcher. The current 
      implementation relies on the updateLog feature being enabled. -->
    <requestHandler name="/get" class="solr.RealTimeGetHandler">
        <lst name="defaults">
            <str name="omitHeader">true</str>
        </lst>
    </requestHandler>

    <requestHandler name="/replication" class="solr.ReplicationHandler" startup="lazy"/>

    <requestDispatcher handleSelect="true">
        <requestParsers enableRemoteStreaming="false" multipartUploadLimitInKB="2048" formdataUploadLimitInKB="2048"/>
    </requestDispatcher>

    <requestHandler name="standard" class="solr.StandardRequestHandler" default="true"/>
    <requestHandler name="/analysis/field" startup="lazy" class="solr.FieldAnalysisRequestHandler"/>
    <requestHandler name="/update" class="solr.UpdateRequestHandler"/>
    <requestHandler name="/admin/" class="org.apache.solr.handler.admin.AdminHandlers"/>

    <requestHandler name="/admin/ping" class="solr.PingRequestHandler">
        <lst name="invariants">
            <str name="q">solrpingquery</str>
        </lst>
        <lst name="defaults">
            <str name="echoParams">all</str>
        </lst>
    </requestHandler>

    <!-- Newly added content -->
    <requestHandler name="/extract" class="solr.extraction.ExtractingRequestHandler">
        <lst name="defaults">
            <str name="fmap.content">text</str>
            <str name="lowernames">true</str>
            <str name="uprefix">attr_</str>
            <str name="captureAttr">true</str>
        </lst>
    </requestHandler>

    <lib dir="extract" regex=".*\.jar"/>


    <!-- config for the admin interface -->
    <admin>
        <defaultQuery>solr</defaultQuery>
    </admin>

</config>

schema.xml

<?xml version="1.0" ?>
<!--
 Licensed to the Apache Software Foundation (ASF) under one or more
 contributor license agreements.  See the NOTICE file distributed with
 this work for additional information regarding copyright ownership.
 The ASF licenses this file to You under the Apache License, Version 2.0
 (the "License"); you may not use this file except in compliance with
 the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
-->

<schema name="example core zero" version="1.1">

    <!-- general -->

    <field name="type" type="string" indexed="true" stored="true" multiValued="false"/>
    <field name="name" type="string" indexed="true" stored="true" multiValued="false"/>
    <field name="core0" type="string" indexed="true" stored="true" multiValued="false"/>

    <!-- field to use to determine and enforce document uniqueness. -->
    <uniqueKey>id</uniqueKey>

    <!-- field for the QueryParser to use when an explicit fieldname is absent -->
    <defaultSearchField>name</defaultSearchField>

    <!-- SolrQueryParser configuration: defaultOperator="AND|OR" -->
    <solrQueryParser defaultOperator="OR"/>

    <!-- Newly added. Types such as long and string already exist in the original config file, so delete the duplicates. -->
    <fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
    <fieldtype name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
        <analyzer type="index">
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
            <filter class="solr.LowerCaseFilterFactory"/>
        </analyzer>
        <analyzer type="query">
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
            <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
            <filter class="solr.LowerCaseFilterFactory"/>
        </analyzer>
    </fieldType>

    <field name="id" type="string" indexed="true" stored="true" multiValued="false" required="true"/>
    <field name="text" type="text_general" indexed="true" stored="true"/>
    <field name="_version_" type="long" indexed="true" stored="true"/>

    <dynamicField name="attr_*" type="text_general" indexed="true" stored="true" multiValued="true"/>

</schema>


