Search engines: building an index and searching with pyes, the Python client for elasticsearch

Host environment: Ubuntu 13.04

Python version: 2.7.4

When reposting, please credit: http://blog.geekcome.com/archives/118

Official site: http://www.elasticsearch.com/

Chinese site: http://es-cn.medcl.net/

The following introduction is quoted from the Chinese site:

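Okay, suppose you have built a web site or an application. You will probably need to add search (because it is just that necessary), and in practice getting search up and running is hard. We want search that is fast and also easy to install (ideally painless), with a very free schema definition (schema free), the ability to index data over HTTP as JSON, a server that is always available (HA, high availability, which cannot be dropped), the ability to scale from one machine to thousands, search that is real-time, something simple to use, and multi-tenancy. We need a complete solution, and one built for the cloud.
"Make search simpler" is our manifesto, "and it has to be cool, like a bonsai".
elasticsearch aims to solve all of the problems above and more. It is an open source (Apache 2 license), distributed, RESTful search engine built on top of Apache Lucene.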

1. Installing the distributed server

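First download elasticsearch from http://www.elasticsearch.org/download/ and pick a suitable version. Here I downloaded the DEB package for Ubuntu and installed it directly with dpkg. Once it is installed, the service can be started with:

sudo service elasticsearch start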
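To confirm from Python that the service really came up, a quick check like the one below can help. This is only a sketch: it assumes the server listens on the default HTTP address 127.0.0.1:9200 (the same address used by the scripts later in this post) and that it answers a GET on the root URL with a JSON document. The function name `server_info` and its `opener` parameter are made up for this example.

```python
import json

try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2


def server_info(host='127.0.0.1', port=9200, timeout=3.5, opener=urlopen):
    """Return the JSON served at the root URL, or None if the server is unreachable.

    `opener` is a hypothetical hook that lets the HTTP call be swapped out in tests.
    """
    url = 'http://%s:%d/' % (host, port)
    try:
        response = opener(url, timeout=timeout)
        try:
            return json.loads(response.read().decode('utf-8'))
        finally:
            response.close()
    except (IOError, ValueError):
        # IOError covers refused connections and timeouts,
        # ValueError covers a body that is not valid JSON.
        return None
```

Calling `server_info()` right after starting the service returns a dict when the node answers and `None` otherwise. The node may need a few seconds to start, so a `None` immediately after `service elasticsearch start` is not necessarily a failure.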

2. Installing the pyes client

Run

pip install pyes

to install the Python client library for elasticsearch.

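3. Installing the Chinese word segmentation plugin for elasticsearch

Download the Chinese analysis plugin directly from https://github.com/medcl/elasticsearch-rtf/blob/master/elasticsearch/plugins/analysis-ik/elasticsearch-analysis-ik-1.2.2.jar

Then move it into the elasticsearch installation directory, /usr/share/elasticsearch/analysis-ik/ (with the plugin path configured below, the jar presumably belongs in /usr/share/elasticsearch/plugins/analysis-ik/ instead).

Edit the configuration file /etc/elasticsearch/elasticsearch.yml

and set the plugin path:

path.plugins: /usr/share/elasticsearch/plugins

Then add the analyzer configuration: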

index:
  analysis:
    analyzer:
      ik:
        alias: [ik_analyzer]
        type: org.elasticsearch.index.analysis.IkAnalyzerProvider

Finally, download the dictionary used by the IK analyzer:

cd /etc/elasticsearch
wget http://github.com/downloads/medcl/elasticsearch-analysis-ik/ik.zip --no-check-certificate
unzip ik.zip
rm ik.zip

Restart the elasticsearch service and the plugin is ready.
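One way to check that the ik analyzer was picked up after the restart is to ask elasticsearch to analyze a piece of text and look at the tokens it returns. The sketch below only builds the request URL. It assumes the `_analyze` endpoint with `analyzer` and `text` query parameters, as offered by elasticsearch releases of this era, and an index that already exists. The helper name `analyze_url` is invented for this example.

```python
try:
    from urllib.parse import quote  # Python 3
except ImportError:
    from urllib import quote  # Python 2


def analyze_url(index, text, analyzer='ik', host='127.0.0.1', port=9200):
    """Build a URL that asks `index` to tokenize `text` with `analyzer`.

    `text` is expected to be a unicode string; it is sent UTF-8 encoded
    and percent-escaped.
    """
    return 'http://%s:%d/%s/_analyze?analyzer=%s&text=%s' % (
        host, port, quote(index), quote(analyzer), quote(text.encode('utf-8')))
```

Opening the resulting URL in a browser or with curl should return a JSON list of tokens if the analyzer is available, and an error message naming the analyzer if it is not.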

4. Building the index

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import io
import os

from pyes import ES

INDEX_NAME = 'txtfiles'
DOC_TYPE = 'test-type'


class IndexFiles(object):
    def __init__(self, root):
        conn = ES('127.0.0.1:9200', timeout=3.5)  # connect to ES
        try:
            conn.delete_index(INDEX_NAME)  # drop any index left over from an earlier run
        except Exception:
            pass  # most likely the index did not exist yet
        conn.create_index(INDEX_NAME)  # create a new index

        # Define how the fields are stored; all three fields share the same settings.
        field = {'boost': 1.0,
                 'index': 'analyzed',
                 'store': 'yes',
                 'type': u'string',
                 'indexAnalyzer': 'ik',
                 'searchAnalyzer': 'ik',
                 'term_vector': 'with_positions_offsets'}
        mapping = {name: dict(field) for name in (u'content', u'name', u'dirpath')}

        conn.put_mapping(DOC_TYPE, {'properties': mapping}, [INDEX_NAME])  # define test-type

        self.addIndex(conn, root)

        conn.default_indices = [INDEX_NAME]  # set the default index
        conn.refresh()  # refresh so the newly inserted documents become visible

    def addIndex(self, conn, root):
        print(root)
        for dirpath, dirnames, filenames in os.walk(root):
            for filename in filenames:
                if not filename.endswith('.txt'):
                    continue
                print('Indexing file %s' % filename)
                path = os.path.join(dirpath, filename)
                try:
                    # Read the file as UTF-8 text; the with block closes it on error too.
                    with io.open(path, encoding='utf-8') as f:
                        contents = f.read()
                    if len(contents) > 0:
                        conn.index({'name': filename, 'dirpath': dirpath, 'content': contents},
                                   INDEX_NAME, DOC_TYPE)
                    else:
                        print('no contents in file %s' % path)
                except Exception as e:
                    print(e)


if __name__ == '__main__':
    IndexFiles('./txtfiles')
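The directory walk in the script above is easier to try out on its own when it is separated from the elasticsearch calls. Here is a minimal sketch of that idea using only the standard library. The generator name `iter_txt_documents` is made up for this example; the document fields (`name`, `dirpath`, `content`) are the ones used in the script.

```python
import io
import os


def iter_txt_documents(root):
    """Yield one dict per non-empty UTF-8 .txt file found under `root`."""
    for dirpath, dirnames, filenames in os.walk(root):
        for filename in sorted(filenames):
            if not filename.endswith('.txt'):
                continue  # only plain text files are indexed
            path = os.path.join(dirpath, filename)
            with io.open(path, encoding='utf-8') as handle:
                content = handle.read()
            if content:  # empty files are skipped
                yield {'name': filename, 'dirpath': dirpath, 'content': content}
```

With this in place, the indexing loop would shrink to something like `for doc in iter_txt_documents('./txtfiles'): conn.index(doc, 'txtfiles', 'test-type')`, and the walk can be checked against a scratch directory without a running server.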

5. Searching with highlighted results

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from pyes import ES, HighLighter, Search, StringQuery

conn = ES('127.0.0.1:9200', timeout=3.5)  # connect to ES
sq = StringQuery(u'世界末日', 'content')  # search the content field for "end of the world"
h = HighLighter(['<b>'], ['</b>'], fragment_size=20)  # wrap matches in <b>...</b>

s = Search(sq, highlight=h)
s.add_highlight("content")
results = conn.search(s, indices='txtfiles', doc_types='test-type')

hits = []
for r in results:
    # Show the first highlighted fragment instead of the full content, when there is one.
    if "content" in r._meta.highlight:
        r['content'] = r._meta.highlight[u"content"][0]
    hits.append(r)
    print(r['content'])
print(len(hits))
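For readers who want to see what this search amounts to without pyes, the sketch below builds a request body in elasticsearch's JSON query DSL, using a `query_string` query plus a `highlight` section. It is an approximation for illustration, and the exact JSON that pyes sends may differ. Both function names are invented for this example.

```python
def build_search_body(text, field='content', pre_tag='<b>', post_tag='</b>',
                      fragment_size=20):
    """Return a query DSL body that searches `field` and highlights matches."""
    return {
        'query': {'query_string': {'query': text, 'default_field': field}},
        'highlight': {
            'pre_tags': [pre_tag],
            'post_tags': [post_tag],
            'fragment_size': fragment_size,
            'fields': {field: {}},
        },
    }


def displayed_content(source, highlight, field='content'):
    """Prefer the first highlighted fragment, falling back to the stored field."""
    fragments = highlight.get(field) if highlight else None
    return fragments[0] if fragments else source[field]
```

`displayed_content` mirrors the loop in the script: a hit that has a highlight entry for `content` shows the first fragment, and any other hit shows the stored text unchanged.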