Elasticsearch IK Chinese Analyzer: Exact Queries

Installing Elasticsearch 6.6.2 on Mac

Installing the Elasticsearch Head plugin on Mac

Building on the two articles above, let's take a look at IK.

IKAnalyzer: a free, open-source Java tokenizer and currently one of the more popular Chinese analyzers. It is simple and stable; for particularly good results you need to maintain the dictionary yourself, and custom dictionaries are supported.

I. Installing the IK analyzer plugin

Download URL: https://github.com/medcl/elasticsearch-analysis-ik/releases?after=v6.4.2 (the screenshot of the release list is omitted; v6.3.0 is used here, as the descriptor below shows).

Then create an ik folder under the elasticsearch plugins directory and unzip the downloaded archive into it.

Edit the plugin-descriptor.properties file inside the ik folder and set the version numbers (my Elasticsearch version here is 6.2.2 and the IK version is 6.3.0; the project page says any release within the same major version, 6 in this case, can be used), as follows:

description=IK Analyzer for Elasticsearch
#
# 'version': plugin's version
version=6.3.0
#
# 'name': the plugin name
name=analysis-ik
#
# 'classname': the name of the class to load, fully-qualified.
classname=org.elasticsearch.plugin.analysis.ik.AnalysisIkPlugin
#
# 'java.version' version of java the code is built against
# use the system property java.specification.version
# version string must be a sequence of nonnegative decimal integers
# separated by "."'s and may have leading zeros
java.version=1.8
#
# 'elasticsearch.version' version of elasticsearch compiled against
# You will have to release a new version of the plugin for each new
# elasticsearch release. This version is checked when the plugin
# is loaded so Elasticsearch will refuse to start in the presence of
# plugins with the incorrect elasticsearch.version.
elasticsearch.version=6.2.2

Then restart Elasticsearch: switch to the bin folder and run the elasticsearch command.

On a successful start the log lists the loaded plugins, and analysis-ik should be among them (startup screenshot omitted).
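
As an extra check that the plugin really loaded, you can ask Elasticsearch for its plugin list. A minimal sketch using Python's requests library (assuming Elasticsearch is listening on localhost:9200, as in the rest of this article):

import requests

# List installed plugins; "analysis-ik" should appear in the output
# if the archive was unpacked and the versions were set correctly.
resp = requests.get("http://localhost:9200/_cat/plugins?v")
print(resp.text)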

II. Common Elasticsearch API tests

1. Create an index

http://localhost:9200/es  PUT

Response:
{
    "acknowledged": true,
    "shards_acknowledged": true,
    "index": "es"
}
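
The same request, sketched with Python's requests library (the index name es matches the example above):

import requests

# PUT creates the index; the response should contain "acknowledged": true.
resp = requests.put("http://localhost:9200/es")
print(resp.json())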

2. Create a mapping and set the analyzers

http://localhost:9200/es/_mapping/doc  POST
Header: Content-Type: application/json
Body:
{
  "properties":{
    "content":{
      "type":"text",
      "analyzer":"ik_max_word",
      "search_analyzer":"ik_smart"
    }
  }
}

Response:
{
    "acknowledged": true
}
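
A Python sketch of the same mapping call; ik_max_word is used at index time and ik_smart at search time, exactly as in the JSON body above:

import requests

mapping = {
    "properties": {
        "content": {
            "type": "text",
            "analyzer": "ik_max_word",       # fine-grained tokenization when indexing
            "search_analyzer": "ik_smart"    # coarser tokenization for query strings
        }
    }
}

# In Elasticsearch 6.x the mapping type ("doc" here) is part of the URL.
resp = requests.post("http://localhost:9200/es/_mapping/doc", json=mapping)
print(resp.json())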

3. Add data: insert 4 test documents

http://localhost:9200/es/doc/1  POST
Header: Content-Type: application/json
Body:
{
  "content":"美國留給伊拉克的是個爛攤子嗎"
}

http://localhost:9200/es/doc/2  POST
Header: Content-Type: application/json
{
  "content":"公安部:各地校車將享最高路權"
}

http://localhost:9200/es/doc/3  POST
Header: Content-Type: application/json
{
  "content":"中韓漁警衝突調查:韓警平均每天扣1艘中國漁船"
}

http://localhost:9200/es/doc/4  POST
Header: Content-Type: application/json
{
  "content":"中國駐洛杉磯領事館遭亞裔男子槍擊 嫌犯已自首"
}
Response (shown here for document 4):
{
    "_index": "es",
    "_type": "doc",
    "_id": "4",
    "_version": 1,
    "result": "created",
    "_shards": {
        "total": 2,
        "successful": 1,
        "failed": 0
    },
    "_seq_no": 1,
    "_primary_term": 1
}
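
Rather than issuing the four requests by hand, the same documents can be indexed in a small loop; a sketch with Python's requests library:

import requests

docs = [
    "美國留給伊拉克的是個爛攤子嗎",
    "公安部:各地校車將享最高路權",
    "中韓漁警衝突調查:韓警平均每天扣1艘中國漁船",
    "中國駐洛杉磯領事館遭亞裔男子槍擊 嫌犯已自首",
]

# Index each document with an explicit id (1..4), as in the examples above.
for i, content in enumerate(docs, start=1):
    resp = requests.post(f"http://localhost:9200/es/doc/{i}",
                         json={"content": content})
    print(resp.json()["result"])   # "created" on the first insert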

4. Tokenized search: the match query below finds 2 documents (note that a match query targets a single field)

http://localhost:9200/es/_search  POST
Body:
{
  "query": {
    "match": {
      "content": "中國"
    }
  }
}

Response:
{
    "took": 2,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": 2,
        "max_score": 0.6489038,
        "hits": [
            {
                "_index": "es",
                "_type": "doc",
                "_id": "4",
                "_score": 0.6489038,
                "_source": {
                    "content": "中國駐洛杉磯領事館遭亞裔男子槍擊 嫌犯已自首"
                }
            },
            {
                "_index": "es",
                "_type": "doc",
                "_id": "3",
                "_score": 0.2876821,
                "_source": {
                    "content": "中韓漁警衝突調查:韓警平均每天扣1艘中國漁船"
                }
            }
        ]
    }
}
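
The same match query from Python, printing only the matching documents; a sketch assuming the index built above:

import requests

query = {"query": {"match": {"content": "中國"}}}
resp = requests.post("http://localhost:9200/es/_search", json=query)

# Each hit carries its id, relevance score, and original document.
for hit in resp.json()["hits"]["hits"]:
    print(hit["_id"], hit["_score"], hit["_source"]["content"])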

5. Delete an index

http://localhost:9200/index  DELETE   (replace index with the name of the index to delete)
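
A sketch of the same call from Python, shown here with a hypothetical index named test so that the es index used in the rest of this article stays intact:

import requests

# DELETE removes the whole index, including its mapping and all documents.
resp = requests.delete("http://localhost:9200/test")
print(resp.json())   # {"acknowledged": true} on success, an error body otherwise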

6. Analyze tokenization with _analyze

http://localhost:9200/_analyze  POST
Body:
{
	"analyzer":"ik_max_word",
	"text":"中國人"
}

Response:
{
    "tokens": [
        {
            "token": "中國人",
            "start_offset": 0,
            "end_offset": 3,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "中國",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 1
        },
        {
            "token": "國人",
            "start_offset": 1,
            "end_offset": 3,
            "type": "CN_WORD",
            "position": 2
        }
    ]
}
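
A small Python sketch that calls _analyze and prints only the token strings, which is handy when experimenting with different analyzers:

import requests

body = {"analyzer": "ik_max_word", "text": "中國人"}
resp = requests.post("http://localhost:9200/_analyze", json=body)

# Prints just the terms from the response above: 中國人, 中國, 國人
print([t["token"] for t in resp.json()["tokens"]])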

III. Exact combined queries (and, or)

1. All data currently in the index (note: since Part II, document 1 has been re-indexed with the content "美國人111" and two more test documents, ids 5 and 6, have been added):

http://localhost:9200/es/_search  GET (query all documents); response:

{
    "took": 1,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": 6,
        "max_score": 1,
        "hits": [
            {
                "_index": "es",
                "_type": "doc",
                "_id": "5",
                "_score": 1,
                "_source": {
                    "content": "美國人特朗普垃圾"
                }
            },
            {
                "_index": "es",
                "_type": "doc",
                "_id": "4",
                "_score": 1,
                "_source": {
                    "content": "中國駐洛杉磯領事館遭亞裔男子槍擊 嫌犯已自首"
                }
            },
            {
                "_index": "es",
                "_type": "doc",
                "_id": "2",
                "_score": 1,
                "_source": {
                    "content": "公安部:各地校車將享最高路權"
                }
            },
            {
                "_index": "es",
                "_type": "doc",
                "_id": "6",
                "_score": 1,
                "_source": {
                    "content": "美航空母艦國人"
                }
            },
            {
                "_index": "es",
                "_type": "doc",
                "_id": "1",
                "_score": 1,
                "_source": {
                    "content": "美國人111"
                }
            },
            {
                "_index": "es",
                "_type": "doc",
                "_id": "3",
                "_score": 1,
                "_source": {
                    "content": "中韓漁警衝突調查:韓警平均每天扣1艘中國漁船"
                }
            }
        ]
    }
}

2. or query

http://localhost:9200/es/_search  POST
Body:
{
  "query": {
    "match": {
      "content": "美國人1"   
    }
  }
}

 

The query returns 2 documents (the response screenshot is omitted). This is because the analyzer splits "美國人111" into two terms, "美國人" and "111", and the default match operator is or, so any document containing either term matches. A sketch that runs both the or and the and variants follows the next example.

3. Use an and query for an exact match

http://localhost:9200/es/_search?pretty=true  POST

{
  "query": {
    "match": {
      "content": {
        "query": "美國人111",
        "operator": "and"
      }
    }
  }
}

 

The response (screenshot omitted) contains exactly one document, the one whose content includes both "美國人" and "111"; with the and operator every query term must match, so the query becomes exact.
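
To make the difference concrete, the sketch below runs the same query string with both operators and prints the hit counts (assuming the six test documents listed in III.1):

import requests

def count_hits(operator):
    query = {
        "query": {
            "match": {
                "content": {
                    "query": "美國人111",
                    "operator": operator
                }
            }
        }
    }
    resp = requests.post("http://localhost:9200/es/_search", json=query)
    return resp.json()["hits"]["total"]   # in ES 6.x "total" is a plain number

print("or :", count_hits("or"))    # documents containing 美國人 OR 111
print("and:", count_hits("and"))   # only documents containing both terms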

4. The tokenization of "美國人111": the _analyze API (see II.6) shows the terms it is split into; the original screenshot is omitted here.

ik_max_word: splits the text at the finest granularity. For example, "中華人民共和國國歌" is split into "中華人民共和國, 中華人民, 中華, 華人, 人民共和國, 人民, 人, 民, 共和國, 共和, 和, 國國, 國歌", exhausting every possible combination; suitable for term queries.

ik_smart: does the coarsest split. For example, "中華人民共和國國歌" is split into "中華人民共和國, 國歌"; suitable for phrase queries.
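
The contrast is easy to reproduce with _analyze; a sketch that tokenizes the same sentence with both analyzers:

import requests

def tokens(analyzer, text):
    resp = requests.post("http://localhost:9200/_analyze",
                         json={"analyzer": analyzer, "text": text})
    return [t["token"] for t in resp.json()["tokens"]]

# ik_max_word produces every plausible sub-word; ik_smart keeps only the coarsest split.
print(tokens("ik_max_word", "中華人民共和國國歌"))
print(tokens("ik_smart", "中華人民共和國國歌"))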


