Building on the previous two articles, let's take a look at IK.
IKAnalyzer is a free, open-source Java tokenizer and currently one of the more popular Chinese analyzers. It is simple and stable; it supports custom dictionaries, and if you want really good results you need to maintain the dictionary yourself.
I. Installing the IK analyzer plugin
Download it from https://github.com/medcl/elasticsearch-analysis-ik/releases?after=v6.4.2, choosing a release that matches your Elasticsearch version.
Then create an ik folder under Elasticsearch's plugins directory and extract the downloaded archive into it.
Edit the plugin-descriptor.properties file inside the ik folder and set the version numbers (I am running Elasticsearch 6.2.2 with IK 6.3.0; according to the project page, any release within the same major version, 6 in this case, will work):
description=IK Analyzer for Elasticsearch
#
# 'version': plugin's version
version=6.3.0
#
# 'name': the plugin name
name=analysis-ik
#
# 'classname': the name of the class to load, fully-qualified.
classname=org.elasticsearch.plugin.analysis.ik.AnalysisIkPlugin
#
# 'java.version' version of java the code is built against
# use the system property java.specification.version
# version string must be a sequence of nonnegative decimal integers
# separated by "."'s and may have leading zeros
java.version=1.8
#
# 'elasticsearch.version' version of elasticsearch compiled against
# You will have to release a new version of the plugin for each new
# elasticsearch release. This version is checked when the plugin
# is loaded so Elasticsearch will refuse to start in the presence of
# plugins with the incorrect elasticsearch.version.
elasticsearch.version=6.2.2
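Elasticsearch reads plugin-descriptor.properties at startup and refuses to start if the plugin's declared elasticsearch.version does not match the running node, which is why the file above is edited by hand. A rough sketch of that check in Python (the strict-equality rule and the parser are simplifications for illustration; the real check happens inside Elasticsearch):

```python
# Sketch: parse a plugin-descriptor.properties file and compare its
# declared elasticsearch.version against the running node's version.
# The strict-equality rule here is a simplification for illustration.

descriptor = """\
version=6.3.0
name=analysis-ik
# comment lines are ignored
elasticsearch.version=6.2.2
"""

def parse_properties(text):
    """Parse simple key=value lines, skipping comments and blanks."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props

props = parse_properties(descriptor)
node_version = "6.2.2"  # the Elasticsearch version used in this article

if props["elasticsearch.version"] != node_version:
    raise SystemExit("plugin built for %s, node is %s"
                     % (props["elasticsearch.version"], node_version))
print("plugin %s %s is compatible" % (props["name"], props["version"]))
```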
Then restart Elasticsearch: switch to the bin directory and run the elasticsearch command.
If the plugin was installed correctly, the startup log will include a line showing that the analysis-ik plugin was loaded.
II. Testing common ES endpoints
1. Create an index
http://localhost:9200/es PUT
Result:
{
"acknowledged": true,
"shards_acknowledged": true,
"index": "es"
}
2. Create a mapping (roughly analogous to a table schema) and configure the analyzers
http://localhost:9200/es/_mapping/doc POST
Header: Content-Type: application/json; body:
{
"properties":{
"content":{
"type":"text",
"analyzer":"ik_max_word",
"search_analyzer":"ik_smart"
}
}
}
Result:
{
"acknowledged": true
}
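The mapping body above indexes the content field with the fine-grained ik_max_word analyzer while analyzing search input with the coarser ik_smart. A minimal sketch of building and serializing that body in Python before handing it to whatever HTTP client you prefer:

```python
import json

# Build the same mapping body as above: index with the fine-grained
# ik_max_word analyzer, search with the coarser ik_smart analyzer.
mapping = {
    "properties": {
        "content": {
            "type": "text",
            "analyzer": "ik_max_word",
            "search_analyzer": "ik_smart",
        }
    }
}

body = json.dumps(mapping, ensure_ascii=False)
print(body)
# This string would be POSTed to http://localhost:9200/es/_mapping/doc
# with a Content-Type: application/json header.
```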
3. Add data: index four test documents
http://localhost:9200/es/doc/1 POST
Header: Content-Type: application/json
Body:
{
"content":"美國留給伊拉克的是個爛攤子嗎"
}
http://localhost:9200/es/doc/2 POST
Header: Content-Type: application/json
{
"content":"公安部:各地校車將享最高路權"
}
http://localhost:9200/es/doc/3 POST
Header: Content-Type: application/json
{
"content":"中韓漁警衝突調查:韓警平均每天扣1艘中國漁船"
}
http://localhost:9200/es/doc/4 POST
Header: Content-Type: application/json
{
"content":"中國駐洛杉磯領事館遭亞裔男子槍擊 嫌犯已自首"
}
Result (shown for the fourth document):
{
"_index": "es",
"_type": "doc",
"_id": "4",
"_version": 1,
"result": "created",
"_shards": {
"total": 2,
"successful": 1,
"failed": 0
},
"_seq_no": 1,
"_primary_term": 1
}
4. Tokenized search: the match query below finds two documents (note that a match query targets a single field)
http://localhost:9200/es/_search POST
Body:
{
"query": {
"match": {
"content": "中國"
}
}
}
Result:
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0.6489038,
"hits": [
{
"_index": "es",
"_type": "doc",
"_id": "4",
"_score": 0.6489038,
"_source": {
"content": "中國駐洛杉磯領事館遭亞裔男子槍擊 嫌犯已自首"
}
},
{
"_index": "es",
"_type": "doc",
"_id": "3",
"_score": 0.2876821,
"_source": {
"content": "中韓漁警衝突調查:韓警平均每天扣1艘中國漁船"
}
}
]
}
}
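The behaviour of the match query above can be illustrated with a toy simulation: the query text is analyzed into terms, and a document matches if any of its indexed terms overlaps with the query terms. The per-document token sets below are hand-written stand-ins for IK's real output, not actual analyzer results:

```python
# Toy simulation of a match query with the default "or" operator.
# The token sets are hand-written stand-ins for real IK output.
index = {
    "1": {"美國", "留給", "伊拉克", "爛攤子"},
    "2": {"公安部", "各地", "校車", "路權"},
    "3": {"中韓", "漁警", "衝突", "調查", "中國", "漁船"},
    "4": {"中國", "洛杉磯", "領事館", "亞裔", "男子", "槍擊", "嫌犯", "自首"},
}

def match(query_terms, index):
    """Return ids of documents sharing at least one term with the query."""
    return sorted(doc_id for doc_id, terms in index.items()
                  if terms & set(query_terms))

print(match(["中國"], index))  # documents 3 and 4, as in the result above
```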
5. Delete an index
http://localhost:9200/es DELETE (substitute the name of the index you want to delete)
6. Inspect tokenization with _analyze
http://localhost:9200/_analyze POST
Body:
{
"analyzer":"ik_max_word",
"text":"中國人"
}
Result:
{
"tokens": [
{
"token": "中國人",
"start_offset": 0,
"end_offset": 3,
"type": "CN_WORD",
"position": 0
},
{
"token": "中國",
"start_offset": 0,
"end_offset": 2,
"type": "CN_WORD",
"position": 1
},
{
"token": "國人",
"start_offset": 1,
"end_offset": 3,
"type": "CN_WORD",
"position": 2
}
]
}
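The start_offset and end_offset values in the _analyze output are character positions within the input text, so for "中國人" they can be reproduced directly:

```python
# Reproduce the offsets reported by _analyze for "中國人".
# Each token's offsets are its character positions in the input string.
text = "中國人"
tokens = ["中國人", "中國", "國人"]  # tokens from the ik_max_word output above

for position, token in enumerate(tokens):
    start = text.find(token)
    end = start + len(token)
    print({"token": token, "start_offset": start,
           "end_offset": end, "position": position})
```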
III. Exact and combined queries (and, or)
1. All documents currently in the index:
http://localhost:9200/es/_search returns everything:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 6,
"max_score": 1,
"hits": [
{
"_index": "es",
"_type": "doc",
"_id": "5",
"_score": 1,
"_source": {
"content": "美國人特朗普垃圾"
}
},
{
"_index": "es",
"_type": "doc",
"_id": "4",
"_score": 1,
"_source": {
"content": "中國駐洛杉磯領事館遭亞裔男子槍擊 嫌犯已自首"
}
},
{
"_index": "es",
"_type": "doc",
"_id": "2",
"_score": 1,
"_source": {
"content": "公安部:各地校車將享最高路權"
}
},
{
"_index": "es",
"_type": "doc",
"_id": "6",
"_score": 1,
"_source": {
"content": "美航空母艦國人"
}
},
{
"_index": "es",
"_type": "doc",
"_id": "1",
"_score": 1,
"_source": {
"content": "美國人111"
}
},
{
"_index": "es",
"_type": "doc",
"_id": "3",
"_score": 1,
"_source": {
"content": "中韓漁警衝突調查:韓警平均每天扣1艘中國漁船"
}
}
]
}
}
2. OR query
http://localhost:9200/es/_search
Body:
{
"query": {
"match": {
"content": "美國人1"
}
}
}
From the result we can see why: the analyzer splits "美國人111" into the two terms "美國人" and "111", and match's default operator is or, so both documents containing either term were returned.
3. Use an and query for an exact match
http://localhost:9200/es/_search?pretty=true POST
{
"query": {
"match": {
"content": {
"query": "美國人111",
"operator": "and"
}
}
}
}
As this result shows, with operator and, only records whose content contains both "美國人" and "111" match, so exactly one document was returned.
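The difference between the two operators can be sketched with the same toy-token approach (the token sets below are illustrative assumptions, not real IK output):

```python
# Toy simulation of a match query with operator "or" vs "and".
# Token sets are hand-written stand-ins for real IK output.
index = {
    "1": {"美國人", "111"},
    "5": {"美國人", "特朗普", "垃圾"},
}

def match(query_terms, index, operator="or"):
    """or: any query term in the document; and: every query term."""
    terms = set(query_terms)
    if operator == "and":
        return sorted(d for d, t in index.items() if terms <= t)
    return sorted(d for d, t in index.items() if terms & t)

print(match(["美國人", "111"], index))         # or: both documents match
print(match(["美國人", "111"], index, "and"))  # and: only document 1
```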
4. How a string such as "美國人111" is segmented depends on which analyzer you choose. The two IK analyzers differ as follows:
ik_max_word: performs the finest-grained segmentation. For example, it splits "中華人民共和國國歌" into "中華人民共和國, 中華人民, 中華, 華人, 人民共和國, 人民, 人, 民, 共和國, 共和, 和, 國國, 國歌", exhausting every possible combination; suitable for term queries.
ik_smart: performs the coarsest-grained segmentation. For example, it splits "中華人民共和國國歌" into "中華人民共和國, 國歌"; suitable for phrase queries.
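Using the token lists quoted above for "中華人民共和國國歌", every ik_smart token also appears in the ik_max_word output, which is one way to see that ik_smart is simply a coarser selection from the exhaustive segmentation:

```python
# Token lists copied from the ik_max_word / ik_smart description above.
ik_max_word = ["中華人民共和國", "中華人民", "中華", "華人", "人民共和國",
               "人民", "人", "民", "共和國", "共和", "和", "國國", "國歌"]
ik_smart = ["中華人民共和國", "國歌"]

# ik_smart yields far fewer, coarser tokens, and each of them is also
# present in the exhaustive ik_max_word output.
assert set(ik_smart) <= set(ik_max_word)
print("ik_smart: %d tokens, ik_max_word: %d tokens"
      % (len(ik_smart), len(ik_max_word)))
```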
Reference: https://www.jianshu.com/p/362f85ebf383