Installing the IK Chinese Analyzer for ElasticSearch

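Preface

When ElasticSearch is used for search, the inverted index built from the text is critical. For Chinese text, the key question is how to split it into the right terms before indexing. This article shows how to install the IK Chinese analyzer in ElasticSearch (abbreviated as ES from here on).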

Main text

Installation steps

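  1. Download the plugin:

    • Source project page

      Follow the link to the IK project's release page, where prebuilt packages are published. Choose the IK release whose version matches the ES version installed on your server (ideally exactly, since ES validates a plugin's declared version when loading it) and download it.

      [Screenshot: downloading the package]

    • Alternative download

      If GitHub is hard to reach, you can download the IK 5.6.0 build that I keep in cloud storage.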

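  2. Unpack and configure

    • Create a folder named ik under ES_HOME/plugins/

    • Extract the contents of the archive into the ik folder

    • Resulting file layout

      [Screenshot: project structure]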

  3. Start ES

    When ES starts now, its log should show that the ik analyzer has been loaded.

    [Screenshot: analyzer loaded]
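
The unpacking in step 2 can also be scripted. The following is a minimal Python sketch, assuming the release was downloaded as a zip file. The function name `install_ik` and the paths in the usage note are hypothetical and do not come from the IK project.

```python
import zipfile
from pathlib import Path


def install_ik(zip_path, es_home):
    """Extract an IK release zip into ES_HOME/plugins/ik and return that directory."""
    target = Path(es_home) / "plugins" / "ik"
    # Create plugins/ik if it does not exist yet (step 2, first bullet).
    target.mkdir(parents=True, exist_ok=True)
    # Unpack the archive contents into plugins/ik (step 2, second bullet).
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    return target
```

A call might look like `install_ik("ik-5.6.0.zip", "/opt/elasticsearch")`, where both arguments are placeholders for your own download and ES_HOME. ES still has to be restarted afterwards, as in step 3.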

Testing the analysis results

Default (non-IK) analysis
POST {{host}}:{{port}}/_analyze
{
  "analyzer":"english",
  "text":"使用搜索引擎"
}
Analysis result:
{
    "tokens": [
        {
            "token": "使",
            "start_offset": 0,
            "end_offset": 1,
            "type": "<IDEOGRAPHIC>",
            "position": 0
        },
        {
            "token": "用",
            "start_offset": 1,
            "end_offset": 2,
            "type": "<IDEOGRAPHIC>",
            "position": 1
        },
        {
            "token": "搜",
            "start_offset": 2,
            "end_offset": 3,
            "type": "<IDEOGRAPHIC>",
            "position": 2
        },
        {
            "token": "索",
            "start_offset": 3,
            "end_offset": 4,
            "type": "<IDEOGRAPHIC>",
            "position": 3
        },
        {
            "token": "引",
            "start_offset": 4,
            "end_offset": 5,
            "type": "<IDEOGRAPHIC>",
            "position": 4
        },
        {
            "token": "擎",
            "start_offset": 5,
            "end_offset": 6,
            "type": "<IDEOGRAPHIC>",
            "position": 5
        }
    ]
}
ik_smart analysis
POST {{host}}:{{port}}/_analyze
{
  "analyzer":"ik_smart",
  "text":"使用搜索引擎"
}
{
    "tokens": [
        {
            "token": "使用",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "搜索引擎",
            "start_offset": 2,
            "end_offset": 6,
            "type": "CN_WORD",
            "position": 1
        }
    ]
}
ik_max_word analysis
POST {{host}}:{{port}}/_analyze
{
  "analyzer":"ik_max_word",
  "text":"使用搜索引擎"
}
{
    "tokens": [
        {
            "token": "使用",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "搜索引擎",
            "start_offset": 2,
            "end_offset": 6,
            "type": "CN_WORD",
            "position": 1
        },
        {
            "token": "搜索",
            "start_offset": 2,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 2
        },
        {
            "token": "索引",
            "start_offset": 3,
            "end_offset": 5,
            "type": "CN_WORD",
            "position": 3
        },
        {
            "token": "引擎",
            "start_offset": 4,
            "end_offset": 6,
            "type": "CN_WORD",
            "position": 4
        }
    ]
}
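
The three requests above differ only in the analyzer name, so they are easy to script. Below is a minimal Python sketch using only the standard library. The helper names (`build_analyze_request`, `token_texts`, `analyze`) and the `http://localhost:9200` address in the usage note are assumptions for illustration, not part of ES or IK.

```python
import json
import urllib.request


def build_analyze_request(base_url, analyzer, text):
    """Build a POST request for the _analyze endpoint."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def token_texts(response):
    """Return just the token strings from an _analyze response dict."""
    return [item["token"] for item in response["tokens"]]


def analyze(base_url, analyzer, text):
    """Send the request and return the token strings (needs a running ES node)."""
    request = build_analyze_request(base_url, analyzer, text)
    with urllib.request.urlopen(request) as resp:
        return token_texts(json.load(resp))
```

With a node running locally, `analyze("http://localhost:9200", "ik_smart", "使用搜索引擎")` would send the same request as the ik_smart example. Applied to the ik_smart response shown above, `token_texts` returns `["使用", "搜索引擎"]`.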

Search test with the analyzer

// Create the index
PUT {{host}}:{{port}}/news  
// Create the mapping and set the analyzer
POST {{host}}:{{port}}/news/sports/_mapping
{
    "properties":{
        "content":{
            "type":"text",
            "analyzer":"ik_max_word",
            "index":true
        }
    }
}
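
When this setup is scripted, the index and its mapping can also be sent together by putting a `mappings` section in the body of the create-index request. The sketch below only builds that body. It assumes an ES 5.x cluster, where mapping types such as `sports` still exist, and the helper name `index_body` is hypothetical.

```python
import json


def index_body(doc_type, field, analyzer):
    """Build a create-index body that maps one text field to the given analyzer."""
    return {
        "mappings": {
            doc_type: {
                "properties": {
                    field: {"type": "text", "analyzer": analyzer}
                }
            }
        }
    }


# Body for PUT {{host}}:{{port}}/news, equivalent to the two requests above.
print(json.dumps(index_body("sports", "content", "ik_max_word"), indent=2))
```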
Import data...
Data in the search engine:
{
    "took": 1,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": 4,
        "max_score": 1,
        "hits": [
            {
                "_index": "news",
                "_type": "sports",
                "_id": "AWCgyE7pGEKcCwwZuUe6",
                "_score": 1,
                "_source": {
                    "content": "熱火形勢一片大好"
                }
            },
            {
                "_index": "news",
                "_type": "sports",
                "_id": "AWCgx7fpGEKcCwwZuUe5",
                "_score": 1,
                "_source": {
                    "content": "火箭98-99不敵凱爾特人,慘遭四連敗"
                }
            },
            {
                "_index": "news",
                "_type": "sports",
                "_id": "AWCgyOLYGEKcCwwZuUe7",
                "_score": 1,
                "_source": {
                    "content": "曼城18連勝,英超無人能擋"
                }
            },
            {
                "_index": "news",
                "_type": "sports",
                "_id": "AWCgxyxXGEKcCwwZuUe4",
                "_score": 1,
                "_source": {
                    "content": "巴薩3-0擊敗皇馬贏下國家德比,梅西一球一助再獲滿分"
                }
            }
        ]
    }
}
POST {{host}}:{{port}}/news/sports/_search
{
    "query":{
        "match":{
            "content":"火箭隊新聞"
        }
    }   
}
{
    "took": 5,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": 1,
        "max_score": 0.6099695,
        "hits": [
            {
                "_index": "news",
                "_type": "sports",
                "_id": "AWCgx7fpGEKcCwwZuUe5",
                "_score": 0.6099695,
                "_source": {
                    "content": "火箭98-99不敵凱爾特人,慘遭四連敗"
                }
            }
        ]
    }
}
POST {{host}}:{{port}}/news/sports/_search
{
    "query":{
        "match":{
            "content":"火焰"
        }
    }   
}
{
    "took": 1,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": 0,
        "max_score": null,
        "hits": []
    }
}
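
When these checks are scripted, the part of each response worth comparing is `hits.hits`. The following Python sketch pulls out the score and content of each hit. The helper name `hit_contents` is hypothetical, and the two dicts are abbreviated copies of the responses above.

```python
def hit_contents(response, field="content"):
    """Return (score, field value) pairs for every hit in a _search response dict."""
    return [
        (hit["_score"], hit["_source"][field])
        for hit in response["hits"]["hits"]
    ]


# Abbreviated copies of the two search responses shown above.
rockets = {"hits": {"total": 1, "hits": [
    {"_score": 0.6099695, "_source": {"content": "火箭98-99不敵凱爾特人,慘遭四連敗"}}
]}}
flame = {"hits": {"total": 0, "max_score": None, "hits": []}}

print(hit_contents(rockets))
print(hit_contents(flame))
```

The first call yields the single Rockets headline with its score, and the second yields an empty list.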

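The analysis tests show that the Chinese analyzer splits the text to be searched into terms that carry meaning in Chinese, rather than splitting it into single characters.

The search tests show that relevant results are kept and irrelevant ones are filtered out, which makes search smarter: the query "火箭隊新聞" returns only the document about 火箭, and the query "火焰" returns nothing, even though it shares the character 火 with "熱火" and "火箭".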

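References

The following articles explain analysis in more depth. Consult them if you want more detail; this article does not go further.

How to install Chinese analyzers in Elasticsearch (IK + pinyin)

Details of IK analysis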
