Elasticsearch Core Technology and Practice study notes, Chapter 3, Lesson 20: Multi-fields and Configuring a Custom Analyzer in the Mapping

Original article

bohu83

2020-05-19 10:41

1 Preface

This post is part of my study-notes series on the Geek Time (極客時間) course Elasticsearch Core Technology and Practice.

2 Multi-field types

The multi-field feature

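Exact matching on a manufacturer name
- Add a keyword sub-field
Using different analyzers
- Different languages
- Searching on a pinyin field
- Different analyzers can also be specified for search and for indexing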
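
A minimal sketch of such a mapping, assuming Elasticsearch 7.x or later (typeless mappings). The index name products and the field name company are hypothetical, as is the choice of analyzers. The text field gets a keyword sub-field for exact matching, an English-analyzed sub-field, and a separate search-time analyzer.

```json
# hypothetical index and field names, for illustration only
PUT products
{
  "mappings": {
    "properties": {
      "company": {
        "type": "text",
        "analyzer": "standard",
        "search_analyzer": "simple",
        "fields": {
          "keyword": { "type": "keyword" },
          "english": { "type": "text", "analyzer": "english" }
        }
      }
    }
  }
}
```

An exact match would then query company.keyword, for example with a term query, while full-text queries would go to company or company.english.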

Exact Values vs. Full Text

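Exact values: numbers, dates, or one specific string (for example "Apple Store"). In Elasticsearch these correspond to the keyword type.

Full text: unstructured text data. In Elasticsearch this corresponds to the text type.

Exact values do not need to be tokenized

Elasticsearch creates an inverted index for every field. An exact value needs no special tokenization at index time; a keyword value is indexed whole, as a single term.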
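
The difference can be observed with the _analyze API. A small sketch, running the built-in keyword and standard analyzers on the sample string "Apple Store":

```json
# keyword analyzer: the whole input is emitted as a single term
POST _analyze
{
  "analyzer": "keyword",
  "text": "Apple Store"
}

# standard analyzer: the input is split into words and lowercased
POST _analyze
{
  "analyzer": "standard",
  "text": "Apple Store"
}
```

The first request should return the single token Apple Store; the second should return the two tokens apple and store.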

3 Custom analyzers

When the analyzers that ship with Elasticsearch do not meet your needs, you can define your own by combining different components:

Character Filter
Tokenizer
Token Filter

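You build a custom analyzer by combining character filters, a tokenizer, and token filters in a configuration that suits your particular data. The three kinds of component run in that order.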
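
As a rough illustration of the three stages, the _analyze API accepts all three components inline. The sample text and the "& => and" mapping are made up for this example:

```json
POST _analyze
{
  "char_filter": [
    { "type": "mapping", "mappings": ["& => and"] }
  ],
  "tokenizer": "whitespace",
  "filter": ["lowercase"],
  "text": "Tom & Jerry"
}
```

The character filter rewrites the raw text first, the tokenizer then splits it on whitespace, and lowercase finally runs on each token, so the expected tokens are tom, and, jerry.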

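3.1 Character Filters

Character Filters process the text before the Tokenizer sees it, for example by adding, removing, or replacing characters. Several Character Filters can be configured. They affect the position and offset information of the Tokenizer.
Some built-in Character Filters:

HTML strip - removes HTML tags
Mapping - string replacement
Pattern replace - regular-expression replacement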

3.2 Tokenizer

The Tokenizer splits the raw text into terms (tokens) according to a set of rules.
Tokenizers built into Elasticsearch:

whitespace | standard | uax_url_email | pattern | keyword | path_hierarchy

You can also develop a plugin in Java to implement your own Tokenizer.
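
As one example of a built-in tokenizer, path_hierarchy emits one token per level of a path. A quick sketch; the sample path is arbitrary:

```json
POST _analyze
{
  "tokenizer": "path_hierarchy",
  "text": "/usr/local/bin"
}
```

With the default settings this should produce /usr, /usr/local, and /usr/local/bin.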

3.3 Token Filters

Token Filters add, modify, or delete the terms output by the Tokenizer.
Built-in Token Filters:
- lowercase | stop | synonym (adds synonyms)

Demo

// The HTML tags are filtered out of the result.
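
The request for this step is not captured in these notes. A request along the following lines, using the built-in html_strip character filter, would have that effect; the sample text is an assumption:

```json
POST _analyze
{
  "tokenizer": "keyword",
  "char_filter": ["html_strip"],
  "text": "<b>hello world</b>"
}
```

The keyword tokenizer keeps whatever text remains as one token, which makes it easy to see what the character filter removed.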

# Use a char filter to replace characters
POST _analyze
{
  "tokenizer": "standard",
  "char_filter": [
      {
        "type" : "mapping",
        "mappings" : [ "- => _"]
      }
    ],
  "text": "123-456, I-test! test-990 650-555-1234"
}

Result: the hyphens are replaced with underscores

"tokens" : [
    {
      "token" : "123_456",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "<NUM>",
      "position" : 0
    },
    {
      "token" : "I_test",
      "start_offset" : 9,
      "end_offset" : 15,
      "type" : "<ALPHANUM>",
      "position" : 1
    },

Using a char filter to replace emoticons
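
The request itself is missing from these notes. One way to write it, assuming an inline mapping character filter and the same sample sentences that the custom analyzer section uses:

```json
POST _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type": "mapping",
      "mappings": [":) => happy", ":( => sad"]
    }
  ],
  "text": ["I am felling :)", "Feeling :( today"]
}
```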

Regular expression

The http:// prefix is replaced away
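
Here too the request is not shown. A sketch with the built-in pattern_replace character filter; the pattern and the sample URL are assumptions:

```json
GET _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type": "pattern_replace",
      "pattern": "http://(.*)",
      "replacement": "$1"
    }
  ],
  "text": "http://www.elastic.co"
}
```

The capture group keeps everything after the scheme, so the tokenizer only sees www.elastic.co.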

// whitespace tokenizer with the stop and snowball filters
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": ["stop","snowball"],
  "text": ["The rain in Spain falls mainly on the plain."]
}

Result: The, with its capital first letter, is kept, while in, on, and the are removed. falls is stemmed to fall and mainly to main.

{
  "tokens" : [
    {
      "token" : "The",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "rain",
      "start_offset" : 4,
      "end_offset" : 8,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "Spain",
      "start_offset" : 12,
      "end_offset" : 17,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "fall",
      "start_offset" : 18,
      "end_offset" : 23,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "main",
      "start_offset" : 24,
      "end_offset" : 30,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "plain.",
      "start_offset" : 38,
      "end_offset" : 44,
      "type" : "word",
      "position" : 8
    }
  ]
}

// Once lowercase is added ahead of stop, The is lowercased and then removed as a stopword.
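
The modified request is not included in the notes. Presumably it looks like the following, with lowercase placed before stop so that The becomes the before the stopword check:

```json
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": ["lowercase", "stop", "snowball"],
  "text": ["The rain in Spain falls mainly on the plain."]
}
```

Order matters here: the default English stopword list is lowercase and the stop filter is case-sensitive by default, so putting stop ahead of lowercase would leave The in the output.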

Custom analyzer

First, the demo from the official documentation (the 2.x version):

PUT /my_index
{
    "settings": {
        "analysis": {
            "char_filter": { ... custom character filters ... },
            "tokenizer":   { ...    custom tokenizers     ... },
            "filter":      { ...   custom token filters   ... },
            "analyzer":    { ...    custom analyzers      ... }
        }
    }
}

Demo:


# Define a custom analyzer
PUT my_index
{
"settings": {
  "analysis": {
    "analyzer": {
      "my_custom_analyzer":{
        "type":"custom",
        "char_filter":[
          "emoticons"
        ],
        "tokenizer":"punctuation",
        "filter":[
          "lowercase",
          "english_stop"
        ]
      }
    },
    "tokenizer": {
      "punctuation":{
        "type":"pattern",
        "pattern": "[ .,!?]"
      }
    },
    "char_filter": {
      "emoticons":{
        "type":"mapping",
        "mappings" : [ 
          ":) => happy",
          ":( => sad"
        ]
      }
    },
    "filter": {
      "english_stop":{
        "type":"stop",
        "stopwords":"_english_"
      }
    }
  }
}
}

Run it:

POST my_index/_analyze
{
"analyzer": "my_custom_analyzer",
"text": ["I am felling :)", "Feeling :( today"]
}

With both the index and the analyzer specified, the result is what we wanted:

{
  "tokens" : [
    {
      "token" : "i",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "am",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "felling",
      "start_offset" : 5,
      "end_offset" : 12,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "happy",
      "start_offset" : 13,
      "end_offset" : 15,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "feeling",
      "start_offset" : 16,
      "end_offset" : 23,
      "type" : "word",
      "position" : 104
    },
    {
      "token" : "sad",
      "start_offset" : 24,
      "end_offset" : 26,
      "type" : "word",
      "position" : 105
    },
    {
      "token" : "today",
      "start_offset" : 27,
      "end_offset" : 32,
      "type" : "word",
      "position" : 106
    }
  ]
}
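
To use the analyzer at index time, it still has to be assigned to a field in the mapping. A minimal sketch, assuming Elasticsearch 7.x and a hypothetical text field named comment:

```json
# "comment" is a hypothetical field name
PUT my_index/_mapping
{
  "properties": {
    "comment": {
      "type": "text",
      "analyzer": "my_custom_analyzer"
    }
  }
}
```

Text indexed into comment, and queries against it, would then go through my_custom_analyzer unless a separate search_analyzer is set.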
