ES Analyzers and Custom Analyzers

Original

aganliang

2020-06-17 03:16

1. Analysis and analyzer

Analysis is the process of converting full text into a series of terms (tokens); it is also called tokenization.

Analysis is carried out by an analyzer.

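2. Built-in ES analyzers

Standard Analyzer: the default analyzer; splits text on word boundaries and lowercases
Simple Analyzer: splits on non-letter characters (symbols are dropped) and lowercases
Stop Analyzer: lowercases and removes stop words (the, a, is)
Whitespace Analyzer: splits on whitespace; does not lowercase
Keyword Analyzer: no tokenization; the input is emitted as a single term
Pattern Analyzer: splits by regular expression, default \W+ (non-word characters act as separators)
Language: analyzers for more than 30 common languages
Custom Analyzer: a user-defined analyzer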
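
To make the differences in the list above concrete, here is a minimal Python sketch that imitates four of these analyzers. The function names and the tiny stop-word set are made up for illustration, and the code only handles ASCII text; it is a rough approximation, not how Elasticsearch implements them.

```python
import re

# Rough stand-ins for a few built-in analyzers (ASCII only, illustration only).

def whitespace_analyzer(text):
    # Split on whitespace; case and punctuation are kept as-is.
    return text.split()

def simple_analyzer(text):
    # Split on anything that is not a letter, then lowercase.
    return [t.lower() for t in re.split(r"[^A-Za-z]+", text) if t]

def keyword_analyzer(text):
    # No tokenization: the whole input becomes a single term.
    return [text]

STOP_WORDS = {"the", "a", "is"}  # tiny illustrative subset, not the real list

def stop_analyzer(text):
    # Same as the simple analyzer, plus stop-word removal.
    return [t for t in simple_analyzer(text) if t not in STOP_WORDS]

sample = "The 2 QUICK Brown-Foxes!"
print(whitespace_analyzer(sample))  # ['The', '2', 'QUICK', 'Brown-Foxes!']
print(simple_analyzer(sample))      # ['the', 'quick', 'brown', 'foxes']
print(keyword_analyzer(sample))     # ['The 2 QUICK Brown-Foxes!']
print(stop_analyzer(sample))        # ['quick', 'brown', 'foxes']
```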

3. The _analyze API

# Test by specifying an analyzer directly
GET /_analyze
{
    "analyzer":"standard",
    "text":"master elasticsearch!"
}

# Test with the analyzer of a field in an index ("field" takes a field name from the books mapping)
POST books/_analyze
{
    "field":"title",
    "text":"master elasticsearch!"
}

# Test an ad hoc combination of tokenizer and token filters
POST /_analyze
{
    "tokenizer":"standard",
    "filter":["lowercase"],
    "text":"master elasticsearch!"
}

Chinese analyzers
# IK

https://github.com/medcl/elasticsearch-analysis-ik

# THULAC, a Chinese word segmenter developed at Tsinghua University

https://github.com/thunlp/THULAC

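Configuring a custom analyzer
When the built-in ES analyzers do not meet your needs, you can define a custom analyzer by combining three kinds of components:
1. character filter
Processes the raw text before tokenization by adding, removing, or replacing characters. Several can be configured, as an array.
Built-in: html strip (removes HTML tags), mapping (string replacement), pattern replace (regex-based replacement)
2. tokenizer
Splits the text into terms.
3. token filter
Adds, modifies, or removes the terms emitted by the tokenizer.
Built-in: lowercase, stop, synonym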
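
The order in which the three components run can be sketched in a few lines of Python. This is a toy model under simple assumptions (a regex-based tag stripper, whitespace splitting); the helper names are hypothetical and nothing here is Elasticsearch code.

```python
import re

def html_strip(text):
    # Character filter: drop anything that looks like an HTML tag.
    return re.sub(r"<[^>]+>", "", text)

def whitespace_tokenizer(text):
    # Tokenizer: split the filtered text into terms.
    return text.split()

def lowercase(tokens):
    # Token filter: modify each term.
    return [t.lower() for t in tokens]

def analyze(text, char_filters, tokenizer, token_filters):
    # 1. character filters run on the raw text, in order
    for char_filter in char_filters:
        text = char_filter(text)
    # 2. exactly one tokenizer turns the text into terms
    tokens = tokenizer(text)
    # 3. token filters run on the terms, in order
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens

result = analyze("<b>Hello World</b>", [html_strip], whitespace_tokenizer, [lowercase])
print(result)  # ['hello', 'world']
```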

# Strip HTML tags
POST _analyze
{
	"tokenizer":"keyword",
	"char_filter":["html_strip"],
	"text":"<b>hello world</b>"
}

# Character mapping
POST _analyze
{
	"tokenizer":"standard",
	"char_filter":[
			{
				"type":"mapping",
				"mappings":["- => _"]
			}
		],
	"text":"123-456-789,i-love-u"
}

# Replace emoticons
POST _analyze
{
	"tokenizer":"standard",
	"char_filter":[
		{
			"type":"mapping",
			"mappings":[":) => happy"]
		}
	],
	"text":"i am feeling :),i-love-u"
}


# Regular expression replacement
POST _analyze
{
	"tokenizer":"standard",
	"char_filter":[
		{
			"type":"pattern_replace",
			"pattern":"http://(.*)",
			"replacement":"$1"
		}
	],
	"text":"http://www.elastic.co"
}
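
What the mapping and pattern_replace character filters do to the text before it reaches the tokenizer can be approximated in Python as follows. Both functions are hypothetical stand-ins written for this post. Note that Python's re module writes the first capture group as \1, where the Elasticsearch request uses $1.

```python
import re

def mapping_char_filter(text, mappings):
    # Parse rules written in the "key => value" notation.
    table = {}
    for rule in mappings:
        key, value = rule.split("=>", 1)
        table[key.strip()] = value.strip()
    # Try longer keys first so the longest match wins.
    keys = sorted(table, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: table[m.group(0)], text)

def pattern_replace_char_filter(text, pattern, replacement):
    # Regex-based replacement on the raw text.
    return re.sub(pattern, replacement, text)

print(mapping_char_filter("123-456-789,i-love-u", ["- => _"]))
# 123_456_789,i_love_u
print(mapping_char_filter("i am feeling :),i-love-u", [":) => happy"]))
# i am feeling happy,i-love-u
print(pattern_replace_char_filter("http://www.elastic.co", r"http://(.*)", r"\1"))
# www.elastic.co
```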

# Path hierarchy tokenizer: emits one term per directory level of the path
POST _analyze
{
	"tokenizer":"path_hierarchy",
	"text":"/user/ymruan/a/b/c/d/e"
}
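
The terms this tokenizer emits for the path above can be reproduced with a short Python sketch. The function is a simplified imitation that assumes the default "/" delimiter and ignores the tokenizer's other options.

```python
def path_hierarchy(path, delimiter="/"):
    # Emit one term per level: each term is the path up to that level.
    tokens = []
    current = ""
    for i, part in enumerate(path.split(delimiter)):
        current = part if i == 0 else current + delimiter + part
        if current:  # skip the empty prefix before a leading delimiter
            tokens.append(current)
    return tokens

for term in path_hierarchy("/user/ymruan/a/b/c/d/e"):
    print(term)
# /user
# /user/ymruan
# /user/ymruan/a
# /user/ymruan/a/b
# /user/ymruan/a/b/c
# /user/ymruan/a/b/c/d
# /user/ymruan/a/b/c/d/e
```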


# Stop word filtering
POST _analyze
{
	"tokenizer":"whitespace",
	"filter":["lowercase","stop"],
	"text":["The rain in spain falls mainly on the plain."]
}
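
The order of the token filters matters in the request above: the stop filter compares terms case-sensitively by default, so "The" is only removed because lowercase runs first. The Python sketch below imitates this with a three-word stop list chosen for illustration (the real _english_ list is longer).

```python
STOP_WORDS = {"the", "in", "on"}  # small illustrative subset of the English stop list

def whitespace_tokenizer(text):
    return text.split()

def lowercase(tokens):
    return [t.lower() for t in tokens]

def stop(tokens):
    # Case-sensitive comparison: "The" does not match "the".
    return [t for t in tokens if t not in STOP_WORDS]

text = "The rain in spain falls mainly on the plain."
tokens = whitespace_tokenizer(text)

lower_then_stop = stop(lowercase(tokens))
stop_then_lower = lowercase(stop(tokens))

print(lower_then_stop)  # ['rain', 'spain', 'falls', 'mainly', 'plain.']
print(stop_then_lower)  # ['the', 'rain', 'spain', 'falls', 'mainly', 'plain.']
```

The whitespace tokenizer leaves the trailing period attached, which is why the last term is "plain." rather than "plain".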

Configuring the index settings
PUT my_index
{
	"settings":{
		"analysis":{
			"analyzer":{
				"my_custom_analyzer":{
					"type":"custom",
					"char_filter":["emoticons"],
					"tokenizer":"punctuation",
					"filter":["lowercase","english_stop"]
				}
			},
			"tokenizer":{
				"punctuation":{
					"type":"pattern",
					"pattern":"[ .,!?]"
				}
			},
			"char_filter":{
				"emoticons":{
					"type":"mapping",
					"mappings":[
						":) => happy"
					]
				}
			},
			"filter":{
				"english_stop":{
					"type":"stop",
					"stopwords":"_english_"
				}
			}
		}
	}
}

# Use the custom analyzer defined above
POST my_index/_analyze
{
	"analyzer":"my_custom_analyzer",
	"text":"i am a :) person,and you ?"
}
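
To see what terms to expect from this request, the three stages of my_custom_analyzer can be imitated in Python. This is a sketch under assumptions: the stop-word set is my transcription of the _english_ list, and the string and regex operations only approximate the mapping char filter and pattern tokenizer.

```python
import re

# Assumed contents of the _english_ stop word list.
ENGLISH_STOP = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
    "the", "their", "then", "there", "these", "they", "this", "to", "was",
    "will", "with",
}

def my_custom_analyzer(text):
    # char_filter "emoticons": map ":)" to "happy"
    text = text.replace(":)", "happy")
    # tokenizer "punctuation": split on space . , ! ? and drop empty pieces
    tokens = [t for t in re.split(r"[ .,!?]", text) if t]
    # filter "lowercase"
    tokens = [t.lower() for t in tokens]
    # filter "english_stop"
    return [t for t in tokens if t not in ENGLISH_STOP]

terms = my_custom_analyzer("i am a :) person,and you ?")
print(terms)  # ['i', 'am', 'happy', 'person', 'you']
```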
