Elasticsearch 5.5 Mapping Explained

 


Preface

 


 

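1. Field datatypes

1.1 string type

Since Elasticsearch 5.x the string field type is deprecated and has been replaced by text and keyword. A mapping that still uses string is accepted, but it produces a deprecation warning.

Test: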

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "title": {
          "type":  "string"
        }
      }
    }
  }
}

Result:

#! Deprecation: The [string] field is deprecated, please use [text] or [keyword] instead on [title]
{
  "acknowledged": true,
  "shards_acknowledged": true
}

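1.2 text type

text replaces string for fields that need full-text search, such as the body of an email or a product description. The content of a text field is analyzed: before the inverted index is built, an analyzer splits the string into individual terms. text fields are not used for sorting and are seldom used for aggregations (the significant terms aggregation is a notable exception).

The mapping that sets the full_name field to the text type looks like this: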

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "full_name": {
          "type":  "text"
        }
      }
    }
  }
}

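1.3 keyword type

The keyword type is meant for structured content such as email addresses, hostnames, status codes, and tags. Use it when a field is needed for filtering (for example, finding all published blog posts whose status is published), sorting, or aggregations. A keyword field can only be found by its exact value.

1.4 Numeric types

Elasticsearch supports the following numeric types:

Type          Value range
long          -2^63 to 2^63-1
integer       -2^31 to 2^31-1
short         -32,768 to 32,767
byte          -128 to 127
double        64-bit double-precision IEEE 754 floating point
float         32-bit single-precision IEEE 754 floating point
half_float    16-bit half-precision IEEE 754 floating point
scaled_float  floating point backed by a long and a fixed scaling factor (for example, a price only needs cent precision: with a scaling factor of 100, a price of 57.34 is stored as 5734)

For double, float, and half_float, -0.0 and +0.0 are different values. A term query for -0.0 does not match +0.0. Likewise, in a range query an upper bound of -0.0 does not match +0.0, and a lower bound of +0.0 does not match -0.0.

Points to consider when choosing a numeric type:

  1. Pick the smallest type that satisfies the requirement. For example, if a field never exceeds 100, byte is enough. The oldest human age verified by Guinness World Records is 122, so short is more than enough for an age field. The smaller the type, the more efficient indexing and searching are.
  2. For floating-point data, prefer scaled_float when a fixed precision is acceptable.

Example: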

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "number_of_bytes": {
          "type": "integer"
        },
        "time_in_seconds": {
          "type": "float"
        },
        "price": {
          "type": "scaled_float",
          "scaling_factor": 100
        }
      }
    }
  }
}
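To make the scaling_factor arithmetic concrete, here is a small Python sketch of the idea described above: multiply by the factor, round, and store the result as an integer. The function names are made up for illustration, and this only models the arithmetic, not how Elasticsearch implements the type internally.

```python
def to_scaled(value: float, scaling_factor: int) -> int:
    """Multiply by the scaling factor and round to the nearest integer."""
    return round(value * scaling_factor)


def from_scaled(stored: int, scaling_factor: int) -> float:
    """Divide the stored integer by the scaling factor to get the value back."""
    return stored / scaling_factor


stored = to_scaled(57.34, 100)
print(stored)                    # 5734
print(from_scaled(stored, 100))  # 57.34
```

Anything beyond the precision implied by the factor is lost: with a factor of 100, 57.344 and 57.34 are stored as the same integer.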

1.5 Object type

JSON is hierarchical by nature, so a document can contain inner objects:

PUT my_index/my_type/1
{ 
  "region": "US",
  "manager": { 
    "age":     30,
    "name": { 
      "first": "John",
      "last":  "Smith"
    }
  }
}

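The document above is a JSON object that contains a manager object, which in turn contains a name object. Internally the document is indexed as a flat list of key-value pairs: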

{
  "region":             "US",
  "manager.age":        30,
  "manager.name.first": "John",
  "manager.name.last":  "Smith"
}

The mapping for this document structure is:

PUT my_index
{
  "mappings": {
    "my_type": { 
      "properties": {
        "region": {
          "type": "keyword"
        },
        "manager": { 
          "properties": {
            "age":  { "type": "integer" },
            "name": { 
              "properties": {
                "first": { "type": "text" },
                "last":  { "type": "text" }
              }
            }
          }
        }
      }
    }
  }
}
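The flattening of inner objects into dotted keys can be sketched in a few lines of Python. This is only an illustration of the key-value view shown earlier in this section; the flatten helper is hypothetical and not part of any Elasticsearch client.

```python
def flatten(obj: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into a single dict keyed by dotted paths."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into inner objects, extending the dotted path.
            flat.update(flatten(value, path + "."))
        else:
            flat[path] = value
    return flat


doc = {
    "region": "US",
    "manager": {"age": 30, "name": {"first": "John", "last": "Smith"}},
}
print(flatten(doc))
```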

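1.6 date type

JSON has no date type, so in Elasticsearch a date can be any of the following:

  1. A string containing a formatted date, e.g. "2015-01-01" or "2015/01/01 12:10:30"
  2. A long representing milliseconds-since-the-epoch
  3. An integer representing seconds-since-the-epoch

Date formats can be customized. If no format is specified, the default is: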

"strict_date_optional_time||epoch_millis"

 

Example:

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "date": {
          "type": "date" 
        }
      }
    }
  }
}

PUT my_index/my_type/1
{ "date": "2015-01-01" } 

PUT my_index/my_type/2
{ "date": "2015-01-01T12:10:30Z" } 

PUT my_index/my_type/3
{ "date": 1420070400001 } 

GET my_index/_search
{
  "sort": { "date": "asc"} 
}

 

A search without the sort clause returns the three documents:

{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 1,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "2",
        "_score": 1,
        "_source": {
          "date": "2015-01-01T12:10:30Z"
        }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "1",
        "_score": 1,
        "_source": {
          "date": "2015-01-01"
        }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "3",
        "_score": 1,
        "_source": {
          "date": 1420070400001
        }
      }
    ]
  }
}

Sorted result:

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": null,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "1",
        "_score": null,
        "_source": {
          "date": "2015-01-01"
        },
        "sort": [
          1420070400000
        ]
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "3",
        "_score": null,
        "_source": {
          "date": 1420070400001
        },
        "sort": [
          1420070400001
        ]
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "2",
        "_score": null,
        "_source": {
          "date": "2015-01-01T12:10:30Z"
        },
        "sort": [
          1420114230000
        ]
      }
    ]
  }
}
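The sort values in the response are milliseconds since the epoch. The following Python sketch reproduces them for the three documents, assuming (as the response suggests) that date strings without a time zone are interpreted as UTC. The to_epoch_millis helper is hypothetical.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(value):
    """Convert an ISO date string or an epoch-millis integer to epoch millis."""
    if isinstance(value, int):
        return value
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        # Assumption: values without a zone are treated as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return (parsed - EPOCH) // timedelta(milliseconds=1)


docs = {"1": "2015-01-01", "2": "2015-01-01T12:10:30Z", "3": 1420070400001}
order = sorted(docs, key=lambda doc_id: to_epoch_millis(docs[doc_id]))
print(order)  # ['1', '3', '2']
```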

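1.7 Array type

Elasticsearch has no dedicated array type. By default any field can hold one or more values, but all values in an array must be of the same type. For example:

  1. An array of strings: [ "one", "two" ]
  2. An array of integers: [1,3]
  3. An array of arrays: [1,[2,3]], which is equivalent to [1,2,3]
  4. An array of objects: [ { "name": "Mary", "age": 12 }, { "name": "John", "age": 10 }]

Notes:

  • When a field is added dynamically, the first value in the array determines the type of the whole array.
  • Arrays with mixed types are not supported, for example [1,"abc"].
  • An array may contain null values, which are either replaced by the configured null_value or skipped. An empty array [ ] is treated as a missing field.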

1.8 binary type

The binary type accepts a Base64-encoded string. By default the field is neither stored nor searchable.

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "name": {
          "type": "text"
        },
        "blob": {
          "type": "binary"
        }
      }
    }
  }
}

PUT my_index/my_type/1
{
  "name": "Some binary blob",
  "blob": "U29tZSBiaW5hcnkgYmxvYg==" 
}

Search the blob field:

GET my_index/_search
{
  "query": {
    "match": {
      "blob": "test" 
    }
  }
}

Response:
{
  "error": {
    "root_cause": [
      {
        "type": "query_shard_exception",
        "reason": "Binary fields do not support searching",
        "index_uuid": "fgA7UM5XSS-56JO4F4fYug",
        "index": "my_index"
      }
    ],
    "type": "search_phase_execution_exception",
    "reason": "all shards failed",
    "phase": "query",
    "grouped": true,
    "failed_shards": [
      {
        "shard": 0,
        "index": "my_index",
        "node": "3dQd1RRVTMiKdTckM68nPQ",
        "reason": {
          "type": "query_shard_exception",
          "reason": "Binary fields do not support searching",
          "index_uuid": "fgA7UM5XSS-56JO4F4fYug",
          "index": "my_index"
        }
      }
    ]
  },
  "status": 400
}

Base64 encoding and decoding tool: http://www1.tc711.com/tool/BASE64.htm
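The Base64 value used in the example can also be produced locally. This short Python sketch uses only the standard library and encodes the same text as the blob field above:

```python
import base64

raw = b"Some binary blob"

# Encode the raw bytes as the Base64 string expected by a binary field.
encoded = base64.b64encode(raw).decode("ascii")
print(encoded)  # U29tZSBiaW5hcnkgYmxvYg==

# Decoding returns the original bytes.
decoded = base64.b64decode(encoded)
print(decoded)  # b'Some binary blob'
```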

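1.9 ip type

An ip field stores IPv4 or IPv6 addresses. The query below uses CIDR notation to match every address in the 192.168.0.0/16 network.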

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "ip_addr": {
          "type": "ip"
        }
      }
    }
  }
}

PUT my_index/my_type/1
{
  "ip_addr": "192.168.1.1"
}

GET my_index/_search
{
  "query": {
    "term": {
      "ip_addr": "192.168.0.0/16"
    }
  }
}
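The term query above matches because 192.168.1.1 falls inside the 192.168.0.0/16 network. The same containment check can be sketched with Python's standard ipaddress module. The in_cidr helper is a hypothetical name, and this only illustrates CIDR membership, not how Elasticsearch evaluates the query.

```python
import ipaddress


def in_cidr(address: str, cidr: str) -> bool:
    """Return True if the address belongs to the network given in CIDR notation."""
    return ipaddress.ip_address(address) in ipaddress.ip_network(cidr)


print(in_cidr("192.168.1.1", "192.168.0.0/16"))  # True
print(in_cidr("10.0.0.1", "192.168.0.0/16"))     # False
```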

1.10 range types

The following range types are supported:

Type           Range
integer_range  -2^31 to 2^31-1
float_range    32-bit IEEE 754
long_range     -2^63 to 2^63-1
double_range   64-bit IEEE 754
date_range     date values stored as 64-bit integer milliseconds since the epoch

Typical use cases for range types are a date-range picker or an age-range selector in a front-end form.
Example:

PUT range_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "expected_attendees": {
          "type": "integer_range"
        },
        "time_frame": {
          "type": "date_range", 
          "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
        }
      }
    }
  }
}

PUT range_index/my_type/1
{
  "expected_attendees" : { 
    "gte" : 10,
    "lte" : 20
  },
  "time_frame" : { 
    "gte" : "2015-10-31 12:00:00", 
    "lte" : "2015-11-01"
  }
}

The requests above create the range_index index and add a document whose expected_attendees range is 10 to 20 and whose time_frame runs from 2015-10-31 12:00:00 to 2015-11-01.

Query:

POST range_index/_search
{
  "query" : {
    "range" : {
      "time_frame" : { 
        "gte" : "2015-08-01",
        "lte" : "2015-12-01",
        "relation" : "within" 
      }
    }
  }
}

Query result:

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1,
    "hits": [
      {
        "_index": "range_index",
        "_type": "my_type",
        "_id": "1",
        "_score": 1,
        "_source": {
          "expected_attendees": {
            "gte": 10,
            "lte": 20
          },
          "time_frame": {
            "gte": "2015-10-31 12:00:00",
            "lte": "2015-11-01"
          }
        }
      }
    ]
  }
}
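The relation parameter decides how the document's range is compared with the query's range. The Python sketch below models the three relations (within, contains, intersects) on closed intervals. It illustrates the semantics under that simplifying assumption, and the matches function is a made-up name.

```python
from datetime import datetime


def matches(doc_range, query_range, relation="intersects"):
    """Compare two closed intervals given as (low, high) tuples."""
    doc_lo, doc_hi = doc_range
    q_lo, q_hi = query_range
    if relation == "within":
        # The document range lies entirely inside the query range.
        return q_lo <= doc_lo and doc_hi <= q_hi
    if relation == "contains":
        # The document range entirely covers the query range.
        return doc_lo <= q_lo and q_hi <= doc_hi
    if relation == "intersects":
        # The two ranges overlap at least partially.
        return doc_lo <= q_hi and q_lo <= doc_hi
    raise ValueError(f"unknown relation: {relation}")


time_frame = (datetime(2015, 10, 31, 12), datetime(2015, 11, 1))
query = (datetime(2015, 8, 1), datetime(2015, 12, 1))
print(matches(time_frame, query, "within"))    # True
print(matches(time_frame, query, "contains"))  # False
```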

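1.11 nested type

The nested type is a specialized version of the object type that allows arrays of objects to be indexed and queried independently of each other. The plain object type can give surprising results here. Consider the document my_index/my_type/1: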

PUT my_index/my_type/1
{
  "group" : "fans",
  "user" : [ 
    {
      "first" : "John",
      "last" :  "Smith"
    },
    {
      "first" : "Alice",
      "last" :  "White"
    }
  ]
}

 

The user field is dynamically added as an object field.
Internally the document is flattened into the following form:

{
  "group" :        "fans",
  "user.first" : [ "alice", "john" ],
  "user.last" :  [ "smith", "white" ]
}

 

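user.first and user.last are flattened into multi-value fields, so the association between Alice and White is lost. As a result, the document incorrectly matches the following query, even though no user named Alice Smith exists: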

GET my_index/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "user.first": "Alice" }},
        { "match": { "user.last":  "Smith" }}
      ]
    }
  }
}

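The nested field type solves this shortcoming of the object type. With the mapping below, the first query (Alice and Smith) no longer matches, while the second query (Alice and White) matches and highlights the matching nested object through inner_hits: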

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "user": {
          "type": "nested" 
        }
      }
    }
  }
}

PUT my_index/my_type/1
{
  "group" : "fans",
  "user" : [
    {
      "first" : "John",
      "last" :  "Smith"
    },
    {
      "first" : "Alice",
      "last" :  "White"
    }
  ]
}

GET my_index/_search
{
  "query": {
    "nested": {
      "path": "user",
      "query": {
        "bool": {
          "must": [
            { "match": { "user.first": "Alice" }},
            { "match": { "user.last":  "Smith" }} 
          ]
        }
      }
    }
  }
}

GET my_index/_search
{
  "query": {
    "nested": {
      "path": "user",
      "query": {
        "bool": {
          "must": [
            { "match": { "user.first": "Alice" }},
            { "match": { "user.last":  "White" }} 
          ]
        }
      },
      "inner_hits": { 
        "highlight": {
          "fields": {
            "user.first": {}
          }
        }
      }
    }
  }
}
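The difference between the flattened object view and the nested view can be sketched in Python. This is a simplified model of the matching behavior described in this section (lowercased exact comparison stands in for analysis), and both helper functions are hypothetical.

```python
users = [
    {"first": "John", "last": "Smith"},
    {"first": "Alice", "last": "White"},
]


def flattened_match(objects, first, last):
    """Object-style matching: each field is one multi-value bag of terms."""
    firsts = {o["first"].lower() for o in objects}
    lasts = {o["last"].lower() for o in objects}
    return first.lower() in firsts and last.lower() in lasts


def nested_match(objects, first, last):
    """Nested-style matching: both conditions must hold within one object."""
    return any(
        o["first"].lower() == first.lower() and o["last"].lower() == last.lower()
        for o in objects
    )


print(flattened_match(users, "Alice", "Smith"))  # True (false positive)
print(nested_match(users, "Alice", "Smith"))     # False
print(nested_match(users, "Alice", "White"))     # True
```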

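1.12 token_count type

A token_count field analyzes a string and indexes the number of tokens it contains. In the example below, name.length holds the token count of name, so the term query for 3 matches only the document whose name has three tokens: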


PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "name": { 
          "type": "text",
          "fields": {
            "length": { 
              "type":     "token_count",
              "analyzer": "standard"
            }
          }
        }
      }
    }
  }
}

PUT my_index/my_type/1
{ "name": "John Smith" }

PUT my_index/my_type/2
{ "name": "Rachel Alice Williams" }

GET my_index/_search
{
  "query": {
    "term": {
      "name.length": 3 
    }
  }
}
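As a rough illustration of what name.length holds, the sketch below counts word tokens in Python. Splitting on word characters is only an approximation of the standard analyzer that happens to agree with it for these simple names, and the token_count function is a hypothetical stand-in.

```python
import re


def token_count(text: str) -> int:
    """Count word tokens; a rough stand-in for analyzing with the standard analyzer."""
    return len(re.findall(r"\w+", text))


names = {"1": "John Smith", "2": "Rachel Alice Williams"}
matching = [doc_id for doc_id, name in names.items() if token_count(name) == 3]
print(matching)  # ['2']
```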

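1.13 geo_point type

A geo_point field stores a latitude/longitude pair. The example below indexes the same point in four formats: an object, a "lat,lon" string, a geohash, and an array (note that the array form is ordered [lon, lat]), and then runs a geo_bounding_box query: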

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "location": {
          "type": "geo_point"
        }
      }
    }
  }
}

PUT my_index/my_type/1
{
  "text": "Geo-point as an object",
  "location": { 
    "lat": 41.12,
    "lon": -71.34
  }
}

PUT my_index/my_type/2
{
  "text": "Geo-point as a string",
  "location": "41.12,-71.34" 
}

PUT my_index/my_type/3
{
  "text": "Geo-point as a geohash",
  "location": "drm3btev3e86" 
}

PUT my_index/my_type/4
{
  "text": "Geo-point as an array",
  "location": [ -71.34, 41.12 ] 
}

GET my_index/_search
{
  "query": {
    "geo_bounding_box": { 
      "location": {
        "top_left": {
          "lat": 42,
          "lon": -72
        },
        "bottom_right": {
          "lat": 40,
          "lon": -74
        }
      }
    }
  }
}

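2. Meta-fields

2.1 _all

The _all field is a special catch-all field that concatenates the values of all other fields into one string, using a space as the delimiter. It is analyzed and indexed but not stored. Use it when you want to find documents containing a keyword without naming a specific field.
Example: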

PUT my_index/blog/1 
{
  "title":    "Master Java",
  "content":     "learn java",
  "author": "Tom"
}

After analysis, the _all field contains the terms: [ "master", "java", "learn", "tom" ]

Search:

GET my_index/_search
{
  "query": {
    "match": {
      "_all": "Java"
    }
  }
}

Response:

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.39063013,
    "hits": [
      {
        "_index": "my_index",
        "_type": "blog",
        "_id": "1",
        "_score": 0.39063013,
        "_source": {
          "title": "Master Java",
          "content": "learn java",
          "author": "Tom"
        }
      }
    ]
  }
}

Using copy_to to build a custom _all-style field:

PUT myindex
{
  "mappings": {
    "mytype": {
      "properties": {
        "title": {
          "type":    "text",
          "copy_to": "full_content" 
        },
        "content": {
          "type":    "text",
          "copy_to": "full_content" 
        },
        "full_content": {
          "type":    "text"
        }
      }
    }
  }
}

PUT myindex/mytype/1
{
  "title": "Master Java",
  "content": "learn Java"
}

GET myindex/_search
{
  "query": {
    "match": {
      "full_content": "java"
    }
  }
}

2.2 _field_names

The _field_names field indexes the names of every field in a document that contains a non-null value. It is mainly used by the exists query. Example:

PUT my_index/my_type/1
{
  "title": "This is a document"
}

PUT my_index/my_type/2?refresh=true
{
  "title": "This is another document",
  "body": "This document has a body"
}

GET my_index/_search
{
  "query": {
    "terms": {
      "_field_names": [ "body" ] 
    }
  }
}

Only the second document is returned, because the first document has no body field.
The same result can be obtained with an exists query:

GET my_index/_search
{
    "query": {
        "exists" : { "field" : "body" }
    }
}

2.3 _id

Every indexed document has a _type and an _id. The _id field can be used in term, terms, match, query_string, and simple_query_string queries, but not in aggregations, scripts, or sorting. Example:

PUT my_index/my_type/1
{
  "text": "Document with ID 1"
}

PUT my_index/my_type/2
{
  "text": "Document with ID 2"
}

GET my_index/_search
{
  "query": {
    "terms": {
      "_id": [ "1", "2" ] 
    }
  }
}

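2.4 _index

When querying across multiple indices, it is sometimes useful to restrict or group by specific index names. The _index field makes this possible: it can be used in term and terms queries, aggregations, scripts, and sorting.

_index is a virtual field that is not actually added to the Lucene index. It supports term and terms queries (and therefore match, query_string, and simple_query_string), but not prefix, wildcard, regexp, or fuzzy queries.

Example with two indices and two documents: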


PUT index_1/my_type/1
{
  "text": "Document in index 1"
}

PUT index_2/my_type/2
{
  "text": "Document in index 2"
}

Querying, aggregating, and sorting on the index name, and exposing it through a script field:

GET index_1,index_2/_search
{
  "query": {
    "terms": {
      "_index": ["index_1", "index_2"] 
    }
  },
  "aggs": {
    "indices": {
      "terms": {
        "field": "_index", 
        "size": 10
      }
    }
  },
  "sort": [
    {
      "_index": { 
        "order": "asc"
      }
    }
  ],
  "script_fields": {
    "index_name": {
      "script": {
        "lang": "painless",
        "inline": "doc['_index']" 
      }
    }
  }
}

 

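2.4 _meta

(Omitted.)

2.5 _parent

_parent establishes a parent-child relationship between documents in the same index. The example below first declares the relationship in the mapping, then indexes a parent document, indexes child documents with the parent ID, and finally finds the parent by querying its children.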

PUT my_index
{
  "mappings": {
    "my_parent": {},
    "my_child": {
      "_parent": {
        "type": "my_parent" 
      }
    }
  }
}


PUT my_index/my_parent/1 
{
  "text": "This is a parent document"
}

PUT my_index/my_child/2?parent=1 
{
  "text": "This is a child document"
}

PUT my_index/my_child/3?parent=1&refresh=true 
{
  "text": "This is another child document"
}


GET my_index/my_parent/_search
{
  "query": {
    "has_child": { 
      "type": "my_child",
      "query": {
        "match": {
          "text": "child document"
        }
      }
    }
  }
}

2.6 _routing

The routing value determines which shard a document is sent to. Elasticsearch computes the shard with the following formula:

shard_num = hash(_routing) % num_primary_shards

 

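By default the _routing value is the document's _id, or its _parent ID if it has one. A custom value can be supplied with the routing parameter. For example, to keep all blog posts published by user1 on the same shard, pass routing when indexing and use the same routing when retrieving: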

PUT my_index/my_type/1?routing=user1&refresh=true 
{
  "title": "This is a document"
}

GET my_index/my_type/1?routing=user1

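The _routing value can be queried directly, and a search can be restricted to the shards that match specific routing values: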

GET my_index/_search
{
  "query": {
    "terms": {
      "_routing": [ "user1" ] 
    }
  }
}

GET my_index/_search?routing=user1,user2 
{
  "query": {
    "match": {
      "title": "document"
    }
  }
}

 

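Routing can be made mandatory in the mapping. The index request at the end of this example provides no routing value and is therefore rejected: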

PUT my_index2
{
  "mappings": {
    "my_type": {
      "_routing": {
        "required": true 
      }
    }
  }
}

PUT my_index2/my_type/1 
{
  "text": "No routing value provided"
}
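The routing formula shard_num = hash(_routing) % num_primary_shards can be sketched as follows. Note the assumption: Elasticsearch uses its own hash function internally, and zlib.crc32 is used here purely as a deterministic stand-in, so the shard numbers printed by this sketch will not match a real cluster.

```python
import zlib


def shard_for(routing: str, num_primary_shards: int) -> int:
    """Pick a shard: hash the routing value, then take it modulo the shard count."""
    # crc32 is a stand-in hash for illustration only.
    return zlib.crc32(routing.encode("utf-8")) % num_primary_shards


# Documents indexed with the same routing value always land on the same shard.
for routing in ("user1", "user1", "user2"):
    print(routing, "->", shard_for(routing, 5))
```

Because the result depends on num_primary_shards, changing the number of primary shards would send existing routing values to different shards, which is why the formula is tied to a fixed shard count.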

 

2.7 _source

The _source field holds the original JSON document that was passed at index time. It is enabled by default and can be disabled:

PUT tweets
{
  "mappings": {
    "tweet": {
      "_source": {
        "enabled": false
      }
    }
  }
}

 

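In general, do not disable it unless you are sure you can do without the following:

  • The update, update_by_query, and reindex APIs
  • Highlighting
  • Backing up data, changing mappings, or upgrading an index by reindexing
  • Debugging queries or aggregations by looking at the original document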

2.8 _type

Every indexed document has a _type and an _id. The _type field can be used in queries, aggregations, scripts, and sorting. Example:

PUT my_index/type_1/1
{
  "text": "Document with type 1"
}

PUT my_index/type_2/2?refresh=true
{
  "text": "Document with type 2"
}

GET my_index/_search
{
  "query": {
    "terms": {
      "_type": [ "type_1", "type_2" ] 
    }
  },
  "aggs": {
    "types": {
      "terms": {
        "field": "_type", 
        "size": 10
      }
    }
  },
  "sort": [
    {
      "_type": { 
        "order": "desc"
      }
    }
  ],
  "script_fields": {
    "type": {
      "script": {
        "lang": "painless",
        "inline": "doc['_type']" 
      }
    }
  }
}

2.9 _uid

_uid is the combination of _type and _id, in the form type#id. Like _type, it can be used in queries, aggregations, scripts, and sorting. Example:

PUT my_index/my_type/1
{
  "text": "Document with ID 1"
}

PUT my_index/my_type/2?refresh=true
{
  "text": "Document with ID 2"
}

GET my_index/_search
{
  "query": {
    "terms": {
      "_uid": [ "my_type#1", "my_type#2" ] 
    }
  },
  "aggs": {
    "UIDs": {
      "terms": {
        "field": "_uid", 
        "size": 10
      }
    }
  },
  "sort": [
    {
      "_uid": { 
        "order": "desc"
      }
    }
  ],
  "script_fields": {
    "UID": {
      "script": {
         "lang": "painless",
         "inline": "doc['_uid']" 
      }
    }
  }
}

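3. Mapping parameters

3.1 analyzer

Specifies the analyzer for a text field (this parameter is often described as choosing a "tokenizer", but "analyzer" is the accurate term). It applies at both index time and search time. The example below configures the ik analyzer; ik_max_word comes from the third-party IK analysis plugin, which must be installed separately: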

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "content": {
          "type": "text",
          "analyzer": "ik_max_word",
          "search_analyzer": "ik_max_word"
        }
      }
    }
  }
}

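3.2 normalizer

A normalizer is applied to a keyword field before indexing (and to query input at search time) to standardize its value, for example by converting all characters to lowercase. Example: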

PUT index
{
  "settings": {
    "analysis": {
      "normalizer": {
        "my_normalizer": {
          "type": "custom",
          "char_filter": [],
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  },
  "mappings": {
    "type": {
      "properties": {
        "foo": {
          "type": "keyword",
          "normalizer": "my_normalizer"
        }
      }
    }
  }
}

PUT index/type/1
{
  "foo": "BÀR"
}

PUT index/type/2
{
  "foo": "bar"
}

PUT index/type/3
{
  "foo": "baz"
}

POST index/_refresh

GET index/_search
{
  "query": {
    "match": {
      "foo": "BAR"
    }
  }
}

After the normalizer is applied, BÀR becomes bar, so the query matches documents 1 and 2.
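The effect of the lowercase and asciifolding filters on these values can be approximated in Python. This is a sketch under the assumption that Unicode decomposition followed by dropping combining marks is a close enough stand-in for asciifolding on this input; the real filter covers many more characters.

```python
import unicodedata


def normalize(value: str) -> str:
    """Lowercase, then strip diacritics (a rough stand-in for asciifolding)."""
    lowered = value.lower()
    decomposed = unicodedata.normalize("NFKD", lowered)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


docs = {"1": "BÀR", "2": "bar", "3": "baz"}
query = normalize("BAR")
hits = [doc_id for doc_id, foo in docs.items() if normalize(foo) == query]
print(hits)  # ['1', '2']
```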

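3.3 boost

boost sets the weight of a field for relevance scoring. For example, to make a match in the title field count twice as much as a match in the content field, use the mapping below. The content field keeps the default boost of 1.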

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "title": {
          "type": "text",
          "boost": 2 
        },
        "content": {
          "type": "text"
        }
      }
    }
  }
}

 

The same effect can be achieved by specifying the boost at query time:

POST _search
{
    "query": {
        "match" : {
            "title": {
                "query": "quick brown fox",
                "boost": 2
            }
        }
    }
}

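Specifying boost at query time is the recommended approach. With the first approach the boost is hard-coded in the mapping and cannot be changed without reindexing the documents, whereas a query-time boost achieves the same effect and can be changed freely.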

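3.4 coerce

coerce cleans up dirty values so that they fit the datatype of the field. It defaults to true. For example, the integer 5 might arrive as the string "5" or as the floating-point number 5.0. With coerce enabled:

  • Strings are coerced to numbers.
  • Floating-point numbers are truncated to integers for integer fields.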

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "number_one": {
          "type": "integer"
        },
        "number_two": {
          "type": "integer",
          "coerce": false
        }
      }
    }
  }
}

PUT my_index/my_type/1
{
  "number_one": "10" 
}

PUT my_index/my_type/2
{
  "number_two": "10" 
}

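The mapping declares number_one as an integer. Although the value supplied is a string, the document is indexed successfully because coercion is enabled. The number_two field has coerce disabled, so indexing the second document fails.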

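3.5 copy_to

copy_to lets you build custom _all-style fields. In other words, the values of several fields can be copied into one combined field. For example, first_name and last_name can be copied into a full_name field.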

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "first_name": {
          "type": "text",
          "copy_to": "full_name" 
        },
        "last_name": {
          "type": "text",
          "copy_to": "full_name" 
        },
        "full_name": {
          "type": "text"
        }
      }
    }
  }
}

PUT my_index/my_type/1
{
  "first_name": "John",
  "last_name": "Smith"
}

GET my_index/_search
{
  "query": {
    "match": {
      "full_name": { 
        "query": "John Smith",
        "operator": "and"
      }
    }
  }
}

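3.6 doc_values

doc_values speed up sorting and aggregations. They are a column-oriented, on-disk data structure built at index time alongside the inverted index, trading disk space for query speed. They are enabled by default and can be disabled for fields that will never be sorted or aggregated on.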

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "status_code": { 
          "type":       "keyword"
        },
        "session_id": { 
          "type":       "keyword",
          "doc_values": false
        }
      }
    }
  }
}

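Note: text fields do not support doc_values.

3.7 dynamic

The dynamic setting controls what happens when a document contains a field that is not yet in the mapping. It accepts three values:

  • true: newly detected fields are added to the mapping (default).
  • false: newly detected fields are ignored; new fields must be added explicitly.
  • strict: if a new field is detected, an exception is thrown and the document is rejected.

Example: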

PUT my_index
{
  "mappings": {
    "my_type": {
      "dynamic": false, 
      "properties": {
        "user": { 
          "properties": {
            "name": {
              "type": "text"
            },
            "social_networks": { 
              "dynamic": true,
              "properties": {}
            }
          }
        }
      }
    }
  }
}

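Note: strict is not a boolean, so it must be written in quotes ("strict").

3.8 enabled

Elasticsearch indexes all fields by default. When enabled is set to false on a field, Elasticsearch skips parsing of its content entirely: the value can still be retrieved from _source, but it is not searchable. Because the content is never parsed, the field can hold arbitrary JSON.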

PUT my_index
{
  "mappings": {
    "session": {
      "properties": {
        "user_id": {
          "type":  "keyword"
        },
        "last_updated": {
          "type": "date"
        },
        "session_data": { 
          "enabled": false
        }
      }
    }
  }
}

PUT my_index/session/session_1
{
  "user_id": "kimchy",
  "session_data": { 
    "arbitrary_object": {
      "some_array": [ "foo", "bar", { "baz": 2 } ]
    }
  },
  "last_updated": "2015-12-06T18:20:22"
}

PUT my_index/session/session_2
{
  "user_id": "jpountz",
  "session_data": "none", 
  "last_updated": "2015-12-06T18:22:13"
}

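3.9 fielddata

Search answers the question "which documents contain this term?". Aggregations and sorting need the opposite: "which terms does this document contain?". Most fields answer that through doc_values generated at index time, but text fields do not support doc_values.

Instead, text fields use a query-time, in-memory data structure called fielddata. It is built the first time the field is used for aggregations, sorting, or in a script. Elasticsearch reads the inverted index from disk, inverts the term-to-document relationship, and keeps the result in the JVM heap.

fielddata is disabled by default on text fields because it consumes a lot of heap memory. Before enabling it, think about why you want to aggregate or sort on a text field. In most cases it does not make sense.

"New York" is analyzed into "new" and "york", so an aggregation on the text field produces two buckets, "new" and "york", when you probably wanted a single "New York" bucket. In that case, add a non-analyzed keyword sub-field: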

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "my_field": { 
          "type": "text",
          "fields": {
            "keyword": { 
              "type": "keyword"
            }
          }
        }
      }
    }
  }
}

 

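With this mapping, my_field is used for full-text search, and my_field.keyword is used for aggregations, sorting, and scripts.

3.10 format

The format parameter specifies how dates are parsed: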

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "date": {
          "type":   "date",
          "format": "yyyy-MM-dd"
        }
      }
    }
  }
}

 

More built-in date formats: https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-date-format.html

3.11 ignore_above

ignore_above sets the maximum length of a string that will be indexed or stored. Longer strings are ignored:

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "message": {
          "type": "keyword",
          "ignore_above": 15
        }
      }
    }
  }
}

PUT my_index/my_type/1 
{
  "message": "Syntax error"
}

PUT my_index/my_type/2 
{
  "message": "Syntax error with some long stacktrace"
}

GET my_index/_search 
{
  "size": 0, 
  "aggs": {
    "messages": {
      "terms": {
        "field": "message"
      }
    }
  }
}

 

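The mapping sets ignore_above to 15 for the message field. The value in the first document is shorter than 15 characters, so it is indexed. The value in the second document is longer than 15 characters, so it is not indexed, and the terms aggregation is expected to contain only "Syntax error". The response captured for this run is shown below (note that its buckets array is empty, which does not match that expectation):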

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "messages": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": []
    }
  }
}

 

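3.12 ignore_malformed

ignore_malformed lets Elasticsearch tolerate malformed values. For a login field, for example, one client might send a date while another sends an email address. By default, indexing a value of the wrong datatype throws an exception and the whole document is rejected. With ignore_malformed set to true, the exception is ignored: the malformed field is not indexed, but the other fields in the document are processed normally.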

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "number_one": {
          "type": "integer",
          "ignore_malformed": true
        },
        "number_two": {
          "type": "integer"
        }
      }
    }
  }
}

PUT my_index/my_type/1
{
  "text":       "Some text value",
  "number_one": "foo" 
}

PUT my_index/my_type/2
{
  "text":       "Some text value",
  "number_two": "foo" 
}

 

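In the example above, number_one is an integer field with ignore_malformed set to true, so document 1 is indexed successfully even though number_one contains a string (only that field is skipped). number_two is an integer field with the default ignore_malformed value of false, so indexing document 2 fails.

3.13 include_in_all

include_in_all controls whether a field's value is included in the _all field. It defaults to true, unless index is set to false for the field.
In the example below, title and content are included in _all, while date is not.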

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "title": { 
          "type": "text"
        },
        "content": { 
          "type": "text"
        },
        "date": { 
          "type": "date",
          "include_in_all": false
        }
      }
    }
  }
}

 

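include_in_all can also be set at the type level and on object fields, where it is inherited by all sub-fields. In the example below, all fields of my_type are excluded from _all by default, while author.first_name, author.last_name, and editor.last_name are included: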

PUT my_index
{
  "mappings": {
    "my_type": {
      "include_in_all": false, 
      "properties": {
        "title":          { "type": "text" },
        "author": {
          "include_in_all": true, 
          "properties": {
            "first_name": { "type": "text" },
            "last_name":  { "type": "text" }
          }
        },
        "editor": {
          "properties": {
            "first_name": { "type": "text" }, 
            "last_name":  { "type": "text", "include_in_all": true } 
          }
        }
      }
    }
  }
}

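3.14 index

The index parameter controls whether a field is indexed. A field that is not indexed cannot be searched. It accepts true or false.

3.15 index_options

index_options controls what information is added to the inverted index. It accepts the following values:

Option     Effect
docs       Only the document number is indexed
freqs      Document number and term frequencies are indexed
positions  Document number, term frequencies, and term positions are indexed; positions can be used for proximity and phrase queries
offsets    Document number, term frequencies, term positions, and start and end character offsets are indexed; offsets are used by the postings highlighter

3.16 fields

fields allows the same value to be indexed in different ways. For example, a string can be mapped as text for full-text search and as keyword for sorting and aggregations.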

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "city": {
          "type": "text",
          "fields": {
            "raw": { 
              "type":  "keyword"
            }
          }
        }
      }
    }
  }
}

PUT my_index/my_type/1
{
  "city": "New York"
}

PUT my_index/my_type/2
{
  "city": "York"
}

GET my_index/_search
{
  "query": {
    "match": {
      "city": "york" 
    }
  },
  "sort": {
    "city.raw": "asc" 
  },
  "aggs": {
    "Cities": {
      "terms": {
        "field": "city.raw" 
      }
    }
  }
}

 

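3.17 norms

norms store normalization factors that are used at query time to compute a document's relevance score. Although useful for scoring, norms consume a fair amount of disk space. If you do not need scoring on a field, disable norms for it.

3.18 null_value

A null value cannot be indexed or searched. The null_value parameter replaces explicit null values with a given value so that they can be indexed and searched. Example: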

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "status_code": {
          "type":       "keyword",
          "null_value": "NULL" 
        }
      }
    }
  }
}

PUT my_index/my_type/1
{
  "status_code": null
}

PUT my_index/my_type/2
{
  "status_code": [] 
}

GET my_index/_search
{
  "query": {
    "term": {
      "status_code": "NULL" 
    }
  }
}

 

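Document 1 is found, because its status_code is an explicit null and is replaced by "NULL". Document 2 is not found, because its status_code is an empty array, which contains no explicit null.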
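The distinction between an explicit null and an empty array can be modeled with a short Python sketch. It shows which values would end up indexed for a field with a configured null_value; the indexed_values function is a hypothetical helper for illustration.

```python
def indexed_values(value, null_value=None):
    """Return the list of values that would be indexed for a field."""
    values = value if isinstance(value, list) else [value]
    result = []
    for item in values:
        if item is None:
            # An explicit null is replaced by null_value, or skipped if none is set.
            if null_value is not None:
                result.append(null_value)
        else:
            result.append(item)
    return result


print(indexed_values(None, "NULL"))          # ['NULL']
print(indexed_values([], "NULL"))            # []
print(indexed_values([None, "ok"], "NULL"))  # ['NULL', 'ok']
```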

3.19 position_increment_gap

To support proximity and phrase queries, analyzed text fields take term positions into account. Consider a field whose value is an array:

 "names": [ "John Abraham", "Lincoln Smith"]

 

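To keep the first value and the second value apart, a gap is inserted between Abraham and Lincoln in the index. The default gap is 100. As a result, the phrase query "Abraham Lincoln" below does not match: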

PUT my_index/groups/1
{
    "names": [ "John Abraham", "Lincoln Smith"]
}

GET my_index/groups/_search
{
    "query": {
        "match_phrase": {
            "names": {
                "query": "Abraham Lincoln" 
            }
        }
    }
}

 

With a slop larger than the gap of 100, the phrase query does match:

GET my_index/groups/_search
{
    "query": {
        "match_phrase": {
            "names": {
                "query": "Abraham Lincoln",
                "slop": 101 
            }
        }
    }
}

The gap can be set in the mapping with the position_increment_gap parameter:

PUT my_index
{
  "mappings": {
    "groups": {
      "properties": {
        "names": {
          "type": "text",
          "position_increment_gap": 0 
        }
      }
    }
  }
}

3.20 properties

Object and nested fields contain sub-fields, which are declared with the properties parameter.

PUT my_index
{
  "mappings": {
    "my_type": { 
      "properties": {
        "manager": { 
          "properties": {
            "age":  { "type": "integer" },
            "name": { "type": "text"  }
          }
        },
        "employees": { 
          "type": "nested",
          "properties": {
            "age":  { "type": "integer" },
            "name": { "type": "text"  }
          }
        }
      }
    }
  }
}

A document that fits this mapping:

PUT my_index/my_type/1 
{
  "region": "US",
  "manager": {
    "name": "Alice White",
    "age": 30
  },
  "employees": [
    {
      "name": "John Smith",
      "age": 34
    },
    {
      "name": "Peter Brown",
      "age": 26
    }
  ]
}

 

Sub-fields such as manager.name and manager.age can be used in searches and aggregations by their dotted path:

GET my_index/_search
{
  "query": {
    "match": {
      "manager.name": "Alice White" 
    }
  },
  "aggs": {
    "Employees": {
      "nested": {
        "path": "employees"
      },
      "aggs": {
        "Employee Ages": {
          "histogram": {
            "field": "employees.age", 
            "interval": 5
          }
        }
      }
    }
  }
}

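3.21 search_analyzer

Usually the same analyzer should be used at index time and at search time, so that the terms in the query are in the same form as the terms in the index. Sometimes a different analyzer is needed at search time, for example when an edge_ngram filter is used to implement autocomplete.

By default, queries use the analyzer defined by the analyzer parameter, but this can be overridden with search_analyzer. Example: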

PUT my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20
        }
      },
      "analyzer": {
        "autocomplete": { 
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "autocomplete_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "text": {
          "type": "text",
          "analyzer": "autocomplete", 
          "search_analyzer": "standard" 
        }
      }
    }
  }
}

PUT my_index/my_type/1
{
  "text": "Quick Brown Fox" 
}

GET my_index/_search
{
  "query": {
    "match": {
      "text": {
        "query": "Quick Br", 
        "operator": "and"
      }
    }
  }
}
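Why the index-time and search-time analyzers differ here can be seen in a small Python sketch. It models the autocomplete analyzer as word splitting, lowercasing, and edge n-grams, and the search analyzer as word splitting and lowercasing only. The regular-expression tokenizer is an assumption that approximates the standard tokenizer for this input, and all function names are hypothetical.

```python
import re


def standard_tokens(text):
    """Rough stand-in for the standard analyzer: split on words and lowercase."""
    return [token.lower() for token in re.findall(r"\w+", text)]


def edge_ngrams(token, min_gram=1, max_gram=20):
    """Return the prefixes of a token with lengths from min_gram to max_gram."""
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]


def index_terms(text):
    """Terms produced at index time by the autocomplete analyzer."""
    terms = set()
    for token in standard_tokens(text):
        terms.update(edge_ngrams(token))
    return terms


indexed = index_terms("Quick Brown Fox")
query_terms = standard_tokens("Quick Br")  # search-time analysis, no n-grams
print(sorted(edge_ngrams("quick")))        # ['q', 'qu', 'qui', 'quic', 'quick']
print(all(term in indexed for term in query_terms))  # True
```

If the query were also passed through the edge n-gram filter, it would expand to single letters such as "q" and "b", which is why a plain search analyzer is paired with the n-gram index analyzer.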

 

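3.22 similarity

The similarity parameter selects the scoring model for a field. Three models are available out of the box:

  • BM25: the default scoring model in Elasticsearch and Lucene
  • classic: TF/IDF scoring
  • boolean: a simple boolean model, for cases where full-text ranking is not needed

Example: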
PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "default_field": { 
          "type": "text"
        },
        "classic_field": {
          "type": "text",
          "similarity": "classic" 
        },
        "boolean_sim_field": {
          "type": "text",
          "similarity": "boolean" 
        }
      }
    }
  }
}

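default_field uses the BM25 model automatically, classic_field uses the classic TF/IDF model, and boolean_sim_field uses the boolean model.

3.23 store

By default, field values are indexed and searchable but not stored. That is usually fine, because the _source field keeps a copy of the original document. In some situations the store parameter makes sense. For example, if a document has a title, a date, and a very large content field, and you only want to retrieve title and date, you can do this: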

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "title": {
          "type": "text",
          "store": true 
        },
        "date": {
          "type": "date",
          "store": true 
        },
        "content": {
          "type": "text"
        }
      }
    }
  }
}

PUT my_index/my_type/1
{
  "title":   "Some short title",
  "date":    "2015-01-01",
  "content": "A very long content field..."
}

GET my_index/_search
{
  "stored_fields": [ "title", "date" ] 
}

 

Query result:

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "1",
        "_score": 1,
        "fields": {
          "date": [
            "2015-01-01T00:00:00.000Z"
          ],
          "title": [
            "Some short title"
          ]
        }
      }
    ]
  }
}

 

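Stored fields are always returned as arrays. To get the original value, read it from _source.

3.24 term_vector

Term vectors contain the following information about the terms produced by analysis:

  • The list of terms
  • The position of each term
  • The start and end character offsets that map each term back to the original string

The term_vector parameter accepts the following values:

Value                   Meaning
no                      Default; no term vectors are stored
yes                     Only the terms are stored
with_positions          Terms and positions are stored
with_offsets            Terms and character offsets are stored
with_positions_offsets  Terms, positions, and character offsets are stored

Example: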

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "text": {
          "type":        "text",
          "term_vector": "with_positions_offsets"
        }
      }
    }
  }
}

PUT my_index/my_type/1
{
  "text": "Quick brown fox"
}

GET my_index/_search
{
  "query": {
    "match": {
      "text": "brown fox"
    }
  },
  "highlight": {
    "fields": {
      "text": {} 
    }
  }
}
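To show what with_positions_offsets records for the example text, here is a Python sketch that computes each term's position and character offsets. It approximates analysis with a word-splitting regular expression plus lowercasing, which is an assumption, and the term_vector function is a hypothetical helper rather than an Elasticsearch API.

```python
import re


def term_vector(text):
    """Map each term to a list of (position, start_offset, end_offset) tuples."""
    vector = {}
    for position, match in enumerate(re.finditer(r"\w+", text)):
        term = match.group().lower()
        vector.setdefault(term, []).append((position, match.start(), match.end()))
    return vector


tv = term_vector("Quick brown fox")
print(tv["quick"])  # [(0, 0, 5)]
print(tv["brown"])  # [(1, 6, 11)]
print(tv["fox"])    # [(2, 12, 15)]
```

A highlighter can use these offsets to mark the matched terms in the original string without analyzing the text again.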

4. Dynamic mapping

4.1 _default_ mapping

The _default_ mapping in an index serves as the base mapping: the other mapping types automatically inherit its settings.

PUT my_index
{
  "mappings": {
    "_default_": { 
      "_all": {
        "enabled": false
      }
    },
    "user": {}, 
    "blogpost": { 
      "_all": {
        "enabled": true
      }
    }
  }
}

 

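In the mapping above, _default_ disables the _all field. The user type inherits the settings from _default_, so _all is disabled for user as well. The blogpost type enables _all, overriding the setting from _default_.

When _default_ is updated after the index has been created, the new defaults only affect mapping types that are created afterwards.

4.2 Dynamic field mapping

When a document containing a previously unseen field is indexed, the new field is added automatically to the type mapping. This behavior is controlled by the dynamic setting: false ignores new fields, and strict throws an exception. If dynamic is true, Elasticsearch infers the field type from the JSON value and derives the mapping as follows:

JSON value             Inferred field type
null                   No field is added
true or false          boolean
floating point number  float
integer                long
object                 object
array                  Depends on the first non-null value in the array
string                 date (if the value passes date detection), double or long (if the value passes numeric detection), or text with a keyword sub-field

By default, date detection checks whether a string matches one of the following date formats:

[ "strict_date_optional_time","yyyy/MM/dd HH:mm:ss Z||yyyy/MM/dd Z"]

Example: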

PUT my_index/my_type/1
{
  "create_date": "2015/09/02"
}

GET my_index/_mapping

The resulting mapping shows that create_date was detected as a date field:

{
  "my_index": {
    "mappings": {
      "my_type": {
        "properties": {
          "create_date": {
            "type": "date",
            "format": "yyyy/MM/dd HH:mm:ss||yyyy/MM/dd||epoch_millis"
          }
        }
      }
    }
  }
}

Disabling date detection:

PUT my_index
{
  "mappings": {
    "my_type": {
      "date_detection": false
    }
  }
}

PUT my_index/my_type/1 
{
  "create": "2015/09/02"
}

Checking the mapping again shows that the create field is no longer a date:

GET my_index/_mapping
Response:
{
  "my_index": {
    "mappings": {
      "my_type": {
        "date_detection": false,
        "properties": {
          "create": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          }
        }
      }
    }
  }
}

Customizing the formats used for date detection:

PUT my_index
{
  "mappings": {
    "my_type": {
      "dynamic_date_formats": ["MM/dd/yyyy"]
    }
  }
}

PUT my_index/my_type/1
{
  "create_date": "09/25/2015"
}

Enabling numeric detection, so that strings containing numbers are mapped as numeric fields:

PUT my_index
{
  "mappings": {
    "my_type": {
      "numeric_detection": true
    }
  }
}

PUT my_index/my_type/1
{
  "my_float":   "1.0", 
  "my_integer": "1" 
}
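The inference rules in this section can be summarized as a Python sketch. It is a simplified model built from the table above, not the actual Elasticsearch logic: the date formats are approximated with three strptime patterns, and the infer_type and looks_like_date names are hypothetical.

```python
from datetime import datetime

# Assumption: a small subset of the default detection formats, for illustration.
DATE_FORMATS = ("%Y-%m-%d", "%Y/%m/%d %H:%M:%S", "%Y/%m/%d")


def looks_like_date(text):
    """Return True if the string parses with one of the sample date formats."""
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(text, fmt)
            return True
        except ValueError:
            continue
    return False


def infer_type(value, date_detection=True, numeric_detection=False):
    """Return the field type a dynamically added JSON value would get, or None."""
    if value is None:
        return None  # no field is added
    if isinstance(value, bool):  # check bool before int: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, dict):
        return "object"
    if isinstance(value, list):
        first = next((item for item in value if item is not None), None)
        return infer_type(first, date_detection, numeric_detection)
    if date_detection and looks_like_date(value):
        return "date"
    if numeric_detection:
        for parser, field_type in ((int, "long"), (float, "double")):
            try:
                parser(value)
                return field_type
            except ValueError:
                continue
    return "text"  # mapped with a keyword sub-field


print(infer_type("2015/09/02"))                        # date
print(infer_type("2015/09/02", date_detection=False))  # text
print(infer_type("1.0", numeric_detection=True))       # double
```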

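4.3 Dynamic templates

Dynamic templates apply custom mappings to dynamically added fields, based on the detected datatype and on the field name. In the example below, fields detected as string are given the mapping:

  "mapping": { "type": "long"}

but only if the field name matches long_* and does not match *_text: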

PUT my_index
{
  "mappings": {
    "my_type": {
      "dynamic_templates": [
        {
          "longs_as_strings": {
            "match_mapping_type": "string",
            "match":   "long_*",
            "unmatch": "*_text",
            "mapping": {
              "type": "long"
            }
          }
        }
      ]
    }
  }
}

PUT my_index/my_type/1
{
  "long_num": "5", 
  "long_text": "foo" 
}

After the document is indexed, long_num is mapped as long, while long_text keeps the default string mapping.
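The match and unmatch conditions of the template can be illustrated in Python. The sketch uses fnmatchcase from the standard library for the * wildcard, which is an approximation of the simple pattern matching used by dynamic templates, and template_applies is a hypothetical helper.

```python
from fnmatch import fnmatchcase


def template_applies(field_name, json_type, template):
    """Check whether a dynamic template applies to a newly detected field."""
    expected_type = template.get("match_mapping_type")
    if expected_type is not None and expected_type != json_type:
        return False
    match = template.get("match")
    if match is not None and not fnmatchcase(field_name, match):
        return False
    unmatch = template.get("unmatch")
    if unmatch is not None and fnmatchcase(field_name, unmatch):
        return False
    return True


longs_as_strings = {
    "match_mapping_type": "string",
    "match": "long_*",
    "unmatch": "*_text",
    "mapping": {"type": "long"},
}

print(template_applies("long_num", "string", longs_as_strings))   # True
print(template_applies("long_text", "string", longs_as_strings))  # False
```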

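4.4 Override default template

A _default_ mapping placed in an index template can override the default mapping settings for all new indices. The example below disables the _all field for every index whose name matches the template pattern: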

PUT _template/disable_all_field
{
  "order": 0,
  "template": "*", 
  "mappings": {
    "_default_": { 
      "_all": { 
        "enabled": false
      }
    }
  }
}