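Elasticsearch Series: Data Modeling in Practice

Overview

Using real-world cases as background, this article compares how different technology components model the same data, weighs the pros and cons of the common ways to run join-style queries in ES, and finally introduces the path_hierarchy tokenizer for file-system paths and the use of nested objects.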

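Data model comparison

In real projects, e-commerce platforms commonly combine Java, MySQL and Elasticsearch. We use the basic department-employee entities as the example.

JavaBean type definitions

As JavaBeans, the entities would be defined like this: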

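import java.util.List;

// A department keeps only the IDs of its employees.
public class Department {
	private Long id;
	private String name;
	private String desc;
	private List<Long> userIds;
}

// An employee holds a reference to its department object.
public class Employee {
	private Long id;
	private String name;
	private byte gender;
	private Department dept;
}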

Database model definition

In a relational database (MySQL), the tables would be created like this:

create table t_department (
	id bigint(20) not null auto_increment,
	name varchar(30) not null,
	-- desc is a reserved word in MySQL, so it must be quoted with backticks
	`desc` varchar(80) not null,
	PRIMARY KEY (`id`)
);

create table t_employee (
	id bigint(20) not null auto_increment,
	name varchar(30) not null,
	gender tinyint(1) not null,
	dept_id bigint(20),
	PRIMARY KEY (`id`)
);

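The tables follow the three normal forms: each entity gets its own table, and the tables are related through primary and foreign keys. Under current table-design conventions, foreign key constraints are no longer declared in the database; the relationship is enforced in the application layer instead.

ES document data model

In ES's document data model, the document would be designed like this: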

{
	"deptId": 1,
	"deptname": "CEO Office",
	"desc": "A CEO with ideals",
	"employee":[
		{
			"userId":1,
			"name":"Lily",
			"gender":0
		},
		{
			"userId":2,
			"name":"Lucy",
			"gender":0
		},
		{
			"userId":3,
			"name":"Tom",
			"gender":1
		}
	]
}

ES's data model is closer to an object-oriented one: all related data is placed in a single document.
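As a rough illustration of the difference between the two models, the following Python sketch folds relational-style rows into one document per department. The rows and the helper name build_department_docs are made up for illustration, and no database or Elasticsearch client is involved.

```python
# Hypothetical rows, shaped like the t_department / t_employee tables above.
departments = [{"id": 1, "name": "CEO Office", "desc": "A CEO with ideals"}]
employees = [
    {"id": 1, "name": "Lily", "gender": 0, "dept_id": 1},
    {"id": 2, "name": "Lucy", "gender": 0, "dept_id": 1},
    {"id": 3, "name": "Tom", "gender": 1, "dept_id": 1},
]


def build_department_docs(departments, employees):
    """Fold each department's employees into one document per department."""
    docs = []
    for dept in departments:
        docs.append({
            "deptId": dept["id"],
            "deptname": dept["name"],
            "desc": dept["desc"],
            # The join on dept_id happens once, at build time.
            "employee": [
                {"userId": e["id"], "name": e["name"], "gender": e["gender"]}
                for e in employees
                if e["dept_id"] == dept["id"]
            ],
        })
    return docs


docs = build_department_docs(departments, employees)
```

The join is paid once when the document is built, so a later read needs only a single lookup.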

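JOIN queries

We take a blog site as the case background and model its blogs and users.

Create separate documents for users and blogs, splitting the entities much like database 3NF, and use a key field (userId) to establish the dependency.

First create the two entity documents, with one sample record each: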

PUT /blog/user/1
{
  "id":1,
  "username":"Lily",
  "age":18
}

PUT /website/article/1
{
  "title":"my first blog",
  "content":"this is my first blog, thank you",
  "userId":1
}

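Requirement: find the blogs published by the user named Lily.
Steps: 1) query the user documents by the name Lily to obtain the userId;
2) using the userId returned by step 1, build a new request and query the blog documents.
Sample requests: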

GET /blog/user/_search
{
  "query": {
    "match": {
      "username.keyword": "Lily"
    }
  }
}

GET /website/article/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "terms": {
          "userId": [
            "1"
          ]
        }
      }
    }
  }
}

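The steps above are what is called an application-layer join.

Pros: the structure is clear, the data is not duplicated, and maintenance is easy.
Cons: because the join happens in the application layer, query performance drops sharply when a lot of data has to be related.

Suitable scenario: a two-level join where the first-level query is close to an exact match and returns very few results, while the second level holds a very large volume of data. In this example, looking up userId by name returns relatively few records, so the second-level query performs well; the second level is business data, whose volume is bound to be large.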
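The two-step flow can be sketched in Python as follows. This is a minimal in-memory simulation, not a client for a real cluster: the lists stand in for the two indices, and all function names are hypothetical.

```python
# In-memory stand-ins for the two indices; no Elasticsearch client is involved.
users = [{"id": 1, "username": "Lily", "age": 18}]
articles = [
    {"title": "my first blog", "content": "this is my first blog, thank you", "userId": 1},
]


def find_user_ids(users, username):
    """Step 1: exact match on username, returning the matching ids."""
    return [u["id"] for u in users if u["username"] == username]


def find_articles_by_user_ids(articles, user_ids):
    """Step 2: the in-memory equivalent of a terms filter on userId."""
    wanted = set(user_ids)
    return [a for a in articles if a["userId"] in wanted]


def articles_by_username(users, articles, username):
    """Application-layer join: chain the two lookups."""
    user_ids = find_user_ids(users, username)
    if not user_ids:
        return []  # skip the second lookup when step 1 finds nobody
    return find_articles_by_user_ids(articles, user_ids)


result = articles_by_username(users, articles, "Lily")
```

The size of the id list produced by step 1 drives the cost of step 2, which is why this approach works best when the first lookup is highly selective.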

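Reducing application-layer joins with moderate redundancy

Plain query

Continuing the example above, change the blog document so that username is duplicated into it, for example: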

PUT /website/article/2
{
  "title":"my second blog",
  "content":"this is my second blog, thank you",
  "userInfo": {
    "id":1,
    "username":"Lily"
  }
}

The query can then filter on username directly:

GET /website/article/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "userInfo.username.keyword": "Lily"
        }
      }
    }
  }
}

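Pros: a single query is enough, so performance is good.
Cons: if a duplicated field is updated, maintenance becomes very painful.

Suitable scenario: a moderate amount of redundancy is worthwhile because it reduces join queries; relational database designs often duplicate data for the same reason. When choosing which fields to duplicate, prefer fields that are unlikely to change, so that fast queries today do not turn into painful updates later.

Aggregation and grouping on the redundant design

Add some more test data: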
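To make the update cost concrete, here is a small Python sketch that counts how many blog documents must be rewritten when one user changes their name. The sample documents and the helper rename_user are hypothetical, and the data lives in memory rather than in an index.

```python
# Hypothetical blog documents that each carry a redundant copy of the username.
articles = [
    {"id": 2, "title": "my second blog", "userInfo": {"id": 1, "username": "Lily"}},
    {"id": 3, "title": "my third blog", "userInfo": {"id": 2, "username": "Lucy"}},
    {"id": 4, "title": "my 4th blog", "userInfo": {"id": 2, "username": "Lucy"}},
]


def rename_user(articles, user_id, new_username):
    """Rewrite the redundant username in every article by that user.

    Returns the number of documents that had to be touched.
    """
    touched = 0
    for article in articles:
        if article["userInfo"]["id"] == user_id:
            article["userInfo"]["username"] = new_username
            touched += 1
    return touched


touched = rename_user(articles, 2, "Lucy2")
```

One rename touches every document written by that user, which is why rarely-changing fields are the safer candidates for duplication.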

PUT /website/article/3
{
  "title":"my third blog",
  "content":"this is my third blog, thank you",
  "userInfo": {
    "id":2,
    "username":"Lucy"
  }
}

PUT /website/article/4
{
  "title":"my 4th blog",
  "content":"this is my 4th blog, thank you",
  "userInfo": {
    "id":2,
    "username":"Lucy"
  }
}

Grouped query: which blogs did Lily publish, and which did Lucy publish?

GET website/article/_search
{
  "size": 0,
  "aggs": {
    "group_by_username": {
      "terms": {
        "field": "userInfo.username.keyword"
      },
      "aggs": {
        "top_articles": {
          "top_hits": {
            "size": 10,
            "_source": {
              "includes": "title"
            }
          }
        }
      }
    }
  }
}
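What this aggregation computes can be approximated in plain Python: one bucket per username, with up to ten titles per bucket. The sketch below works on hypothetical in-memory copies of the sample documents and keeps insertion order, so it does not reproduce the relevance ordering that top_hits applies.

```python
from collections import defaultdict

# In-memory copies of the sample blog documents that carry userInfo.
articles = [
    {"title": "my second blog", "userInfo": {"id": 1, "username": "Lily"}},
    {"title": "my third blog", "userInfo": {"id": 2, "username": "Lucy"}},
    {"title": "my 4th blog", "userInfo": {"id": 2, "username": "Lucy"}},
]


def titles_by_username(articles, top_n=10):
    """Group titles by username, keeping at most top_n titles per group."""
    groups = defaultdict(list)
    for article in articles:
        groups[article["userInfo"]["username"]].append(article["title"])
    return {name: titles[:top_n] for name, titles in groups.items()}


grouped = titles_by_username(articles)
```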

File search

File data has one prominent characteristic: a directory hierarchy. If we need to search files, we can build the index like this:

PUT /files
{
  "settings": {
    "analysis": {
      "analyzer": {
        "paths": {
          "tokenizer":"path_hierarchy"
        }
      }
    }
  }
}

PUT /files/_mapping/file
{
  "properties": {
    "name": {
      "type": "keyword"
    },
    "path": {
      "type": "keyword",
      "fields": {
        "tree": {
          "type": "text",
          "analyzer": "paths"
        }
      }
    }
  }
}

Note the path_hierarchy tokenizer: it splits /opt/data/log into

/opt

/opt/data

/opt/data/log
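The splitting rule can be emulated in a few lines of Python. This is only an approximation of the tokenizer's default behaviour for a simple path with / as the delimiter: the function name is hypothetical, and edge cases such as trailing or repeated delimiters are deliberately simplified and may differ from the real tokenizer.

```python
def path_hierarchy_tokens(path, delimiter="/"):
    """Return every ancestor prefix of path, shortest first.

    Simplification: empty segments (from trailing or repeated
    delimiters) are dropped rather than preserved.
    """
    parts = [p for p in path.split(delimiter) if p]
    leading = delimiter if path.startswith(delimiter) else ""
    return [
        leading + delimiter.join(parts[:i])
        for i in range(1, len(parts) + 1)
    ]


tokens = path_hierarchy_tokens("/opt/data/log")
```

Because every ancestor directory becomes its own token, a search for a parent directory can match files stored anywhere beneath it.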

Insert one test record:

PUT /files/file/1
{
  "name":"hello.txt",
  "path":"/opt/data/txt/"
}

Search examples

Search by file name and exact path:

GET files/file/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "name": "hello.txt"
          }
        },
        {
          "match": {
            "path": "/opt/data/txt/"
          }
        }
      ]
    }
  }
}

Find hello.txt anywhere under /opt (subdirectories included):

GET files/file/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "name": "hello.txt"
          }
        },
        {
          "match": {
            "path.tree": "/opt/"
          }
        }
      ]
    }
  }
}

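Difference between path and path.tree:
path.tree is analyzed, using the path_hierarchy tokenizer.
path is not analyzed and is matched as-is.

The nested object data type

The problem

When a plain object is used for redundant data and that data is an array or collection, queries can go wrong. For example, the comments under a blog post form a collection: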

PUT /website/article/5
{
  "title": "清茶豆奶發表的一篇技術帖子",
  "content":  "我是清茶豆奶，大家要不要考慮關注一下Java架構社區啊",
  "tags":  [ "IT技術", "Java架構社區" ],
  "comments": [ 
    {
      "name":    "清茶",
      "comment": "有什麼乾貨沒有啊？",
      "age":     29,
      "stars":   4,
      "date":    "2019-10-29"
    },
    {
      "name":    "豆奶",
      "comment": "我最喜歡研究技術，真好",
      "age":     32,
      "stars":   5,
      "date":    "2019-10-30"
    }
  ]
}

Requirement: find the blogs commented on by a 29-year-old user named 豆奶

GET /website/article/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "comments.name.keyword": "豆奶"
          }
        },
        {
          "match": {
            "comments.age": "29"
          }
        }
      ]
    }
  }
}

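Going by the sample data, this query should return nothing, yet it actually returns the document. Why?

Reason: in the underlying storage, the object type flattens the JSON. For the example above, the stored structure becomes: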

{
	"title":["清茶","豆奶","發表","一篇","技術","帖子"],
	"content": ["我","清茶","豆奶","大家","要不要","考慮","關注","一下","Java架構社區"],
	"tags":["IT技術", "Java架構社區"],
	"comments.name":["清茶","豆奶"],
	"comments.comment":["有","什麼","乾貨","沒有啊","我","最喜歡","研究","技術","真好"],
	"comments.age":[29,32],
	"comments.stars":[4,5],
	"comments.date":["2019-10-29","2019-10-30"]
}

As a result, "豆奶" and 29 both match, even though they belong to different comments, which is not the expected outcome.
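The effect of flattening can be reproduced with a short Python sketch. It assumes nothing about Elasticsearch internals beyond the flattened layout shown above; the helper flatten is hypothetical, and only two fields of each comment are kept for brevity.

```python
# Two comments from the sample blog document, reduced to the queried fields.
comments = [
    {"name": "清茶", "age": 29},
    {"name": "豆奶", "age": 32},
]


def flatten(objects, prefix):
    """Collect each field's values across all objects.

    The link between values that came from the same object is lost.
    """
    flat = {}
    for obj in objects:
        for key, value in obj.items():
            flat.setdefault(f"{prefix}.{key}", []).append(value)
    return flat


flat = flatten(comments, "comments")

# Each condition is checked against its own value list, independently.
matches = "豆奶" in flat["comments.name"] and 29 in flat["comments.age"]
```

Here matches ends up True even though no single comment was written by a 29-year-old named 豆奶.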

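The solution

The nested object type solves this problem.
Change the mapping so that comments has the nested type.
Delete the index first, recreate it with the mapping below, and then index the sample document again: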

PUT /website
{
  "mappings": {
    "article": {
      "properties": {
        "comments": {
          "type": "nested",
          "properties": {
            "name": {"type":"text"},
            "comment": {"type":"text"},
            "age":     {"type":"short"},
            "stars":   {"type":"short"},
            "date":  {"type":"date"}
          }
        }
      }
    }
  }
}

The underlying data structure then becomes:

{
	"title":["清茶","豆奶","發表","一篇","技術","帖子"],
	"content": ["我","清茶","豆奶","大家","要不要","考慮","關注","一下","Java架構社區"],
	"tags":["IT技術", "Java架構社區"],
	"comments":[
		{
			"name":"清茶",
			"comment":["有","什麼","乾貨","沒有啊"],
			"age":29,
			"stars":4,
			"date":"2019-10-29"
		},
		{
			"name":"豆奶",
			"comment":["我","最喜歡","研究","技術","真好"],
			"age":32,
			"stars":5,
			"date":"2019-10-30"
		}
	]
}

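Running the query again now returns no results, which is what we expect. Note that fields inside a nested type can only be searched through a nested query with "path": "comments"; because the new mapping defines name as text without a keyword sub-field, such a query should target comments.name.

Aggregation example

Compute the average stars of blog comments per day: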
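The per-object matching that nested provides can be sketched in Python as below. The helper nested_match is hypothetical and only imitates the matching rule in memory. The nested_query dictionary shows how the same requirement might be written as a nested query body, assuming the mapping defined above; it is not sent to any cluster here.

```python
# Two comments from the sample blog document, reduced to the queried fields.
comments = [
    {"name": "清茶", "age": 29},
    {"name": "豆奶", "age": 32},
]


def nested_match(objects, **conditions):
    """True only if a single object satisfies every condition."""
    return any(
        all(obj.get(field) == value for field, value in conditions.items())
        for obj in objects
    )


same_comment = nested_match(comments, name="豆奶", age=29)  # no such comment
real_pair = nested_match(comments, name="豆奶", age=32)     # the second comment

# Assumed request body for the same requirement, using the nested query.
nested_query = {
    "query": {
        "nested": {
            "path": "comments",
            "query": {
                "bool": {
                    "must": [
                        {"match": {"comments.name": "豆奶"}},
                        {"match": {"comments.age": 29}},
                    ]
                }
            },
        }
    }
}
```

Both conditions must hold within the same comment, so the 29-year-old 豆奶 combination no longer matches.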

GET /website/article/_search
{
  "size": 0,
  "aggs": {
    "comments_path": {
      "nested": {
        "path": "comments"
      }, 
      "aggs": {
        "group_by_comments_date": {
          "date_histogram": {
            "field": "comments.date",
            "interval": "day",
            "format": "yyyy-MM-dd"
          }, 
          "aggs": {
            "stars_avg": {
              "avg": {
                "field": "comments.stars"
              }
            }
          }
        }
      }
    }
  }
}

Response (abridged):

{
  "aggregations": {
    "comments_path": {
      "doc_count": 2,
      "group_by_comments_date": {
        "buckets": [
          {
            "key_as_string": "2019-10-29",
            "key": 1572307200000,
            "doc_count": 1,
            "stars_avg": {
              "value": 4
            }
          },
          {
            "key_as_string": "2019-10-30",
            "key": 1572393600000,
            "doc_count": 1,
            "stars_avg": {
              "value": 5
            }
          }
        ]
      }
    }
  }
}
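The same per-day average can be checked by hand with a few lines of Python. This sketch works on in-memory copies of the two sample comments, and the function name is hypothetical; it mirrors the bucketing by date followed by an average of stars.

```python
from collections import defaultdict

# The two sample comments, reduced to the fields used by the aggregation.
comments = [
    {"stars": 4, "date": "2019-10-29"},
    {"stars": 5, "date": "2019-10-30"},
]


def average_stars_per_day(comments):
    """Bucket comments by date, then average the stars in each bucket."""
    buckets = defaultdict(list)
    for comment in comments:
        buckets[comment["date"]].append(comment["stars"])
    return {
        day: sum(stars) / len(stars)
        for day, stars in sorted(buckets.items())
    }


daily = average_stars_per_day(comments)
```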

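Summary

This article used real-world cases to give a quick tour of the join-style queries commonly used in real projects and of nested objects. These techniques are of practical value and worth getting to know.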
