Exploring the ClickHouse Projection Feature

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"At a ClickHouse meetup earlier this year, the Kuaishou team shared how they use ClickHouse projections in production. The talk covered how projections work, how to use them, and performance test results. According to those results, projections improved query performance by roughly two orders of magnitude: queries that used to take minutes return in seconds, and queries that took seconds return in milliseconds, which noticeably improves the user experience.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"After watching the Kuaishou talk on ClickHouse projections, a few questions came to mind:","attrs":{}}]},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"What problems did ClickHouse have before projections existed?","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"How do ClickHouse projections solve them?","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"Which scenarios are ClickHouse projections suited to?","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":4,"align":null,"origin":null},"content":[{"type":"text","text":"ClickHouse 
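projections: what should you watch out for?","attrs":{}}]}]}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"What problems did ClickHouse have before projections existed?","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As an OLAP engine, ClickHouse sits at the top layer of the data platform and serves platform users directly. Query performance therefore determines the user experience.","attrs":{}}]},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"ClickHouse query performance is already excellent, but very large data volumes still cause problems. ClickHouse is an MPP analytical database that computes in memory; unlike Spark, Hive, or MapReduce, by default it does not spill intermediate data to disk. Data is loaded into memory during a query, so if memory is insufficient the query fails, which can also affect the stability of the ClickHouse cluster.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"Users tend to have fixed query habits. For example, they open certain dashboards on a schedule every day. These dashboards involve statistics over the full data set and complex query logic. Compared with other queries, they can be classed as outlier queries: they may fail because of memory limits, or run so long because of complex computation that they slow down queries from other users on the platform.","attrs":{}}]}]}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"How do ClickHouse projections solve them?","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In the OLAP field, engines are mainly divided by data model into ROLAP (Relational OLAP) and MOLAP (Multidimensional OLAP). ROLAP represents data as a two-dimensional relational model, similar to a relational database; it is expressive and exposes a SQL interface. MOLAP physically stores the multidimensional data used for analysis as multidimensional arrays, forming a “cube”. Dimension attribute values map to array indices or index ranges, and aggregated values are stored in the array cells. Pre-aggregation speeds up queries, but the data model is less flexible.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"ClickHouse is a typical ROLAP engine, and its purely columnar storage makes single-table query performance almost unrivaled. The name projection comes from Vertica, and a projection is roughly equivalent to a traditional materialized view. It borrows the pre-aggregation idea from MOLAP: at write time, ClickHouse evaluates the projection 
definition's expression over the incoming data and writes the resulting aggregates together with the raw data. At query time, if analysis shows that the SQL can be answered from the aggregated data, ClickHouse reads the aggregates directly, which reduces computation and avoids the memory problems caused by data volume.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In terms of underlying storage, a projection is an extension of the data under the part directory and can be thought of as a form of query index.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Judging from the core write-path code (ClickHouse version 21.7), multiple projections are stored as multiple subdirectories under the part directory, and each projection directory holds data aggregated from the raw data. Projection writes therefore happen in step with raw data writes, and only data written after the projection is created gets materialized, which keeps the data consistent.","attrs":{}}]},{"type":"codeblock","attrs":{"lang":"cpp"},"content":[{"type":"text","text":"// MergeTreeDataWriter.cpp, line 390\n\n// If projections are configured, add each projection part to new_data_part.\nif (metadata_snapshot->hasProjections())\n{\n    for (const auto & projection : metadata_snapshot->getProjections())\n    {\n        /// 1. Build the execution plan for the projection query.\n        /// 2. Use the current Block as input and compute the aggregate result.\n        /// 3. Obtain the resulting data stream.\n        auto in = InterpreterSelectQuery(\n                      projection.query_ast,\n                      context,\n                      Pipe(std::make_shared<SourceFromSingleChunk>(block, Chunk(block.getColumns(), block.rows()))),\n                      SelectQueryOptions{\n                          projection.type == ProjectionDescription::Type::Normal ? QueryProcessingStage::FetchColumns : QueryProcessingStage::WithMergeableState})\n                      .execute()\n                      .getInputStream();\n        in = std::make_shared<SquashingBlockInputStream>(in, block.rows(), std::numeric_limits<UInt64>::max());\n        in->readPrefix();\n        // 4. Read the block computed by the projection.\n        auto projection_block = in->read();\n        if (in->read())\n            throw Exception(\"Projection cannot grow block rows\", ErrorCodes::LOGICAL_ERROR);\n        in->readSuffix();\n        if (projection_block.rows())\n        {\n            // 5. 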
Add the aggregated data (.proj) to new_data_part.\n            new_data_part->addProjectionPart(projection.name, writeProjectionPart(projection_block, projection, new_data_part.get()));\n        }\n    }\n}","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In the file system, p2.proj is the data directory of projection p2 inside the data part. Within it, the grouping columns and aggregate function states are stored as separate column files.","attrs":{}}]},{"type":"codeblock","attrs":{"lang":null},"content":[{"type":"text","text":"\n├── dim1.bin\n├── dim1.mrk2\n├── dim2.bin\n├── dim2.mrk2\n├── dim3.bin\n├── dim3.mrk2\n├── event_key.bin\n├── event_key.mrk2\n├── event_time.bin\n├── event_time.mrk2\n├── p2.proj\n│   ├── checksums.txt\n│   ├── columns.txt\n│   ├── count%28%29.bin\n│   ├── count%28%29.mrk2\n│   ├── count.txt\n│   ├── default_compression_codec.txt\n│   ├── dim3.bin\n│   ├── dim3.mrk2\n│   ├── groupBitmap%28user%29.bin\n│   ├── groupBitmap%28user%29.mrk2\n│   └── primary.idx","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Which scenarios are ClickHouse projections suited to?","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To explore which scenarios suit projections, I prepared a typical user-behavior data set of 100 million rows. It uses an event model that records which events each user performed and the dimensions attached to each event.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For the dimensions, dim1 and dim2 are regular dimensions with 10 distinct values, while dim3 is a high-cardinality dimension with 100,000 distinct values.","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/4a/4a16e1e8d3c639e5597218b251fa72e2.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"embedcomp","attrs":{"type":"table","data":{"content":"
user
Unique user identifier
event_key
Event key (event identifier)
event_time
Event time
dim1
Regular dimension
dim2
Regular dimension
dim3
High-cardinality dimension
"}}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"How do you build a projection for a table?","attrs":{}}]},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"Specify one or more projection definitions when creating the table. A projection is a basic SELECT statement; the FROM clause can be omitted, since it defaults to the source table.","attrs":{}}]}]}]},{"type":"codeblock","attrs":{"lang":"sql"},"content":[{"type":"text","text":"CREATE TABLE event_projection1\n(\n    `event_key` String,\n    `user` UInt32,\n    `event_time` DateTime64(3, 'Asia/Shanghai'),\n    `dim1` String,\n    `dim2` String,\n    `dim3` String,\n    PROJECTION p1\n    (\n        SELECT\n            groupBitmap(user),\n            count(1)\n        GROUP BY dim1\n    )\n)\nENGINE = MergeTree()\nORDER BY (event_key, user, event_time)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2. 
Add a projection definition with an ALTER TABLE statement","attrs":{}}]},{"type":"codeblock","attrs":{"lang":"sql"},"content":[{"type":"text","text":"ALTER TABLE event_projection1\n    ADD PROJECTION p2\n    (\n        SELECT\n            count(1),\n            groupBitmap(user)\n        GROUP BY dim1, dim3\n    )","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"How must a query be written to hit a projection?","attrs":{}}]},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"The SELECT expressions must be a subset of the SELECT expressions in the projection definition.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"The GROUP BY clause must be a subset of the GROUP BY clause in the projection definition.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"The columns used in the WHERE clause must be a subset of the GROUP BY columns in the projection definition.","attrs":{}}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"How can you tell whether a projection was hit?","attrs":{}}]},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"Inspect the execution plan with EXPLAIN; ReadFromStorage (MergeTree(with projection)) 
indicates that the projection was hit","attrs":{}}]}]}]},{"type":"codeblock","attrs":{"lang":null},"content":[{"type":"text","text":"EXPLAIN SQL:\nexplain actions=1 select dim1, count(1) from event_projection group by dim1\n\n\nExecution plan:\nExpression ((Projection + Before ORDER BY))\nActions: INPUT :: 0 -> dim1 String : 0\n         INPUT :: 1 -> count() UInt64 : 1\nPositions: 0 1\n  SettingQuotaAndLimits (Set limits and quota after reading from storage)\n    ReadFromStorage (MergeTree(with projection))","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2. Key lines in the ClickHouse query log","attrs":{}}]},{"type":"codeblock","attrs":{"lang":null},"content":[{"type":"text","text":"The query hit projection p:\n(SelectExecutor): Choose aggregate projection p\n(SelectExecutor): projection required columns: dim1, count()\n(SelectExecutor): Reading approx. 63 rows with 4 streams","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"How well do the queries perform?","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/db/dba6396d038dcc95068030b160a618fe.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"embedcomp","attrs":{"type":"table","data":{"content":"
Projection definition
Query time
Storage
Insert time
No projection
5.347s
650M
7min
Aggregated by dim1
0.018s
654M
12min
Aggregated by (dim1 + dim3)
0.319s
923M
20min
"}}},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"Hitting a projection improves query performance dramatically compared with not hitting one.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"Building a projection adds some overhead to storage and data insertion.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"When a high-cardinality dimension is included in the projection, the query is roughly 18 times slower than with the dim1-only projection (0.319 s versus 0.018 s), and storage and insert time incur even more overhead.","attrs":{}}]}]}]},{"type":"embedcomp","attrs":{"type":"table","data":{"content":"
Scenario: performance gains compared across aggregate functions
"}}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/b0/b0143afca11ac714832d7f0db46d2d66.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"embedcomp","attrs":{"type":"table","data":{"content":"
Aggregate function
No projection
Regular dimension aggregation
High-cardinality dimension aggregation
count(1)
5.347s
0.018s
0.319s
groupBitmap(user)
7.936s
0.040s
5.840s
"}}},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"Under the same conditions, groupBitmap benefits less from projections than count does.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"With a high-cardinality dimension, groupBitmap performs almost the same whether or not the projection is hit (5.840 s versus 7.936 s), while still paying extra storage and compute cost.","attrs":{}}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Taken together, these tests show that high-cardinality dimensions do not suit projections: the query speedup is limited and the extra overhead is considerable, so including high-cardinality dimensions when building a projection is not recommended.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"ClickHouse projections: what should you watch out for?","attrs":{}}]},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Extra storage overhead","attrs":{}}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As noted above, each projection is stored in its own directory under the part directory, and that directory holds aggregates computed from the raw data. Projection data can be thought of as an aggregate table: different grouping dimensions give different degrees of aggregation, so the storage overhead of a projection varies accordingly.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"2. 
Slower data writes","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The source code shows that projections are written as part of the same process as the raw data. Every part write computes aggregates from the raw data Block according to the projection definition, which adds write overhead, lengthens write time, and reduces data freshness.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"3. Historical data is not materialized automatically","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Projections are stored at part granularity and stay consistent with data writes, so only data inserted after the projection is created gets materialized. In addition, merging parts also merges their projections; if the projection definitions differ between parts, the part merge fails. A projection materialization operation can bring the projection data in all parts into line.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Projection materialization: projections are computed from raw data blocks, so large parts can easily run into memory problems during computation. One option is to build an INSERT SELECT pipeline that mimics the arrival of new data; it produces many small temporary parts whose projections are then merged in multiple stages.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"4. 
Too many parts can prevent a projection from being hit","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"One condition for a query to hit a projection is that more than 50% of the parts contain the projection. In some scenarios, frequent writes create many small parts; the larger part count increases the denominator of the coverage ratio, so the hit condition is not met. As parts are merged and their number falls, later queries may hit the projection.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This article gives only a brief introduction to the ClickHouse projection feature along with some basic performance tests.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The tests also revealed how high-cardinality dimensions affect ClickHouse projections. Follow-up articles will look closely at the ClickHouse query pipeline and underlying storage to analyze the internal causes.","attrs":{}}]}]}
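The materialization and hit-check steps discussed in the article can be put together in one short session. This is a minimal sketch rather than a tested recipe: it reuses the event_projection1 table and the p1 and p2 projections defined in the article, and it assumes a ClickHouse 21.x server where projection use is still gated behind the allow_experimental_projection_optimization setting (newer releases may name or default this differently).

```sql
-- Assumption: ClickHouse 21.x, where projection use has to be enabled explicitly.
SET allow_experimental_projection_optimization = 1;

-- p2 was added with ALTER TABLE, so parts written before that do not contain it.
-- MATERIALIZE PROJECTION rebuilds the projection for those existing parts.
ALTER TABLE event_projection1 MATERIALIZE PROJECTION p2;

-- The SELECT list and GROUP BY are subsets of what p1 defines,
-- so this query is a candidate for being answered from p1.
EXPLAIN actions = 1
SELECT dim1, count(1), groupBitmap(user)
FROM event_projection1
GROUP BY dim1;
```

If the plan contains ReadFromStorage (MergeTree(with projection)), the projection was used. Materialization runs as a mutation over existing parts, so on large parts it may hit the memory pressure described under caveat 3.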