Choosing the Right OLAP Engine for Your Needs

Original

程序员小陶

2020-05-15 09:03

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Abstract: "},{"type":"text","text":"This article surveys the mainstream open-source OLAP engines: Hive, Spark SQL, Presto, Kylin, Impala, Druid, ClickHouse, and others. It covers each engine in turn, including its architecture, strengths and weaknesses, and typical use cases, in the hope of giving readers some useful pointers."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"PS: This is a long article, so consider bookmarking it and reading at your own pace."}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The story of OLAP goes back to 1993, when E. F. Codd coined the term and proposed twelve rules for OLAP systems:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Rule 1: The OLAP model must provide a multidimensional conceptual view"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Rule 2: Transparency"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Rule 3: Accessibility"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Rule 4: Consistent reporting performance"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Rule 5: Client/server architecture"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Rule 6: 
Generic dimensionality"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Rule 7: Dynamic sparse matrix handling"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Rule 8: Multi-user support"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Rule 9: Unrestricted cross-dimensional operations"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Rule 10: Intuitive data manipulation"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Rule 11: Flexible reporting"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Rule 12: Unlimited dimensions and aggregation levels"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Key characteristics of OLAP workloads"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The vast majority of requests are reads"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Data is always written in fairly large batches (> 1000 
rows) per write"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Data that has already been added is not modified"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Each query reads a large number of rows from the database but needs only a small subset of columns"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Tables are wide, meaning each table contains a large number of columns"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Queries are relatively infrequent (typically hundreds of queries per second per server, or fewer)"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For simple queries, latencies of around 50 ms are acceptable"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Column values are fairly small: numbers and short strings (for example, a URL of about 
60 bytes)"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"High throughput is required when processing a single query (up to billions of rows per second per server)"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Transactions are not required"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Data consistency requirements are low"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Each query involves one large table; every other table in the query is small"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The query result is much smaller than the source data; in other words, the data is filtered or aggregated so that the result fits in the memory of a single server"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Compared with OLAP, 
OLTP systems emphasize database memory efficiency, hit ratios across the various memory metrics, bind variables, concurrent operations, and transactional guarantees."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"OLAP systems, by contrast, emphasize data analysis, SQL execution time, disk I/O, and partitioning."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/e4/e4eaefea077ffbc3b8f311b535fb19dc.png","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Open-Source OLAP Engines"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The mainstream open-source OLAP engines available today include, but are not limited to, Hive, Spark SQL, Presto, Kylin, Impala, Druid, ClickHouse, and Greenplum. No single engine is perfect on data volume, flexibility, and performance all at once, so users need to choose based on their own requirements."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"italic"}],"text":"Anyone working in data engineering has most likely used several of these, if not all of them. The sections below therefore get straight to the point and assume you are familiar with big-data terminology and the surrounding ecosystem."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Hive hive.apache.org"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Hive is a data warehouse tool built on Hadoop. It maps structured data files to database tables and provides SQL query capability, translating SQL statements into MapReduce jobs for execution. Its main advantage is a low learning curve: simple MapReduce-style statistics can be implemented quickly with SQL-like statements, without writing dedicated MapReduce applications, which makes it well suited to statistical analysis in a data warehouse."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Spark SQL spark.apache.org/sql"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Spark SQL grew out of Shark. It integrates SQL queries seamlessly with Spark programs and lets you query structured data as Spark RDDs. Spark SQL continues to evolve as part of the Spark ecosystem; it is no longer constrained by Hive and simply remains compatible with it."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A few notes:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1) Spark SQL is not limited to SQL;"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2) It can read data from Hive, JSON, Parquet, and other sources;"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"3) SQL is just one feature of Spark SQL;"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"4) The name Spark SQL is not a particularly apt one;"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Spark 
SQL fits into the overall Spark stack as shown below"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/74/74e9f18d07596633327d47bdd8266353.png","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/c8/c8eb394b557e37545e643961d60c9c48.jpeg","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Reading the diagram, the architecture has three parts: the first is the frontend, the second is the backend, and the third is Catalyst in the middle, which is the core of the whole architecture."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To summarize the architecture's workflow, the following is quoted from Zhihu user @ysiwgtus 
as follows:"}]},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}}]}]},{"type":"codeblock","attrs":{"lang":null},"content":[{"type":"text","text":"1. First, the frontend. The frontend offers several different access paths."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":null},"content":[{"type":"text","text":"1) The typical path is Hive. What arrives from Hive is a SQL statement, and a SQL statement is just a string. How does Catalyst parse that string, or in other words, how does a SQL statement get translated into a Spark job? It has to be parsed into an abstract syntax tree (AST). This is the first access path."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":null},"content":[{"type":"text","text":"2) The second path is a Spark application, that is, programmatic access. In code you can use SQL, or the DataFrame or Dataset API."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":null},"content":[{"type":"text","text":"3) The third path is Streaming SQL, which combines streaming with SQL."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":null},"content":[{"type":"text","text":"2. Next, Catalyst."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":null},"content":[{"type":"text","text":"1) Whichever of the three frontend paths a query arrives through, the first thing produced is an Unresolved Logical Plan, that is, an execution plan that has not been fully resolved yet. This plan is combined with the metadata, namely the table schema information in the metastore, to produce a Logical Plan."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":null},"content":[{"type":"text","text":"2) This logical plan is still in its rawest form. A series of optimizations is then applied through many rules (the Optimization Rules in the diagram), producing the optimized logical plan, shown in the diagram as the Optimized Logical 
Plan."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":null},"content":[{"type":"text","text":"3) Only after the logical plan has been generated is the physical plan produced, which is the Spark job itself."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":null},"content":[{"type":"text","text":"Once your SQL statement has been parsed into an abstract syntax tree, everything that follows is handed to Catalyst, including generating the logical plan and optimizing it. Compare this with Shark, where parsing and the generation and optimization of the logical plan all depended on Hive. That is the key difference between Spark SQL and Hive: from the abstract syntax tree onward (after the AST in the diagram), Catalyst in Spark SQL takes over completely, generates the physical plan, and finally submits it to the production cluster to run."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":null},"content":[{"type":"text","text":"3. That is the overall flow of the Spark SQL architecture. The main stages are the syntax tree, the logical plan, the optimized logical plan, and the physical plan. If you are familiar with how SQL is executed, or know how Hive translates a SQL statement into a MapReduce job, you will see that the whole flow is very similar. Among the many SQL-on-Hadoop frameworks, anything SQL-based follows roughly the same path: the SQL is parsed into an abstract syntax tree, then turned into a logical plan, the logical plan is optimized, a physical plan is produced and optimized in turn, and finally a job for the corresponding framework is generated, whether a MapReduce job or a Spark job, and submitted to the corresponding cluster to run."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Presto prestodb.io"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/58/5849f91a645e755e39e511ccf5e3f729.png","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Presto supports standard ANSI SQL, including complex queries, aggregations, joins, and windowing (window 
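functions). Positioned as a replacement for Hive and Pig (both of which query HDFS data through pipelines of MapReduce jobs), Presto stores no data itself but can connect to many kinds of data sources and supports joined queries across them."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Presto does not use MapReduce; it relies on a custom query and execution engine instead. All of its query processing happens in memory, which is a major reason for its high performance. In this respect Presto closely resembles Spark SQL, and this is what most fundamentally sets it apart from Hive."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Because Presto computes in memory while Hive reads and writes on disk, Presto is much faster than Hive. The flip side of in-memory computation is that joining several large tables can easily trigger out-of-memory errors."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Apache Kylin™ kylin.apache.org/cn"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Apache Kylin™ is an open-source, distributed analytical data warehouse that provides a SQL query interface and multidimensional analysis (OLAP) on top of Hadoop/Spark to support extremely large datasets. It was originally developed by eBay and contributed to the open-source community. It can query huge tables with sub-second latency."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Kylin integrates with a range of data visualization tools, such as Tableau and Power BI, so that users can point their BI tools at Hadoop 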
data for analysis."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/bd/bdc0ece1a1fe743883f42b584f3bfd65.png","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A quick walkthrough of the architecture diagram above: Hive or Kafka serves as the data source and holds the actual tables. Kylin abstracts that data and builds Cubes through its build engine. HBase serves as the data store that holds the Cubes. Because reading HBase directly is fairly complex, Kylin offers a query form close to SQL and HQL that covers basic data access needs. Externally it exposes a REST API and JDBC/ODBC for convenient access."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/7d/7d7d5c93757f93a0a7aa2e6c725b149e.png","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Kylin is itself a MOLAP system. Its multidimensional cube (MOLAP Cube) design lets users define a data model over datasets of more than ten billion records and build cubes that pre-aggregate the data. As I understand it, the cube design is a matter of "},{"type":"text","marks":[{"type":"strong"}],"text":"trading space for time"},{"type":"text","text":": you define a set of dimensions, then precompute and store the result for every combination of those dimensions. With N dimensions there are 2^N combinations, so it is best to keep the number of dimensions under control up front; storage grows explosively as dimensions are added, with potentially disastrous consequences."}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Impala 
impala.apache.org"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Impala was introduced by Cloudera and provides high-performance, low-latency interactive SQL queries over data in HDFS and HBase. Impala uses Hive's metadata and computes entirely in memory. It is the preferred engine on the CDH platform for real-time query and analysis of PB-scale data."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/92/920438490db3a3b609aab38bac0a6408.png","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/7a/7a8dd5afb6e780980e3e44891013de28.jpeg","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Execution flow"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1. Computes in memory and supports interactive, real-time query and analysis over PB-scale data."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2. Reads HDFS and HBase data directly without translating queries into MapReduce, which greatly reduces latency."}]},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Impala does no MapReduce batch processing; instead it uses a distributed query engine similar to those in commercial parallel relational databases (made up of the Query 
Planner, Query Coordinator, and Query Exec Engine)."}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"3. Written in C++, with LLVM used to compile and run queries"}]},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Optimized for the hardware at a low level. LLVM is a compiler that is stable and efficient."}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"4. Compatible with HiveQL"}]},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Supports Hive's basic queries, but some of Hive's complex structures are not supported."}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"5. Has data warehouse characteristics and can analyze Hive data directly"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"6. Supports data locality"}]},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Data locality: no data movement is needed, which reduces data transfer."}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"7. Supports columnar storage"}]},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Can integrate with HBase, because Hive can integrate with HBase."}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"8. Supports remote access over JDBC/ODBC"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Impala's weaknesses"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1. Heavy dependence on memory"}]},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Computation happens only in memory. The official recommendation is 128 GB (64 GB is generally adequate). One optimization: give the nodes (servers) that aggregate results from the other nodes more memory, while non-aggregating nodes can have less."}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2. Open source, but written in C++"}]},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Developers who work mainly in Java may not be very familiar with C++."}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"3. Depends entirely on Hive"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"4. In practice, performance degrades severely once the number of partitions exceeds 10,000"}]},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Periodically delete unnecessary partitions to keep the partition count from growing too large."}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"5. Less stable than Hive"}]},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Because computation happens entirely in memory, problems arise when memory runs short; Hive can fall back to external storage when memory is insufficient."}]}]},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Impala provides no support for serialization and deserialization."}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Impala can only read text files, not custom binary files."}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Whenever new records or files are added to a data directory in HDFS, the table needs to be refreshed."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Druid 
druid.apache.org"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Mention Druid and many people first think of Alibaba's Druid database connection pool. The Druid covered here is something else: a solution for big-data scenarios, a BI/OLAP tool for interactive, real-time presentation of complex, massive datasets."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/b1/b1517e5f284670ff54bb6b9c6ee47b1e.png","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Druid follows a Lambda architecture, split into a real-time layer (Overlord, MiddleManager) and a batch layer (Broker and Historical)."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For more on the architecture, see the official documentation or the article 《Druid在有讚的實踐》 (Druid in Practice at Youzan)."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Druid is now widely used at companies in China and abroad, such as Alibaba, Didi, Zhihu, 360, eBay, and Hulu. Druid owes its place in the OLAP family mainly to its powerful MPP 
architecture. Beyond that, it relies on four important techniques: pre-aggregation, columnar storage, dictionary encoding, and bitmap indexes."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Common use cases (https://druid.apache.org/use-cases):"}]},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Clickstream analytics (web and mobile analytics)"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Risk/fraud analysis"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Network telemetry analytics (network performance monitoring)"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Server metrics storage"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Supply chain analytics (manufacturing metrics)"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Application performance metrics"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Business intelligence / 
OLAP"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Druid's core design combines ideas from data warehouses, time-series databases, and search systems to create a unified system for real-time analytics across a broad range of use cases. Druid merges key characteristics of each of these three systems into its ingestion layer, storage format, query layer, and core architecture."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"(https://druid.apache.org/technology)"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"What kinds of workloads suit Druid?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The recommendations are as follows:"}]},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Time-series data: Druid can be thought of as a time-series database, and all data must have a time field."}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Real-time ingestion that can tolerate data loss (tranquility): tranquility currently carries a risk of losing data, so the recommendation is to use real-time and offline ingestion together. Real-time ingestion handles the current day's data, and an offline job the next day overwrites that day's data in full to guarantee completeness."}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"OLAP queries rather than OLTP queries: Druid's query concurrency is limited, so it is not suited to OLTP queries."}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Approximate distinct counts: at present, all of Druid's distinct counting is approximate."}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"No join operations: Druid suits star-schema data and does not support joins."}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"No in-place updates to data, only overwrites at 
segment 粒度進行覆蓋：由於時序化數據的特點，Druid 不支持數據的更新。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Clickhouse clickhouse.tech"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Clickhouse 由俄羅斯 yandex 公司開發。專爲在線數據分析而設計。Yandex是俄羅斯搜索引擎公司。官方提供的文檔表名，ClickHouse 日處理記錄數\"十億級\"。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這個列式存儲數據庫的跑分要超過很多流行的商業MPP數據庫軟件，例如Vertica。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"特性："}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1.真正的面向列的DBMS"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2.數據壓縮"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"3.磁盤存儲的數據"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"覽量和會話。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"4.多核並行處理"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"5.在多個服務器上分佈式處理"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"6.SQL支持"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"7.向量化引擎"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[
{"type":"text","text":"8.實時數據更新"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"9.索引"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"10.支持在線查詢"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"11.支持近似計算"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"12.數據複製和對數據完整性的支持。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"使用ClickHouse也有其本身的限制，包括："}]},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"缺少高頻率，低延遲的修改或刪除已存在數據的能力。僅能用於批量刪除或修改數據。"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"沒有完整的事務支持"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"不支持二級索引"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"有限的SQL支持，join實現與衆不同"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"不支持窗口功能"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"元數據管理需要人工干預維護"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type"
:"strong"}],"text":"No single OLAP system today can satisfy the query requirements of every scenario. The underlying reason is that no system can be perfect in data volume, performance, and flexibility all at once; every system has to trade these three off against one another in its design."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"References"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"https://xie.infoq.cn/article/77ec0d231d36c963a8e6d1630"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"https://www.jianshu.com/p/26c18e6a30c3"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"https://www.jianshu.com/p/4d0e0b42a3b0"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"https://www.jianshu.com/p/257ff24db397"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"https://www.cnblogs.com/tgzhu/p/6033373.html"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"https://zhuanlan.zhihu.com/p/29385628"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"https://blog.csdn.net/yongshenghuang/article/details/84925941"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"https://www.jianshu.com/p/b5c85cadb362"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"https://clickhouse.yandex/docs/zh/development/architecture/"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"http://www.clickhouse.com.cn"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"https://www.jianshu.com/p/a5bf490247ea"}]},{"type":"paragraph","attrs":{"indent"
:0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"https://blog.csdn.net/weixin_34273481/article/details/89238947"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"https://blog.csdn.net/warren288/article/details/80629909"}]}]}
