Revisiting the Data Middle Platform Through Digital Transformation (Part 3): A Tour of the History of Big Data Architecture

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Editor's note: The series “Revisiting the Data Middle Platform Through Digital Transformation” will run to roughly 6-8 installments. Drawing on years of hands-on experience with data middle platforms, the author summarizes data architecture and BI knowledge and shares lessons from implementing industrial internet projects. This article is the third in the series."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"In the two previous articles, “"},{"type":"link","attrs":{"href":"https:\/\/www.infoq.cn\/article\/qPl0gtOYYxOhQeSqXpzz","title":null,"type":null},"content":[{"type":"text","text":"Some Observations on Digital Transformation"}],"marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}]},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"” and “"},{"type":"link","attrs":{"href":"https:\/\/www.infoq.cn\/article\/zTUxMT25uxRSKe77ET4i","title":null,"type":null},"content":[{"type":"text","text":"The Data Middle Platform in the Uniqueness Theorem"}],"marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}]},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"”, I discussed problems in how the data middle platform has developed: the concept has evolved too quickly, there is information overload, and the broad and narrow definitions of the data middle platform differ. All of this falls within the scope of data architecture, so in this article I summarize and share it from the perspective of how big data architecture has evolved. (Some of the material carries over from a series I wrote in 2015, “From Data Warehouse to Big Data: How Data Platforms Evolved Over 25 Years”, also published as “"},{"type":"link","attrs":{"href":"https:\/\/www.infoq.cn\/news\/the-development-history-of-big-data-platform","title":null,"type":null},"content":[{"type":"text","text":"The History of Big Data Platforms as I Experienced It"}],"marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}]},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"”.) It covers three topics:"}]},
{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"The evolution from data warehouse architecture to big data architecture: three eras and nine architectures in total"}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"A big data technology stack that I compiled myself"}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Data platforms built on Data Mesh, the latest generation of architecture"}]}]}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Data Platforms Are Quietly Changing"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Looking at how companies operate today, their focus has shifted from management reporting and analysis to fine-grained, data-driven operations. In pursuing fine-grained operations, companies also face pressure from innovation, growth, and cutthroat competition. As business volume and data volume grow, the granularity people need has moved from highly aggregated summaries to process-level detail data, and from T+1 data to near-real-time data."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"The sheer volume of data requests, including a flood of ad hoc ones, leaves analysts and data developers exhausted, and these roles have become a bottleneck on company resources. Traditional BI tools such as reports and OLAP cannot satisfy the individualized data needs of the internet industry either. People began to think about how to turn recurring requests into self-service or semi-self-service products for end users, so that they can get data quickly and analyze it themselves. Through "},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"data products of various kinds"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":", data delivers its value to the outside in a more targeted way."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"(A side note on data products: a data product emerges when distilled metrics, analysis methods (models), usage workflows, and tools are combined into an organic whole. As data middle platforms and data platforms entered a phase of rapid iteration, the terms “data product” and “data product manager” gained currency, leading to today's strong demand for data product managers at major companies. The methodologies for both are also gradually becoming systematic and concrete.)"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Over the past decade or so, many factors have driven the evolution of data warehouses, data platforms, data middle platforms, and data lakes, for example:"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Rapidly iterating business models and expanding user bases brought a surge in data volume, and new big data processing technologies pushed things further. Another factor is the data products built on top of the data middle platform: tool-style data product suites, self-service data products, and platform-style data products. As these data-building capabilities spread, more people take part in building the data middle platform; for instance, SQL-literate users and analysts now build directly on the data platform more often. Some capabilities that originally belonged to the data middle platform are also gradually being moved upstream into business systems."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"The Evolution of Big Data Architecture in One Diagram"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Data warehousing developed abroad for many years and reached China around 1998-1999. Once in China, it spawned many terms of its own, such as data warehouse, data center, data platform, data middle platform, and data lake. From a big data architecture perspective, this history can be summarized as "},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"three eras and nine architectures, "},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"in which "},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"the first four generations are architectures from the traditional data warehouse era and the last five are big data architectures"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Two points link the old generations to the new:"}]},
{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"First, the third-generation architecture of traditional industries and the first-generation big data architecture are essentially alike in form; the latter can be seen as the former reimplemented with big data processing technology."}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Second, the real-time part of the traditional fourth-generation architecture has since been realized anew using modern real-time big data techniques."}]}]}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"As shown in the figure below:"}]},
{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/3f\/61\/3f6c33c8c19b66a16d381b9ce87f9061.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"The three eras are "},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"the non-internet, internet, and mobile internet eras"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":". Business characteristics, data volumes, and data types differ from era to era, so the data architectures naturally differ markedly as well."}]},
{"type":"embedcomp","attrs":{"type":"table","data":{"content":"

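Industry domain

Non-internet

Internet

Mobile internet

Data sources (relative to the data platform)

Structured databases of various kinds (DB systems), structured text, Excel spreadsheets, and a small number of Word documents

Web, custom, and system logs; structured DB data of various kinds; long text; video. Mainly from web pages

Everything in the internet era, plus large volumes of location data and data from automated sensors, embedded devices, and automated equipment

Information contained in the data

CRM customer information, transactional ERP\/MRPII data, funds and accounting data, and so on

Traditional enterprise data, plus users' click logs, social data, multimedia, search, and email data

Traditional internet data, plus GPS data and data collected by wearables and sensors of various kinds, including automated sensors

Data structure

Almost entirely structured

Mostly unstructured

Mostly unstructured

Data storage\/volume

Mainly structured DB storage; from a few hundred MB to the hundred-GB range

Files, DBs, and streams; from TB to PB

Files, streams, DBs, and unstructured stores; from TB to PB

Generation cycle

Slow: measured in days or even weeks

Measured in seconds or less

Measured in seconds or less

Capture and reconstruction of consumer behavior

Coarse-grained

Fairly fine-grained

Very fine-grained

Data value

Valid over the long term

Decays over time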

Decays rapidly over time"}}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Table source: “"},{"type":"link","attrs":{"href":"https:\/\/www.infoq.cn\/news\/the-development-history-of-big-data-platform","title":null,"type":null},"content":[{"type":"text","text":"The History of Big Data Platforms as I Experienced It"}],"marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}]},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"”"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"A Summary of Data Architecture, from Data to Big Data"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"I loosely abstract the development of the traditional data warehouse into "},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"five eras and four architectures"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" (perhaps not with full rigor)."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"The five eras are roughly delimited by "},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"the stages in which two data warehousing masters, Ralph Kimball and Bill Inmon, clashed over how data warehouses should be built:"}]},
{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Before about 1991, data warehouses were mostly implemented as enterprise-wide integration efforts."}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Around 1992, enterprises mostly implemented data warehouses as an EDW. Bill Inmon published “Building the Data Warehouse”, which clearly laid out the EDW architecture and how to implement it."}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"1994-1996 was the data mart era, when another approach built on dimensional modeling and data marts gained ground. One of its leading figures, Dr. Ralph Kimball, published his first book, “The Data Warehouse Toolkit”, which defined data marts and dimensional modeling very clearly."}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Around 1996-1997 came the era of competition between the two architectures."}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Around 1998-2001 came the era of convergence."}]}]}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"The key events above feature two classic figures, Bill Inmon and Ralph Kimball, both of whom count as founding fathers of the data field. Many design ideas in today's data middle platforms and data platforms still rest on the methodologies they proposed in the 1990s."}]},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#1a1a1a","name":"user"}}],"text":"The Classic Bill Inmon vs. "},{"type":"text","text":"Ralph Kimball"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#1a1a1a","name":"user"}}],"text":" Debate"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Bill Inmon's approach "},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"follows a top-down construction principle, "},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"while Ralph Kimball proposed "},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"a bottom-up construction principle. "},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Supporters of the two approaches argued on many occasions over which methodology was superior."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"The key points of the two masters' debate over construction methods:"}]},
{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Bill Inmon's methodology holds that data marts alone are not enough and that construction must start from an enterprise-level data model. An enterprise-level model comes with a fairly complete division into business subject areas and logical models, so when solving a problem for a particular business unit it is easy to pick different data paths and assemble a data mart."}]}]}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"After data warehousing reached China around the turn of the millennium, the major implementation vendors all followed this principle, which gradually evolved into the data layering schemes familiar in today's data architectures:"}]},
{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"ODS -> DW -> ST -> applications"}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"ODS -> DWD -> DW -> DM -> applications"}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"ODS -> DWD -> DWB -> DWS -> applications"}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"ODS -> DWD -> DW -> ST (ADM) -> applications"}]}]}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"In the previous decade, a handful of specialist vendors implemented data warehouses and data platforms in China: IBM, Teradata, Accenture, 菲奈特 (acquired by 東南), AsiaInfo, and others. Working from the characteristics of their solutions and the customers they served in their own sectors, these vendors gradually gave the ODS, EDW, DM, and other data layers a variety of functions and meanings during implementation."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"The data model layering everyone knows today is essentially inherited from Bill Inmon's original methodology."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"numberedlist","attrs":{"start":"2","normalizeStart":"2"},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"The leading figure of the data mart era is Ralph Kimball, whose best-known work is “The Data Warehouse Toolkit”, famous in Chinese as 《數據倉庫工具箱》. His approach to enterprise data advocates building the data warehouse bottom-up and strongly favors creating data marts: the data warehouse is the collection of data marts, and information is always stored in dimensional models."}]}]}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"This approach starts from a line of business or a department and designs subject-oriented data marts for it. As more marts for different businesses or departments go live, the enterprise can merge them as needed and gradually form an enterprise-level data warehouse. This is known as the bottom-up approach, and at the time it was the exact opposite of Bill Inmon's top-down method."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"embedcomp","attrs":{"type":"table","data":{"content":"

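Comparison

Bill Inmon's methodology

Ralph Kimball's methodology

Build time

Takes a great deal of time

Short build cycle; takes less time

Ease of maintenance

Easy to maintain

High maintenance cost

Build cost

High up-front investment; low cost later on

Lower up-front investment; each later iteration costs about as much as the initial one

Delivery cycle

Long cycle; slow to show results

Short, simple, and fast

Team required

Built by a specialist team

Built by a fairly specialist team with only a few people involved

Data integration needs

Data integration across the whole enterprise lifecycle

Data integration within a vertical business area of the enterprise

Target users

Potentially every user in the enterprise

The business department that raised the requirement

Terminology

Subject-oriented, time-variant, history-preserving, integrated

A relatively narrow snapshot of data for a specific business department; dimensional modeling, snowflake schema, star schema

Data model

Near-third-normal-form design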

Star schema, snowflake schema"}}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"As data warehousing matured through practice and iteration, the period of dispute gave way to an era of convergence. A dispute usually ends with one side giving in or with a new conclusion emerging. The dispute between Bill Inmon and Ralph Kimball produced no winner; instead, a new architecture was proposed that subsumed the other side's ideas. This was the CIF (Corporate Information Factory) architecture later put forward by Bill Inmon, which incorporated Ralph Kimball's data marts. Only then did the argument over the two implementation methodologies gradually die down."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"The Four Generations of Non-Internet Architecture"}]},
{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Generation 1: EDW Architecture"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/14\/ea\/149b79f96f58af920d853dec91df5cea.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Much of the terminology and methodology used in data work today, such as “business intelligence” and “information warehouse”, appeared between the 1960s and the 1990s. For example, the term “dimensional model” was first proposed in the 1960s by GM and Dartmouth College; “data warehouse” and “fact” were explicitly defined by Bill Inmon in the 1970s; and in the 1990s Bill Inmon's book “Building the Data Warehouse” defined how to build a data warehouse more systematically and explicitly. Put into practice, this method became the first-generation data warehouse architecture."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"In the first-generation data warehouse, a data warehouse is clearly defined as a "},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"subject-oriented"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" ("},{"type":"text","marks":[{"type":"color","attrs":{"color":"#4d4d4d","name":"user"}}],"text":"Subject Oriented"},{"type":"text","text":"), integrated, non-volatile, and time-variant collection of data used to support management decision making."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"First, a data warehouse supports decision making: it is subject-oriented and built for analytical data processing, which sets it apart from the operational databases an enterprise uses."}]}]}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"A few differences between databases and data warehouses:"}]},
{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"A database system is designed for transaction processing, that is, for record updates and transactions. Access is typically by primary key and consists of large numbers of small, atomic, isolated transactions; concurrency and recoverability are key properties, and maximum transaction throughput is the key metric. Database designs reflect these requirements."}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"A data warehouse is designed for decision support. Historical, summarized, and aggregated data matter far more than raw records, and the query load is dominated by ad hoc queries and complex queries involving joins and aggregations."}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Second, a data warehouse effectively integrates and processes multiple heterogeneous data sources, reorganizes the data by subject, and holds historical data that is rarely modified. In short: subject-oriented, integrated, stable, and time-variant."}]}]}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"The characteristics of a data warehouse:"}]},
{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"A data warehouse is subject-oriented."}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"A data warehouse is integrated. Its data comes from scattered operational data; the required data is extracted from the original sources, then processed, integrated, unified, and consolidated before it enters the warehouse."}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"A data warehouse is not updatable. It mainly supplies data for decision analysis, so the operations involved are chiefly queries."}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"A data warehouse is time-variant. Traditional relational database systems, by contrast, are well suited to formatted data and meet the needs of commercial transaction processing well, which is why they have been hugely successful in business."}]}]}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"In a nutshell, the difference between a data warehouse and a database system is the difference between OLAP and OLTP: a database supports OLTP, and a data warehouse supports OLAP."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Generation 2: The Large Data Mart Architecture"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/cd\/2d\/cd391c756255a28d035d6d193bd2712d.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"The second generation is Ralph Kimball's large data mart architecture, which can essentially be called a bus architecture. It starts from a line of business or a department and designs subject-oriented data marts for it. Built Kimball's way, a project need not take other ongoing data projects into account; it only has to meet the current department's needs quickly. The benefit is low resistance and a short path to delivery."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"However, because several projects may run in parallel, dimensions need to be conformed during the data standardization and modeling stages. A common standard must define the shared dimensions so that every data mart project follows the same rules and the marts can later be merged smoothly. "},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"For example, similar terms across the business, enumeration values from different systems, and similar business rules all need unified naming; in today's middle platforms this corresponds to things like a global unified ID."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Core principles:"}]},
{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Conformed dimensions, for integration and comprehensive support. Conformed dimensions have consistent descriptive attribute names, values, and meanings."}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Conformed facts are defined consistently; a fact that follows a different business rule is given a distinct name. Similar terms across the business, enumeration values from different systems, and similar business rules all need unified naming."}]}]},
{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Modeling approach: star schema and snowflake schema."}]}]}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Generation 3: Aggregated Dimensional Marts and the CIF 2.0 Warehouse Structure"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/fb\/f1\/fbc35ab2c6277195e013976eb52cd7f1.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/2e\/63\/2ef0aef37869946e79156244604e0063.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"CIF stands for Corporate Information Factory. (Author's note: the English article on CIF is titled “Corporate Information Factory (CIF) Overview”.) Bill Inmon held that as information resources grow in importance to an enterprise, an information processing architecture will emerge that, like a factory, can satisfy every need and request for information. The functions of this information factory include data storage and processing (of both active and dormant data) and support for data access and integration across departments and even across enterprises, while also ensuring data security."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"The CIF pattern gradually became the third-generation data warehouse architecture. Why call CIF a classic architecture? Because it sums up the two architectures described earlier while also defining the different layers of the architecture very clearly."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"For example, CIF 2.0 mainly comprises the Integration and Transformation Layer, the Operational Data Store, the Enterprise Data Warehouse, Data Marts, and the Exploration Warehouse. The Data Mart is divided into a Back Room and a Front Room: the back room handles data preparation and is called the Staging Area, while the front room handles data presentation and is called the Data Mart."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"This classic architecture is the one familiar to practitioners who entered the field between 2006 and 2012, and the data platform architectures of some internet companies today still resemble it."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Generation 4: OPDM, the Operational Real-Time Data Warehouse"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/0f\/57\/0f65445b4aa39d961b2a355ca0e3b657.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"OPDM was proposed around 2011. Strictly speaking, an OPDM, or operational data mart (warehouse), is a kind of real-time data warehouse; it is geared more toward operational data than toward querying and analyzing historical data."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Many people ask at this point what operational data is. Examples are the data in finance systems, CRM systems, marketing systems, and production systems. Some mechanism pulls this data out of its silos in real time and automatically integrates it at a given business level, so that it can support business monitoring and guidance."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"The Five Generations of Internet Big Data Processing Architecture"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"As noted at the start of this article, the third-generation architecture of traditional industries and the first-generation big data architecture are essentially alike in form; the latter is an attempt to realize the traditional third-generation architecture with big data processing technology."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"For instance, when Hadoop and Hive were just emerging, technologies such as Sybase IQ and Greenplum were used for big data processing. Later, as Hadoop and Hive, Facebook Scribe, LinkedIn Kafka, and others were progressively open-sourced, new architectural patterns suited to internet-scale big data emerged."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Subsequently, nearly a hundred more open-source big data technologies, such as TimeTunnel from Alibaba's Taobao group, further advanced big data processing architectures and technical frameworks. Later in this article I give a fairly complete data processing framework covering the technologies available to date."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Depending on the use case, data volume, and data type, big data architectures fall broadly into stream processing frameworks and batch processing frameworks, so the five generations of internet big data processing frameworks have essentially evolved around three kinds of architecture: batch, streaming, and hybrid."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Generation 1: Offline Big Data Statistical Analysis Architecture"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/09\/15\/09c7b039b9e3be45c5d692231efd2215.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"This structure closely resembles the third-generation data processing architecture, as shown below:"}]},
{"type":"embedcomp","attrs":{"type":"table","data":{"content":"

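Data stage

Third-generation architecture (traditional industries)

First-generation offline big data statistics architecture

Data sources

Mainly structured data (database data, internal office data, financial data, etc.); little or no unstructured data

Mainly structured data (database data, internal office data, financial data, etc.); unstructured data starts to grow

Data processing

Approach: mainly ETL; much of the data transformation and normalization is done before the data enters the central warehouse

Technologies: DataStage, Informatica, DTS, C, scripts, etc.

Approach: mainly ELT; mostly data collection, transfer, and consolidation, with little normalization or transformation up front. Data is first consolidated into the central store and processed afterwards

Technologies: Kafka, DataX, etc.

Central data processing

Technologies: Oracle, DB2, Sybase IQ, Teradata

Data models: dimensional model, near-third-normal-form model

Technologies: Hadoop, Hive, Spark

Data models: dimensional model, wide tables, etc.

Data applications

Mature packaged solutions: reports, OLAP, online analysis, etc.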

Fewer packaged software products; more open-source technology and in-house products"}}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"This generation of architecture is positioned to "},{"type":"text","marks":[{"type":"italic"},{"type":"color","attrs":{"color":"#f5222d","name":"user"}},{"type":"strong"}],"text":"solve the problems of traditional BI"},{"type":"text","text":". Put simply, the data analysis business itself has not changed, but data volume and performance problems have made the existing system unusable and in need of an upgrade; this kind of architecture exists to solve that problem."}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Second generation: streaming architecture"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/3a\/e4\/3ae0520b07f8d1e0ayyfc137384bf3e4.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Streaming has a very wide range of application scenarios. 
Search, recommendation, and feeds, for example, all run online and demand fresher data, so computation and consumption naturally happen at the same time."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"As the business grows more complex, so does the data processing logic: cross-dimensional analysis, joins, clustering, and more algorithms or machine learning. These scenarios fall into two categories: event streams and continuous computation."}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Event streams: the business is relatively fixed, and only the data keeps changing under the business rules."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Continuous computation: suited to scenarios such as shopping websites."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Compared with the first-generation big data processing framework, the stream processing framework removes the original ETL step. Data is processed as it flows through the data channel, and the results are pushed to data consumers as messages."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"The stream processing framework abandons offline batch processing and stores very little data, so retention periods are very short. Scenarios that need historical data, or complex computations involving historical data, are therefore hard to implement."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" 
"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Some setups now periodically write streaming results to the batch storage area. When a scenario needs historical data, the streaming framework loads the saved historical results as updates and processes them further."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Third generation: Lambda big data architecture"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/a2\/3e\/a2eab814f3b5223127818309db3bb93e.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"The Lambda architecture was proposed by Twitter engineer Nathan Marz and is a classic, widely implemented architecture. Other big data processing architectures that appeared later are likewise optimizations or upgrades of Lambda."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"The Lambda architecture has two data paths: one handles batch, offline data, and the other handles real-time stream processing
."}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"The batch offline path is mostly built with classic big data statistical analysis methodology. While ensuring data consistency and completeness, it also layers the data according to different application scenarios."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"The real-time streaming path mainly performs incremental computation and also runs some machine learning models. To keep data consistent, the real-time results are merged with the batch results."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"The Lambda architecture consists of three main parts: the batch layer, the speed (stream processing) layer, and the serving layer."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Batch layer: one of the core layers of the Lambda architecture. It processes incoming data in batches and stores it in the corresponding data models. The subject areas and modeling methodology of this layer are inherited from offline, analytics-oriented big data warehousing, and the data is usually divided into the classic ODS, DWD, DWB, and ST\/ADM layers."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Speed layer: 
the other core layer of the Lambda architecture. It addresses event-stream and continuous-computation problems where data must be computed and consumed at the same time and joined across dimensions. Its results are finally merged with those of the batch layer."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Serving layer: the last layer of the Lambda architecture. It takes the results of batch and stream processing and provides users with a unified query view."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"In the years since the Lambda architecture appeared, its strengths and weaknesses have become clear. Its strengths are stability and performance: ETL computation runs at night and can reuse part of the real-time computing resources. Its weakness is that, because the two data flows must be merged, every algorithm has to be implemented twice, once for batch and once for real-time computation, and merging the two results in the end is complicated."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Kappa big data architecture"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" 
"}]},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/7a\/66\/7a002da9a1d803097ca5aff38eda7f66.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"The Lambda architecture requires maintaining two sets of code. To solve this problem, Jay Kreps of LinkedIn proposed the Kappa architecture, drawing on practical experience and his own thinking."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"The core of the Kappa architecture is to solve the full-data problem by improving the computation and storage parts of the streaming architecture, so that real-time and batch processing can share one codebase. Kappa assumes that historical data rarely needs to be recomputed, and that when it does, the recomputation can be done by starting a separate instance."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" 
"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"The core ideas of Kappa are:"}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Use Kafka or a similar message queue to collect all kinds of data, and retain as many days of data as you need."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"When a full recomputation is needed, start a new stream processing instance, read the data from the beginning, and write the output to a new result store."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"When the new instance has caught up, stop the old stream processing instance and delete the old results."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"The advantage of the Kappa architecture is that it unifies real-time and offline code, which eases maintenance and keeps metric definitions consistent."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Compared with the Lambda architecture, 
the Kappa architecture has the following advantages and disadvantages:"}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"The Lambda architecture requires maintaining two sets of code, one running on batch and one on the real-time stream, and the two results must be merged. The Kappa architecture maintains only one set of code and runs over the full data only when needed."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Under Kappa, many instances can be started at once for recomputation, which helps with tuning algorithm models and comparing results; under Lambda, changing code is more complicated. With Kappa, engineers only need to maintain one framework, so the cost is low."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Each time Kappa takes in a new data type or format, a custom ingestion program has to be developed, so onboarding takes longer."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Kappa depends heavily on services such as Redis and HBase. Neither storage system is designed for full data storage, so using them that way wastes resources."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Unified 
big data architecture"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/d2\/34\/d269194158b049b2d541bf2755b5a934.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"The architectures above all centre on big data processing. The Unified architecture is more radical: it integrates machine learning with data processing. At its core, Unified is an upgrade of Lambda that adds a machine learning layer to the stream processing layer. After data enters the data lake through the data channel, a model training step is added, and the trained models are used in the streaming layer. The streaming layer not only uses the models but also keeps training them."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"IOTA architecture"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"The IOTA big data architecture is a new data architecture pattern for the AI ecosystem, first proposed by Analysys (易觀) in 2018. Its overall idea is to define a standard data model and use edge computing to distribute computation across data generation, computation, and querying, with a unified data model running through the whole process. This improves overall computing efficiency while still meeting computation needs, and lets users query the underlying data with all kinds of ad-hoc queries."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" 
"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Its main characteristics are:"}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"No ETL: ETL and its related development have always been a pain point of big data processing. Through the design of a Common Data Model, the IOTA architecture focuses on data computation within one specific domain, so computation can start at the SDK side while the central side only collects data, builds indexes, and serves queries, which improves overall analysis efficiency."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Ad-hoc queries: thanks to this computing flow, an event on a phone or smart IoT device can be sent to the cloud the moment it occurs and land in the real-time data area, where the front-end query engine can query it. Users can run all kinds of queries and see events from a few seconds ago without waiting for ETL or streaming development and processing."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Edge computing: computation that used to be centralized is distributed to where data is generated, stored, and queried, and data conforms to the Common Data Model as soon as it is generated. Real-time model feedback is also provided, so clients get feedback as soon as they send data, instead of every event having to be processed centrally and then pushed back down."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" 
"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Perhaps because my exposure is limited, I have not yet come across a company that has fully implemented the IOTA architecture, so I have no first-hand experience to share here."}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Summary"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Each generation of big data architecture emerged for good reasons, although there is no strict dividing line in time between them. Here is a comparison of the architectures:"}]},{"type":"embedcomp","attrs":{"type":"table","data":{"content":"

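Architecture

Advantages

Disadvantages

Applicable scenarios

Offline big data statistical analysis architecture

Simple and easy to understand. For a BI system the basic idea does not change; only the technology choices do, with big data components replacing the BI components.

Big data has no Cube architecture as complete as the one in BI. Kylin exists, but its limitations are obvious and it is far less flexible and stable than a BI Cube, so business support is not flexible enough. Scenarios with many reports or complex drill-downs need a lot of manual customization. The architecture is also still batch-oriented and lacks real-time support.

Data analysis needs are still mainly BI scenarios, but the existing system can no longer meet daily use because of data volume, performance, or similar problems.

Streaming architecture

No bloated ETL process; data is very timely.

There is no batch processing, so data replay and historical statistics are poorly supported. Offline analysis is supported only within the window.

Alerting, monitoring, and other scenarios that require real-time data.

Lambda architecture

Covers both real-time and offline processing, so it handles data analysis scenarios thoroughly.

Although the offline layer and the real-time stream face different scenarios, their internal processing logic is the same, so there are many redundant, duplicated modules.

Cases with both real-time and offline requirements.

Kappa architecture

Removes the redundancy of the Lambda architecture. It is designed around the idea of replayable data, and the whole architecture is very simple.

Although the Kappa architecture looks simple, it is relatively hard to implement, especially the data replay part.

Similar to Lambda; this architecture is an optimization of Lambda.

Unified architecture

Provides an architecture that combines data analysis with machine learning, and addresses well the question of how to integrate machine learning with a data platform.

More complex to implement. Machine learning infrastructure differs greatly from a data analysis platform, from software packages to hardware deployment, so implementation is harder.

Large amounts of data to analyse, together with strong current or planned demand for machine learning.

IOTA architecture

No ETL; supports ad-hoc queries and edge computing.

The code has many bugs, and bug fixes are provided to the community on a paid basis.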

IOTA is used for IoT devices, to connect everything and make systems autonomous."}}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"That covers the architectures. Putting them into practice depends on technology, and I have spent a good deal of time organizing the current big data technology stack."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Big data processing technology stack"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"With the architectures covered, let's look at the big data technology stack and see which technologies serve data collection, data transfer, data storage, computation, IDE and management, and analysis, visualization, and microservices. I spent quite a lot of time putting together the stack shown in the figure below."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/54\/d7\/54508aeyy18e1cf8bbdc9d84af906dd7.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" 
"}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Data is collected, transferred, and landed in the storage layer; the scheduler then triggers computation and data processing jobs, which store the integrated results in the data warehouse and related storage areas."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Data management and data development are done through the management layer or IDE."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"The OLAP, analysis, algorithm, visualization, and microservice layers provide data services and scenario-specific data applications to the outside."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"For now this technology stack is not organized by batch versus streaming technology, which is a slight pity."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Data Mesh: a domain-oriented distributed data architecture"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Data Mesh was proposed around 2019 by Zhamak Dehghani, a principal technology consultant at ThoughtWorks (《How to Move Beyond a Monolithic Data Lake to a Distributed Data 
Mesh》"},{"type":"link","attrs":{"href":"https:\/\/martinfowler.com\/articles\/data-monolith-to-mesh.html","title":null,"type":null},"content":[{"type":"text","text":"https:\/\/martinfowler.com\/articles\/data-monolith-to-mesh.html"}],"marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}]},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"). She combined the problems that arose when implementing enterprise data platforms for clients with microservices from domain-driven design, and arrived at a new, domain-oriented data architecture."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"In enterprise data platform implementations, whether a BI system, a big data (data lake) architecture, or a cloud data platform, the core pattern is invariably the same monolithic architecture. Only its form has changed: from a strictly normalized data warehouse, to a more specialized big data platform (data lake), and finally to a hybrid of several practices."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"These big data platform implementations and solutions are hard to scale and commercialize through simple replication. An enterprise data platform project takes three to five years, and the huge investment keeps the return on investment too low to deliver the expected benefits. "},{"type":"text","marks":[{"type":"italic"},{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"In the original article, Zhamak Dehghani"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" proposes Data Mesh, 
a domain-oriented distributed architecture pattern, based on the current state and drawbacks of enterprise data platform architectures and on a microservices perspective. This pattern has four characteristics:"}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Domain-oriented data architecture"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Self-serve platform infrastructure"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Data-product-oriented management and roles"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Infrastructure delivery capability that is flexible, scalable, and evolutionary"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Here is my own understanding (which may still be fairly shallow):"}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Domain-oriented data architecture: a SaaS-like, plug-and-play capability for data content. For example, data models are designed through domain modeling, as in IBM's IAA model or Teradata's 
financial standard model, with subject areas such as user, party, address, and corporate customer, each with its own data ingestion standard. With templated configuration of data processing flows, statistical metrics, and data asset management, you only need to care about the input to get complete output, and compliance, security, management, and operational work is completed automatically."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Self-serve platform infrastructure: provides the product capabilities needed to implement the functions described in the domain architecture, such as componentized, configurable base templates and tools. Examples include automated data processing management, data modeling through to automated ETL, metric and dimension analysis templates, and data application templates, ultimately achieving automation and scale. (I once designed an automated ETL engine and built a minimal version of it, but many problems remain to be solved.)"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Data-product-oriented management and roles: the value of data is delivered to the outside through data products. Data products may lean toward the business or toward tools and platforms, and both drive the construction of the data platform. Whatever the way, efficiency, or insight with which data value is delivered, tenants build their own data capabilities by using the platform tools."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"I am also thinking about what the data service capabilities offered to enterprises will look like in the future, and what a metadata-driven data middle platform or data platform will look like."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" 
"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Closing remarks"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"In 2015 I wrote an article on "},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"the evolution of big data architecture from the perspective of changes in data team organization. This time I have summarized its development from the perspective of big data processing architecture. Together the two perspectives cover the two key questions in building a data middle platform or data platform."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"The previous article raised a topic: a data middle platform needs organizational backing, and many sources say it must have strong organizational support. Is that really necessary? What is my view? The next article in this series covers the organizational structure of the data middle platform."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" 
"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Related articles:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/www.infoq.cn\/article\/qPl0gtOYYxOhQeSqXpzz","title":null,"type":null},"content":[{"type":"text","text":"Revisiting the Data Middle Platform through Digital Transformation (1): Some Views on Digital Transformation"}],"marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/www.infoq.cn\/article\/zTUxMT25uxRSKe77ET4i","title":null,"type":null},"content":[{"type":"text","text":"Revisiting the Data Middle Platform through Digital Transformation (2): The Data Middle Platform in the Uniqueness Theorem"}],"marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"About the author:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Songzi (Li Boyuan) is a veteran of BI and data products who has worked at several large companies. Since 2016 he has published dozens of original articles, including in-depth series such as 《中臺翻車紀實》, 《從數據倉庫到大數據,數據平臺這 25 年是怎樣進化的》, and 《數據產品三部曲系列》."}]}]}
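
The three Kappa steps described in the article (retain events in a replayable log, start a new instance that reads from the beginning and writes to a new result store, then retire the old instance and its results) can be sketched roughly as follows. This is a minimal in-memory illustration, not a real deployment: `EventLog`, `KappaService`, and the two job functions are hypothetical names invented here, and the list-backed log merely stands in for Kafka or a similar message queue.

```python
from collections import defaultdict


class EventLog:
    """Hypothetical stand-in for a Kafka-like log that retains events for replay."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)

    def read_from(self, offset=0):
        # Replaying simply means reading again from an earlier offset.
        return iter(self._events[offset:])


def count_events_by_page(events):
    """Job version 1 (illustrative): count every event per page."""
    counts = defaultdict(int)
    for event in events:
        counts[event["page"]] += 1
    return dict(counts)


def count_clicks_by_page(events):
    """Job version 2 (illustrative): changed logic, count click events only."""
    counts = defaultdict(int)
    for event in events:
        if event["type"] == "click":
            counts[event["page"]] += 1
    return dict(counts)


class KappaService:
    """Keeps one live result store; redeploying a job replays the whole log."""

    def __init__(self, log):
        self.log = log
        self.stores = {}   # result stores keyed by job version name
        self.live = None   # name of the version currently serving queries

    def deploy(self, version, job):
        # Step 2: a new instance reads from the beginning into a new store.
        self.stores[version] = job(self.log.read_from(0))
        # Step 3: switch over, then drop the old instance's results.
        previous, self.live = self.live, version
        if previous is not None and previous != version:
            del self.stores[previous]

    def query(self):
        return self.stores[self.live]


log = EventLog()
for page, kind in [("home", "view"), ("home", "click"), ("cart", "click")]:
    log.append({"page": page, "type": kind})  # Step 1: retain raw events

service = KappaService(log)
service.deploy("v1", count_events_by_page)
service.deploy("v2", count_clicks_by_page)
print(service.query())
```

The point of the sketch is that the same code path serves both live processing and full recomputation; only the starting offset differs, which is what lets Kappa avoid the second, batch-only codebase that Lambda needs.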
