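From Ad Monitoring to Knowledge Graphs: How Did Mininglamp Build Its Hundred-Billion-Scale Big Data Processing Capability?

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Online shopping, ride-hailing, food delivery, movie tickets... behind every mobile internet scenario is big data technology. After more than a decade of development, big data technology has become basic infrastructure for internet companies."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"It started with Google's “troika”"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Any discussion of big data has to start with Google's “troika”. In 2003 Google published the first paper, on the Google File System (GFS); the following year it published a second, on the distributed computing framework MapReduce; and in 2006 it published the third, on the NoSQL database system BigTable. These three papers ushered in the big data era."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In his book 《大數據浪潮之巔:新技術商業制勝之道》, Xu Fei writes, “Armed with the ‘troika’, Google gained the ability to store and analyze massive amounts of data, and its personalized advertising system became a perpetual money-printing machine, continually generating wealth for Google.”"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Influenced by Google's troika, other internet companies also began experimenting with large-scale distributed systems, hoping to build powerful platforms for data storage, analysis, and processing."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"At the time, however, this was the pre-Hadoop period, and internet companies were largely feeling their way forward."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"A pioneer in data collection and computing"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Among the many internet companies, Miaozhen Systems, founded in 2006, was without doubt an early mover in this field. According to Liu Pei, head of Miaozhen Systems' product and R&D center, Hadoop was not yet mature in 2008, so they built their own big data platform from scratch. “The approach was similar to Hadoop MapReduce, and "},{"type":"text","marks":[{"type":"strong"}],"text":"it could also process billions of records a day"},{"type":"text","text":".”"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Liu Pei joined Miaozhen in 2007, while still a third-year undergraduate. A year later he graduated and stayed on at Miaozhen Systems. He went on to lead research and development of core products including the ad monitoring system AdMonitor. As a veteran of Miaozhen Systems, he witnessed its big data platform go from 0 to 1."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Miaozhen Systems' business is ad monitoring, and its core product is AdMonitor. In AdMonitor's service chain, the front end is responsible for collecting data. Each ad has a piece of code embedded in it that sends a request to a Miaozhen Systems domain. Once the ad is clicked on the media side, it sends the embedded code back to Miaozhen Systems' servers, and the system then knows that an ad exposure has been completed. An ad workflow like this mainly involves "},{"type":"text","marks":[{"type":"strong"}],"text":"data collection, data storage, data computation, and data analysis technologies"},{"type":"text","text":"."}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Collecting data across multiple endpoints"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"So the first question is: how did Miaozhen Systems collect data? According to Liu Pei, in the PC era most data was collected with JavaScript. That required Miaozhen Systems' product to be adapted to every browser, including Firefox, IE, Maxthon, and Dolphin Browser. Cookies were one of the main technologies used for data collection at the time. Beyond cookies, Flash was used as well. Back then almost all ads were Flash, and because a Flash file is itself an executable program, it could be programmed internally: the monitoring code was placed inside it to collect data."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Liu Pei said, “Flash also has a cookie-like concept; the technical term is FSO. We linked FSO and cookies in various ways to achieve persistence. If one was deleted on this side, it could be restored from the other; if the other was deleted, it could be restored from this side. That let us identify a unique user more accurately while protecting user privacy.”"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"By 2012, smartphones had arrived and the number of Android and iOS apps kept growing, so Miaozhen Systems added mobile ad measurement to AdMonitor. SDKs became the main way to collect data on mobile at the time. Liu Pei said, “Android and iOS were both new. We not only had to learn new programming languages but also had to develop in a new technical environment. Once an app was built, it had to be adapted to different models from different manufacturers. Beyond the hardware, it also had to work with the various apps running on the phone.”"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For example, iQIYI, Youku, and Tencent Video are the three mainstream video apps. For the SDK to run on top of them, all kinds of integration testing had to be done up front to ensure everything worked properly. “It must not crash the app, and it must not slow the app down. In addition, the collected data had to match what they reported. So every time a mainstream app was added, we had to do technical integration and data testing,” he said."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In August 2012, Miaozhen Systems officially launched China's first mobile ad-loading SDK, which “was soon added to the mainstream apps”."}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Solving the data storage problem with RAID 5"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Ren Xinqi, then head of operations for Miaozhen Systems' big data platform, told InfoQ that "},{"type":"text","marks":[{"type":"strong"}],"text":"Miaozhen Systems' business volume was very large at the time, accounting for 60% of all ad monitoring traffic nationwide, and the data collection servers handled more than 10 billion page views (PV) a day."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"How do you store that much data? According to Liu Pei, they used RAID (redundant array of independent disks), specifically RAID 5: when data is written to disk, it is split into N-1 parts and written concurrently to N-1 disks, while the parity data is written in rotation across all the disks. This gives RAID 5 both fairly fast access and fairly high data reliability."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In Liu Pei's words, “In a cluster, a piece of data is sliced and stored in different places. If one disk is destroyed, it can still be recovered from elsewhere.”"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"How do you solve computation at the scale of ten billion records?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Once the data has been collected, the key is computation. Ren Xinqi explained that computation falls into two categories. The first is hourly batch computation, which requires the platform to be able to process data at large scale. The second is real-time computation, which must be reliable; otherwise computation is delayed and “the data customers see is inaccurate”."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Reportedly, "},{"type":"text","marks":[{"type":"strong"}],"text":"Miaozhen Systems was handling more than 10 billion records a day at the time"},{"type":"text","text":". The capacity of a single log server was described as “running at full load, it can process 400 million records a day”. In practice, servers were generally run at 50% load, so each log server processed 200 million records a day. By that calculation, about 50 log servers were needed."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"When the data volume exceeds what a single server can handle, the front end has to be split across many servers for load balancing. For example, monitoring code is placed on all kinds of media; each advertiser runs ads on multiple media, each media outlet carries multiple advertisers, and each media outlet also has different ad slots, “so all of this has to be indexed by monitoring code ID”."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Liu Pei said, “Whenever an ad is displayed or clicked, there has to be a unified set of scheduling rules for which server the request is sent to, ensuring that every server bears the same load and that work is divided sensibly among the servers. That way overall performance is best.”"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In terms of computing architecture, because Hadoop was immature at the time, Miaozhen Systems used KFS, an open-source distributed file system. Ren Xinqi said, “We built on KFS and did not use the architecture of the Hadoop 0.x releases, because it was not very stable and its management node was not highly available.” Before version 2.0, Hadoop had only one NameNode, and if it failed the whole cluster went down. So they maintained their own scheduling tool for distributed computing jobs, combining forward-order and reverse-order scheduling and adding some locally targeted scheduling techniques and optimizations."}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"With Hadoop's help, technical capability reaches a new level"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In 2012, Hadoop released version 2.0. It was an entirely new architecture comprising two systems, HDFS Federation and YARN. Compared with version 1.0, it was more stable and more mature, so Miaozhen Systems gradually began adopting it. But the migration was not easy; it took a year to switch over to Hadoop successfully."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Liu Pei said that, on the one hand, the release was unstable, and on the other, everyone was a novice. When a problem came up and they could not find the cause, Liu Pei and his colleagues would go to the Hadoop open-source community and ask whether anyone had run into the same problem. If others had, they would discuss together what to do. For other problems, “nobody else had hit them, so we could only read the source code ourselves and figure out a fix; if we couldn't fix it, we looked for another solution, using something else or writing our own code to implement it”. Later, as failures steadily decreased and the engineers gained experience, the big data platform migrated to Hadoop grew more mature, more stable, and more capable."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In 2014, Miaozhen Systems reached a new height: "},{"type":"text","marks":[{"type":"strong"}],"text":"the ability to process ad requests at a peak daily scale of 100 billion"},{"type":"text","text":"."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Mininglamp, standing on the shoulders of Miaozhen Systems"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In 2012, the concept of big data began to take off. By then the major players in the Hadoop ecosystem were all in the game, including Facebook, LinkedIn, and Twitter as well as the three major Hadoop distribution vendors, Cloudera, MapR, and Hortonworks. The thriving and steadily maturing ecosystem made Hadoop's market prospects look even brighter."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"So a small team was incubated out of Miaozhen Systems with the goal of building customized big data platforms. Thus Mininglamp was born."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Ren Xinqi was transferred to Mininglamp to develop the big data platform. Compared with before, developing a big data platform was relatively easier, because Miaozhen Systems' practice had built up experience and the Hadoop ecosystem was becoming more complete, with more tools available."}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Technology selection"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"According to Ren Xinqi, one milestone in technology selection was that Hadoop 2.0 introduced the NameNode HA framework, adding an election mechanism and control components that allow a configuration with an odd number of management nodes greater than three. When one management node goes down, a second management node is elected immediately; this is a truly highly available state."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Before that, although they had been following Hadoop all along, they had not adopted it. One reason was that Hadoop 1.0 and 1.1 had only one core management node, the NameNode. Later it introduced the Secondary NameNode, giving one active primary management node and one standby node, with the two synchronized in real time. If the primary node's service went down, the standby node would raise an alert and continue managing the cluster. But this was not really high availability, “because the service has to be switched over, and the Secondary NameNode can have problems too”."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"He said, “With Hadoop 2.0, we felt it had reached a basically industrial-grade usable state. As long as the cluster as a whole had no very serious problems, we could gradually improve the details, such as computing efficiency and job scheduling, by modifying the open-source code, adjusting how jobs were executed, or optimizing job strategies.”"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"So Mininglamp moved its entire technology stack onto Hadoop."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In July 2014, Mininglamp released version 1.0 of its big data platform. Version 1.0 was reportedly already quite mature: “Once the servers are racked in the cluster, the operating systems are installed, and the network is up, I can't say it is fully one-click deployment, but a few clicks get the deployment done. "},{"type":"text","marks":[{"type":"strong"}],"text":"In about half an hour you can deploy and install an entire big data ecosystem"},{"type":"text","text":".”"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"That same year, Mininglamp Data won the bid for a China UnionPay project, its first large enterprise customer in China. Ren Xinqi said, “At the time, no mature (big data) deployment system could complete the deployment, installation, and configuration of an entire cluster in half an hour. That was a sign of our maturity.”"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Moving into knowledge graphs"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Building on its existing big data technology, Mininglamp went on to develop a knowledge graph in 2015, with SCOPA as the core product."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"With its big data business thriving, why move into knowledge graphs? Ren Xinqi, now a vice president of Mininglamp Technology Group, gave two reasons. First, knowledge graph technology originated in search engines, which manage all web pages and content as knowledge so as to better understand users' search intent and deliver the content and results they want. Second, differentiated competition. He said, “If you can take large volumes of structured data and, instead of just computing reports and running queries in a simple data warehouse as before, change your thinking, extract the meaning inherent in the data, organize it into business knowledge, organize the data more effectively, and add value to it, then you can differentiate yourself from the many companies in the industry doing general-purpose big data processing.”"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"However, he also admitted that building knowledge graphs on large volumes of data is no small challenge."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Challenge one: the data volume is very large, which involves the entire real-time data processing capability, including data fusion and data conflict problems. At the same time, there was nothing in the industry to refer to."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Challenge two: a domain knowledge graph has to be built for each industry. “This is much like the expert systems of the past. How valuable a knowledge graph is depends on how the industry's domain knowledge graph is defined. For every industry, the design has to be worked out with business experts and then iterated on and improved continually, which is hard.”"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Challenge three: knowledge graphs have to be combined with AI techniques. The main use case for knowledge graphs is “fishing knowledge out of big data”, and the most basic elements are entities and relationships. According to Ren Xinqi, two things have to be done for entities: data fusion, and giving each entity clear labels. But there are a great many kinds of entities, and labeling them requires many AI techniques. The quality and quantity of relationships, meanwhile, determine the quality of the whole knowledge graph's organization: “If relationships are not handled well, the usability of the whole knowledge graph drops, and its recommendation, inference, and cross-analysis cannot be put to use. Handling relationships also requires many AI techniques.”"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"More importantly, compared with before, knowledge graphs place higher demands on the underlying technology platform. For this reason, Ren Xinqi and his team decided in 2015 to build a hybrid knowledge graph database. This hybrid knowledge graph had to solve three core problems:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"First, the knowledge graph must support full-text, lookup-style indexed queries, for example locating a node in the knowledge graph from a keyword, which requires a full-text search system;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Second, the knowledge graph involves many conditional queries, such as conventional big data computation that queries and runs statistical analysis by a given key or ID;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Third, the knowledge graph must have a graph, to carry out relationship inference, including relationship storage."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This calls for full-text search, big data storage, and a graph all at once. The three stores also have to be fused together, with unified indexing and management."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"According to Ren Xinqi, their solution was to "},{"type":"text","marks":[{"type":"strong"}],"text":"fuse Elasticsearch, HBase, and the graph database Titan through consistent indexing"},{"type":"text","text":", including unified routing of data storage and performance optimization."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"He said, “Once this problem was solved, questions such as how to define the business and how to describe the graph's semantics could all be handled with this hybrid database. For large-scale data fusion, real-time data computation, or high-performance computing, the hybrid knowledge graph database can use its different capabilities to support daily updates or even real-time updates.”"}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Technical architecture of Mininglamp's knowledge graph"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The architecture of Mininglamp's knowledge graph is shown in the figure below:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/6b\/6b9e57ee1e878d69b85d6e876d74b827.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In this architecture, the front end handles data ingestion and data aggregation. The data is then cleaned and the knowledge graph is built. Within the knowledge graph there is entity construction, entity label construction, relationship construction, and so on, as well as construction of event-type or behavior-type graph data. This is a complete basic data processing flow. The data is then loaded into the graph database. On top of that sits a visual, interactive analysis system based on the knowledge graph."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The knowledge graph's technical architecture is still centered on Hadoop. For data ingestion, Flume was used at first (it has since been replaced by Kafka). According to Ren Xinqi, “When connecting to database systems, we use Sqoop 1.0 and 2.0. After extraction, data that is neither log-type nor table-type is pulled onto the platform with scripts and landed in HDFS; structured data is landed directly as Hive tables. Data cleaning, fusion, transformation, and knowledge graph construction are all done on the Hive layer, and the data governance process as a whole is basically implemented with Spark. For real-time computation, we chose near-real-time Spark Streaming, because that avoids bringing in more related components.”"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In short, the architecture and underpinning of the core graph store is essentially a composite hybrid database built around three stores: Elasticsearch, HBase, and Titan."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"At the end of 2015, Mininglamp's knowledge graph was reportedly deployed at the municipal public security bureau of a provincial capital in China, doing data analysis for the police, including lead mining and gang early warning, to help solve cases."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"From 2016 to 2017, Ren Xinqi led the team in exploring knowledge graph deployments and applications in more industries. Today Mininglamp's knowledge graph is widely used in public security, finance, industry, digital cities, and other fields."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Looking back on 15 years of big data"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In 2019, big data entered the post-Hadoop era. Real-time architectures and components of all kinds developed on a large scale, and big data technology became deeply integrated with cloud native and artificial intelligence."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Looking back on the development of big data over the past years, Ren Xinqi sums it up in three stages:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":1,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"Stage one: "},{"type":"text","marks":[{"type":"strong"}],"text":"the early days of big data, dominated by hardware sales and hype"},{"type":"text","text":". Around 2010, many large enterprises, influenced by the market and publicity, built big data platforms, but the platforms delivered little because they were detached from the business."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"Stage two: "},{"type":"text","marks":[{"type":"strong"}],"text":"big data develops further, dominated by analytics"},{"type":"text","text":". By 2014, enterprises had a deeper understanding of big data and were collecting more data to support business decisions."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"Stage three: "},{"type":"text","marks":[{"type":"strong"}],"text":"big data becomes mature and stable, dominated by real-time analytics"},{"type":"text","text":". Architecturally, the Lambda and Kappa architectures are widely popular, Flink and Kafka are used more and more broadly, and the business demands ever more timeliness. “Real-time analysis means real-time decisions and real-time value, which directly affects business systems.” Take banking: when someone applies for a loan, the bank has to run big data risk control and analyze in real time to decide whether to lend. This stage therefore demands stronger real-time capability, more lightweight components, and more advanced technology."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Ren Xinqi said, “Big data has now reached a stage of refinement.”"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In the past, people's understanding of data was single-point, isolated, and shallow: for example, aggregate the data first and then slowly mine and analyze it. But that can pull in large amounts of invalid or irrelevant data, which hurts the business value of the data system as a whole. In recent years people have come to see data differently: more data is not necessarily better, and how data is stored, used, and made to generate more value needs to be planned. This requires big data to become ever more refined and precise."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Closing thoughts"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"From Google's “troika” in 2003 to today, big data technology has gone through more than a decade of development, and Mininglamp has witnessed its journey from hype to real-world deployment to large-scale adoption. As early as 2007, Mininglamp committed itself to the big data industry and built a mature big data platform from zero to one, solving the problems of big data storage and computation. Later, building on the big data capabilities accumulated at Miaozhen Systems, Mininglamp successfully developed its knowledge graph platform, which has been widely adopted across industries. Today big data technology is converging with cloud native and AI, and being data-driven has become a consensus. As an industry pioneer, Mininglamp has kept investing in technology without pause, so that data can generate greater value and play a greater role."}]}]}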