AIOps at Meituan: Exploration and Practice (Fault Discovery)

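{"type":"doc","content":[{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"1. Background"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In the early days, most operations work was done by hand. Manual operations became unsustainable as internet businesses expanded rapidly and labor costs rose. Automated operations emerged in response: scripts with predefined rules, triggered automatically, carry out common and repetitive operations tasks, which reduces labor costs and improves efficiency. In essence, automated operations can be viewed as an expert system built on industry domain knowledge and operations-scenario knowledge. As internet businesses grew sharply and service types became more complex and diverse, expert systems “based on manually specified rules” increasingly fell short, and the limitations of automated operations became more and more apparent. Meituan currently faces the same difficulty in business monitoring and operations."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The emergence of DevOps partly solved these problems. DevOps takes a global view of value delivery but puts more emphasis on horizontal integration and connecting teams, whereas AIOps is an advanced realization of DevOps on the operations (technical operations) side; the two do not conflict. AIOps does not rely on manually specified rules. Instead, machine learning algorithms continuously learn from massive operations data (including the events themselves and the logs of how operations staff handled them) and keep distilling and summarizing rules. On top of automated operations, AIOps adds a machine-learning-based brain that directs the monitoring system to collect the data it needs, performs analysis and makes decisions, and directs automation scripts to carry out those decisions, thereby achieving the overall goals of the operations system. In short, the level of operations automation is an important cornerstone of AIOps, and AIOps builds on automated operations to combine AI with operations. This process requires three kinds of knowledge:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Industry and business domain knowledge: accumulated experience tied to the characteristics of the business, and familiarity with the hard problems in production practice."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Operations domain knowledge, such as metric monitoring, anomaly detection, fault discovery, fault mitigation (loss containment), cost optimization, capacity planning, and performance tuning."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Algorithm and machine learning knowledge: turning practical problems into algorithmic problems. Commonly used algorithms include clustering, decision trees, and convolutional neural networks."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":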
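null},"content":[{"type":"text","text":"The Meituan technical team has long experience in industry and business domain knowledge and in operations domain knowledge. It has built up many tools and products, achieved automated operations, and obtained some initial results in AIOps. Through continued investment, iteration, and research in AIOps, we hope to apply the industry, business, and operations knowledge accumulated so far, so that AIOps can empower business R&D, product, and operations teams and raise the productivity of the whole company."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"2. Technical Roadmap"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2.1 Building AIOps Capabilities"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"AIOps can be built up step by step: start from nothing with exploration at a single local point, obtain initial results there, then refine that single-point capability into an operations AI learnware that solves one local problem, and finally combine several AI-enabled single-point operations capabilities into an intelligent operations workflow. The evolution path commonly used in the industry is as follows:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Starting to experiment with AI capabilities, with no reasonably mature single-point application yet."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Having AI operations capability for a single scenario, and able to form preliminary learnware for internal use."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Having workflow-level AI operations capability that chains together several single-scenario AI operations modules, and able to provide reliable operations AI learnware externally."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Having intervention-free, workflow-level AI operations capability for all major operations scenarios, and able to provide reliable AIOps services externally."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Having a central core AI that can balance cost, quality, and efficiency with ease, meet the different targets for these three dimensions at different stages of the business life cycle, and achieve the optimum, or an on-demand optimum, under multiple objectives."}]}]}]},{"t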
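ype":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Learnware, also called an AI operations component [1] (a concept proposed by Professor Zhou Zhihua of Nanjing University), resembles an API or shared library in a program. An API or shared library, however, contains no concrete business data and is only an algorithm, whereas an AI operations component (learnware) builds on an API-like interface and also has a “memory” for intelligently solving a particular operations scenario: the intelligent rules for handling that scenario are stored in the component. Learnware = Model + Specification. The AIOps capability framework is shown in Figure 1:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/c3\/c30acbd8dcfce85e9a35e9652238ccf6.png","alt":"Image","title":"Figure 1 AIOps capability framework","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2.2 Building the Related Teams"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"By function, the people working on AIOps fall into three teams: the SRE team, the development engineering team (focused on stability assurance), and the algorithm engineering team. Each plays a different role in AIOps work, and none of the three is dispensable. SREs distill requirements for intelligent capabilities from the technical operation of the business, think through the solution before development starts, and continuously operate on product data after launch. Development engineers build the platform features and modules that lower the barrier to use and improve user efficiency. Depending on the company's AIOps maturity and capability, the weight given to developing the operations automation platform versus the operations data platform differs. In engineering delivery, they need to account for fault tolerance, robustness, and scalability, split tasks sensibly, and make sure results are delivered. Algorithm engineers understand and organize the requirements coming from SREs, survey and try industry solutions, related papers, and algorithms, produce the final algorithm design for production, and keep iterating on it. The relationship among the teams is shown in Figure 2:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/af\/af04bbc915ca91d77066058a202e2a4c.png","alt":"Image","title":"Figure 2 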
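AIOps related-team relationship diagram","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2.3 Evolution Roadmap"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"At present our most urgent need is quality assurance, so the Service Operations department began exploring AIOps in the area of fault management. In a fault management system there are four core capabilities from the start of a fault to its end: fault discovery, alert delivery, fault localization, and fault recovery. Fault discovery covers metric forecasting, anomaly detection, and fault prediction, and its main goal is to discover faults promptly and accurately. Alert delivery covers convergence, aggregation, and suppression of alert events, and its main goal is to reduce noise through aggregation and cut down interference. Fault localization covers data collection, root cause analysis, correlation analysis, and intelligent analysis, and its main goal is to locate the root cause promptly and precisely. Fault recovery covers traffic switching, contingency plans, and degradation, and its main goal is to recover promptly and reduce business losses. The relationships are shown in Figure 3:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/23\/23be404b23d90335e27cab8040c7a825.png","alt":"Image","title":"Figure 3 Relationships among the core capabilities of the fault management system","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In making fault management intelligent, fault discovery is the very first link in the chain. With the current massive number of metrics, the need for automatic fault discovery and automatic anomaly detection is especially urgent. It can greatly reduce the cost for R&D of configuring strategies, improve alert accuracy, and reduce alert storms and false alerts, which in turn improves R&D efficiency."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In addition, time-series anomaly detection is a foundational capability: the later stages of alert delivery, fault localization, and fault recovery all involve large numbers of metrics that need anomaly detection. We therefore made fault discovery the current focus of exploration, aiming to solve the core pain points in today's massive-data setting: manually configuring and operating alert strategies, alert storms, and low accuracy. The exploration and evolution roadmap of the whole AIOps system is shown in Figure 4. Each stage has its own product line: fault discovery is Horae (a project jointly built by Meituan's Service Operations department and Trading System Platform department), alert delivery is the Alert Center, fault localization is Radar, and fault recovery is Radar contingency plans."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.or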
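g\/infoq\/f3\/f3f69a66c5a32e70bbcd51442485321f.webp","alt":"Image","title":"Figure 4 AIOps evolution roadmap for fault management","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"3. AIOps for Fault Discovery"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3.1 Fault Discovery"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Meituan's existing monitoring systems show that the vast majority of monitoring data is time-series data, and time-series monitoring plays an important role in fault discovery across the company. Whether it is the infrastructure monitoring systems CAT [2], MT-Falcon [3], and Metrics (app-side monitoring), or the business monitoring systems Digger (food delivery business monitoring) and Radar (a fault discovery and localization platform), all of them perform anomaly monitoring on time-series data to judge whether the business is running normally."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The massive set of time-series metrics, however, comes in many kinds with complex relationships (see Figure 5). In terms of the metrics themselves, there are periodic patterns, regular spikes, overall upward or downward shifts, and off-peak periods. In terms of influencing factors, there are holidays, temporary campaigns, weather, and the pandemic. It has become increasingly difficult for the fixed-threshold strategies of the existing monitoring systems to cover all these scenarios, and because the number of metrics is so large, the labor cost of configuring and tuning strategies grows multiplicatively."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"If suitable strategies could be matched to each metric automatically, without human involvement, it would greatly reduce the time SREs and developers spend configuring and operating strategies, let them put more energy into business development, and thereby create more business value and better serve the business and its users."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/36\/362c62a9b2e7f1fe8ed3fd63a2b0e315.png","alt":"Image","title":"Figure 5 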
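Diversity of time-series data types","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3.2 Automatic Classification of Time-Series Data"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In time-series anomaly detection, different types of series usually need different alert rules. A CPU load curve, for example, tends to fluctuate sharply. With a fixed threshold, momentary surges frequently cause false alerts, and SREs and developers have to keep adjusting thresholds and detection windows to reduce them. At present, the dynamic-threshold strategy provided by the Radar monitoring system (an internal Meituan system), which refers to historical data, avoids this to some extent. If the system could determine the type of a time series in advance and recommend a reasonable strategy configuration, the alert configuration experience would improve and configuration could even be automated. Moreover, classifying time series is usually the first step toward intelligent anomaly detection: only with automatic classification can the matching strategy be applied automatically."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"There are currently two main approaches to time-series classification: unsupervised clustering and supervised classification. Yading [4] is a large-scale time-series clustering method that uses PAA for dimensionality reduction and density-based clustering to cluster quickly. Unlike K-Means, K-Shape [5] uses cross-correlation statistics: based on the properties of cross-correlation it proposes a novel way to compute cluster centroids, and its distance computation preserves the shape of the time series as far as possible. KPI clustering methods also fall into two groups: those that require the number of clusters to be specified in advance (such as K-Means and K-Shape), and those that do not (such as DBSCAN). For the latter, the resulting partition depends on the model parameters and the samples. As for supervised classification, classic algorithms include logistic regression and SVM."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3.2.1 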
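Classifier Selection"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Based on the characteristics of the time series in our current monitoring systems and on industry practice, we abstract all metrics into three classes: periodic, stationary, and irregularly fluctuating [6]. Our exploration went through three stages: classification with a single classifier, ensemble voting over several weak classifiers, and classification with a convolutional neural network."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Single-classifier classification"},{"type":"text","text":": We trained three classifiers: SVM, DBSCAN, and One-Class-SVM (S3VM). The average classification accuracy reached about 80%, but accuracy on irregularly fluctuating metrics was only about 50%, which did not meet our requirements."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Ensemble of weak classifiers"},{"type":"text","text":": Following ensemble learning principles, we combined the SVM, DBSCAN, and S3VM classifiers through voting to improve accuracy. The final classification accuracy rose by 7 percentage points to 87%."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Convolutional neural network classification"},{"type":"text","text":": Drawing on work that classifies Human Activity Recognition (HAR) data [7], we implemented a classifier with a CNN (convolutional neural network). It performs very well on time-series classification, with accuracy above 95%. During training a CNN learns the features of the time series layer by layer, so no costly feature engineering is required, which greatly reduces the feature design workload."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3.2.2 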
Classification Workflow"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"We chose the CNN classifier for time-series classification. The process is shown in Figure 6, and the main steps are as follows:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/2f\/2f9a7f04aee43ff62c606b6bec06b065.png","alt":"Image","title":"Figure 6 Time-series classification workflow","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Missing-value imputation"},{"type":"text","text":": Time series may have a small amount of lost data or periods with no data, so missing values are filled in before classification."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Standardization"},{"type":"text","text":": We process the time series with variance standardization."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Dimensionality reduction"},{"type":"text","text":": At one-minute granularity a day has 1440 points. To reduce computation, we reduce this to 144 points. PCA, PAA, and SAX are commonly used dimensionality reduction methods that lower the dimension while preserving the characteristics of the data as much as possible. In our comparison, PAA retained more detail of the series when reduced to the same dimension (144). The comparison is shown in Figure 7."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":4,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Model training"},{"type":"text","text":": The labeled samples are used to train the CNN classifier, which finally outputs the classification model."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":5,"align
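":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/80\/80056ad9015c84b97bb3e9aae63f898a.png","alt":"Image","title":"Figure 7 Comparison of the PAA and SAX dimensionality reduction methods","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3.3 Anomaly Detection for Periodic Metrics"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"3.3.1 Anomaly Detection Method"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"With the classification work above, we can divide time series fairly accurately into periodic, stationary, and irregularly fluctuating types. Of the three, the periodic type is the most common, accounting for more than 30%, and it includes most business metrics: core metrics such as request volume and order count are all periodic. We therefore chose periodic metrics first for our exploration of automatic anomaly detection. For a large number of time series, rule-based judgment is no longer adequate. A general solution is needed that can detect anomalies in all periodic metrics rather than a completely separate strategy per metric, and machine learning is the first choice."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The Opprentice paper [8] and Tencent's open-source Metis [9] use supervised learning for anomaly detection. The approach is as follows: first label samples to obtain a sample dataset, then extract features to obtain a feature dataset, train on the feature dataset with the chosen learning system to obtain an anomaly classification model, and finally use the model for real-time detection. The overall idea of supervised learning [10] is shown in Figure 8, where (x_1,y_1),(x_2,y_2),…,(x_n,y_n) is the training dataset. The learning system learns a classifier P(Y∣X) or Y=f(X) from the training data, and the classification system uses the learned classifier to classify a new input instance x_{n+1} and predict its output class y_{n+1}."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/fc\/fc6fdf2a701e177de2e29561a41dbab3.webp","alt":"Image","title":"Figure 8 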
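Supervised learning applied to classification","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"3.3.2 Anomaly Injection"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In general, if the ratio of positive to negative samples in a dataset is extremely imbalanced (for example 1:5, or even more skewed), the classifier will lean toward the majority class. If negative samples dominate, recall on negatives will be very high and recall on positives low, while overall accuracy still looks good. In an extremely imbalanced dataset, because the learner learns from every data point, the majority class carries more information than the minority class, which interferes with training and leads to inaccurate classification."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In real production environments, anomalous points in time series are very rare: more than 99% of the data is normal. Labeling samples from real production data would leave the positive and negative classes severely imbalanced, so precision and recall would fall short of requirements. To address the shortage of anomalous points in supervised anomaly detection, we designed an automatic anomaly injection algorithm for periodic metrics that keeps the injected anomalies sufficiently random and covers a variety of anomaly scenarios."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Time-series anomalies come in two basic types, abnormal rises and abnormal drops, as shown in Figure 9 (the data in the figure was labeled with Curve [11]). An anomaly usually lasts for a while and then gradually recovers, quickly or slowly, affecting the values on both sides of the anomaly. We call this the ripple effect, by analogy with the ripples that spread when a stone falls into water. Inspired by this, the anomaly injection idea and steps are as follows:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/97\/979db9b9d975f762ed65cd7b04431725.png","alt":"Image","title":"Figure 9 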
Distribution of anomalous data in anomaly cases","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"Given a time series S, decide the number N of anomalies to inject and split the series into N blocks."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"Within one block X, randomly choose a point Xi as the anomaly seed."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"Set a range for the number of anomalous points and draw a random count n from it. The anomalous points are distributed randomly on both sides of the seed, with the counts on the left and right also chosen at random."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":4,"align":null,"origin":null},"content":[{"type":"text","text":"For each anomalous point, take the data in a neighborhood around its position as the reference dataset m. The neighborhood size is chosen at random within a configured range."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":5,"align":null,"origin":null},"content":[{"type":"text","text":"Generate a random number: if it is odd the anomaly is a rise, otherwise a drop. Based on the reference dataset m and the 3-sigma rule, generate a value beyond ±3σ as the anomalous value."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":6,"align":null,"origin":null},"content":[{"type":"text","text":"Set a range for the region of influence and draw a random size from it. The extent on the left and right is also assigned at random, and the decay mode of the anomaly is chosen at random from three options: simple moving average, weighted moving average, and exponentially weighted moving average."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":7,"align":null,"origin":null},"content":[{"type":"text","text":"The steps above cover only sudden rises or sudden drops. For scenarios with both a rise and a drop, generate one rising and one falling anomaly as described above and superimpose them."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":8,"align":null,"origin":null}},{"type":"paragraph"
,"attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The injection steps above simulate fairly well the various anomaly scenarios that periodic metrics encounter in production. Because the data at every step is generated randomly, each generated anomaly case is different, which gives us enough anomalous samples. To keep the sample set highly accurate, we also label the metric data after injection so as to remove injected points that are not actually anomalous. The result is shown in Figure 10, where the blue line is the original data and the red line is the injected anomaly. The injected anomalies resemble real production faults and show considerable randomness."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/ce\/cec4ad0fef91f03eafa4451fa34058e6.png","alt":"Image","title":"Figure 10 Effect of anomaly injection","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"3.3.3 Feature Engineering"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For periodic metrics, once labeling has produced a sample dataset, feature extractors must be designed to extract features. Several of the feature extractors designed in Opprentice are shown in Figure 11:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/64\/642995d8e4ba49882b8e7d5463fb3c09.png","alt":"Image","title":"Figure 11 
Feature extractors from the Opprentice paper","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"These features are mainly simple detectors, such as fixed threshold, differencing, moving average, and SVD. Metis divides features into three groups: statistical features, such as variance, mean, and skewness; fitting features, such as moving average and exponentially weighted moving average; and classification features, such as autocorrelation and cross-correlation. Drawing on these feature extraction methods, we designed our own feature engineering. Unlike the methods above, we add a layer of feature abstraction by applying Isolation Forest to the extracted results, which gives the model stronger generalization. The selected features and their descriptions are shown in Figure 12:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/fe\/fe477d7e01d5bc85d7a9ff40575e9891.png","alt":"Image","title":"Figure 12 Selected features and descriptions","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"3.3.4 
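Model Training and Real-Time Detection"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Following the way supervised learning is applied to classification, the design of automatic anomaly detection for periodic metrics is shown in Figure 13. It has two main parts, offline model training and real-time detection: model training produces a classification model from the sample dataset, and real-time detection uses that model to detect anomalies in real time. The process is as follows:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Offline model training"},{"type":"text","text":": Features are extracted from the labeled sample dataset with the designed feature extractors to produce a feature dataset, which is used to train a classification model with XGBoost. The model is then stored."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Real-time detection"},{"type":"text","text":": In online detection, the time series first passes through a pre-detection step (which lowers the probability of entering feature extraction and reduces computational load). Features are then extracted according to the designed feature engineering, and the offline-trained model is loaded to classify the data as anomalous or not."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Data feedback"},{"type":"text","text":": If the data is judged anomalous, an alert is sent. Users can then give feedback on the alert based on the actual situation, and the feedback is added to the sample dataset and used to update the detection model periodically."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":4,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/8f\/8f3e37d2bf5785bb78eabf16f049bcb8.png","alt":"Image","title":"Figure 13 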
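Model training and real-time detection flow","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"3.3.5 Optimizations for Special Scenarios"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Through the work above we obtained a complete, working anomaly detection system for periodic metrics. When applying it in production we also ran into a number of problems: large fluctuations during off-peak periods (small values), holiday and weekend trends that differ completely from workday trends, large overall upward or downward shifts in the data, and regular fluctuations that are offset along the time axis. All of these can cause false alerts. We propose an optimization for each of these scenarios to reduce false alerts on periodic metrics and improve the precision and recall of anomaly detection."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"1) Off-peak periods"},{"type":"text","text":": Off-peak periods are characterized by small values with high volatility. Fluctuations are common in off-peak periods, but regular detection only looks at data within a 7-minute neighborhood before and after the current point, so it may miss large fluctuations that have already occurred several times and misjudge the current one as an anomaly. For off-peak periods the comparison window therefore needs to be widened to take in more normal large fluctuations and reduce misjudgments. In Figure 14, red is today's data and gray is the data from the same day last week. With window w1, the blue point inside w1 might be considered anomalous. When the window is widened to w2, the large fluctuations at both the blue and green points are captured, and similar large fluctuations are no longer judged anomalous. The off-peak range itself can be identified by computation over historical data."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/46\/466816e3f3ab36f341a820b775eafbc3.webp","alt":"Image","title":"Figure 14 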
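Similar large fluctuations at different times during off-peak periods","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"2) Holidays"},{"type":"text","text":": Data on the day before a holiday and in the week after it can differ substantially from the normal periodic trend, which may cause false alerts. We configure in advance the dates that need holiday detection. Within the configured date range, in addition to the normal detection flow, data points already detected as anomalous enter a holiday detection flow, and an alert is triggered only if both flows judge the point anomalous. Holiday detection takes the most recent 1 h of data and computes its fluctuation ratio, week-over-week change, and day-over-day change. The data is considered truly anomalous only if Isolation Forest judges all of these indicators at the current time to be anomalous. In addition, the model's sensitivity is lowered appropriately on holidays to suit the scenario."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"3) Overall upward\/downward shift"},{"type":"text","text":": The scenario is shown in Figure 15. A rise\/drop ratio is configured, for example 80%. If 80% of the points in the most recent 1 h of today's data are higher than both yesterday and last week, the series is considered to have shifted up overall; if they are all lower, it is considered to have shifted down overall. When an overall upward shift occurs, the model's sensitivity is lowered, and the current data is judged anomalous only if the current day-over-day and week-over-week values are both anomalous points within the 1 h of data."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/34\/345dac0ebc2887993a7ffcd1f0eb0ea3.webp","alt":"Image","title":"Figure 15 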
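Overall upward and downward shifts","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"4) Time-shifted regular fluctuations"},{"type":"text","text":": Some metrics fluctuate periodically, but the timing shifts. In the case in Figure 16, a shift in fluctuation time caused a false alert. We designed a similar-sequence identification algorithm that searches historical data for sequences with similar fluctuations. If enough similar sequences exist, the fluctuation is considered normal. Similar sequences are extracted as follows: take the most recent n minutes of the series as the base sequence x; fetch the data in a neighborhood of the detection time (for example 30 min before and after) over the previous 14 days; slide over the neighborhood data with a specified sliding window (for example 3 min) to split it into a set Y of sequences of length n; compute the DTW distance between x and every sequence in Y; and apply Isolation Forest to the resulting distances. Sequences whose DTW distance is flagged as an outlier are treated as similar sequences. If the number of similar sequences exceeds a configured threshold, the current fluctuation is considered a regular shifted fluctuation and therefore normal. The similar sequences extracted this way are shown on the right of Figure 16, where the solid red line is the base sequence."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/1a\/1ae8f5f932ee660b3e410ee93b6fb619.webp","alt":"Image","title":"Figure 16 Extracting similar sequences for time-shifted regular fluctuations","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3.4 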
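Turning Anomaly Detection into a Platform"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To put the results of this exploration into production, the Service Operations department and the Trading System Platform department designed and built Horae, a time-series anomaly detection system. Horae focuses on orchestrating and tuning time-series anomaly detection workflows. Its input is time-series data and its output is detection workflows and detection results. Its core algorithms are anomaly detection algorithms, time-series forecasting algorithms, and feature extraction algorithms for time series. Horae also develops anomaly detection algorithms for special scenarios. Its core capability is to orchestrate different detection workflows from the available algorithms, classify metrics automatically, select a suitable detection workflow for each metric according to its type, and tune the workflow to find the best parameters for that metric. This ensures a good fit to the metric and higher precision and recall, so that Horae can provide highly accurate anomaly detection to every team that needs it."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"3.4.1 Horae System Architecture"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Horae consists of four modules: data ingestion, real-time detection, the experiment module, and the algorithm module. Through the data ingestion module, users register the message queue carrying the time series to be monitored. Horae listens on the registered topic to collect the data and updates each time series according to its granularity (for example minute, hour, or day). Every point is stored in a time-series database. The real-time detection module runs anomaly detection on each point and, when an anomaly is detected, sends the anomaly information to the user through a message queue. Figure 17 shows the overall architecture of Horae in detail."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Data ingestion"},{"type":"text","text":": Users create a data source for reporting data. A data source can contain one or more metrics, and the finest metric update granularity is one minute. Time series of metrics in different data sources are isolated from one another, and the data is written to a time-series database built by adapting Elasticsearch."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Real-time detection"},{"type":"text","text":": Once real-time data is collected, anomaly detection runs immediately according to the execution workflow bound to the metric. If training and tuning have not yet been done, they run first to ensure better precision and recall. Detection results are delivered to users through a message queue, and users process them further to decide whether an alert should be sent."}]}]},{"type":"listitem","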
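attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Experiment module"},{"type":"text","text":": The main functions of this module are sample management, algorithm registration, workflow orchestration, model training and evaluation, and model publishing. Sample management lets users label and store samples. Implemented algorithms can be registered in the system for use in workflow orchestration. Workflows produced by orchestrating algorithms can be used for model training or anomaly detection. A training workflow uses labeled samples and calls the offline algorithm service to train offline and store the model. For an orchestrated detection workflow, users can run a simulation on a metric to observe the detection results, or run an offline regression to measure the workflow's precision and recall on that metric."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":4,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Algorithm module"},{"type":"text","text":": The algorithm module provides the implementations of all anomaly detection algorithms registered in the experiment module. It can run a single algorithm or execute a workflow orchestrated from several algorithms. The main algorithm categories currently supported are preprocessing algorithms (outlier removal, missing-value imputation, dimensionality reduction, normalization, and so on), time-series feature algorithms (statistical, fitting, and classification features, and so on), machine learning algorithms (RF, SVM, XGBoost, GRU, LSTM, CNN, clustering algorithms, and so on), detection algorithms (Isolation Forest, LOF, SVM, 3-sigma, quartiles, IQR, and so on), forecasting algorithms (EWMA, Linear Weighted MA, Holt-Winters, STL, SARIMAX, Prophet, and so on), and custom algorithms (shape-change analysis, same-period and period-over-period comparison, fluctuation ratio, and so on)."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":5,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/c9\/c9edda106a524207fda67a08905131bd.png","alt":"Image","title":"Figure 17 Architecture of the Horae time-series anomaly detection system","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"3.4.2 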
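Algorithm Registration and Model Orchestration"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"An algorithm model is an abstraction of an algorithm and is identified by a unique string. Registering an algorithm requires specifying its type, interface, parameters, return value, and the configuration of the time-series data that must be loaded to process a single data point. Depending on its type, a successfully registered algorithm model produces either an algorithm component for model orchestration or a component for training anomaly detection models. Algorithm components for model orchestration mainly include preprocessing, time-series feature, evaluation, forecasting, classification, and anomaly detection algorithms. Algorithms for model training fall into two categories: parameter tuning and machine learning model training."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A registered algorithm model is usually not used for anomaly detection directly. Instead, algorithm models are orchestrated into a flow model, which can then run anomaly detection or training. The experiment module supports two types of flow model: execution flows and training flows. An execution flow is an anomaly detection flow: given a metric and a detection time range, it produces an anomaly score for every data point in that range. A training flow is a flow model that performs training, and mainly includes parameter-tuning training flows and machine learning model training flows. Figure 18 below shows flow orchestration with algorithms: the menu on the left lists the algorithm components, the central area is where the execution flow is arranged and adjusted, and the area on the right holds the parameter settings of the selected algorithm."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/36\/369de7acd15efba4ab374c6756dfad60.png","alt":"Image","title":"Figure 18 Flow orchestration","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"3.4.3 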
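Offline Training and Real-Time Detection"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"During model orchestration, both execution flows and training flows can be orchestrated. Execution flows are used mainly for real-time anomaly detection on metrics, while training flows are used mainly for offline model training and parameter tuning. An execution flow consists of a flow configuration and an anomaly score configuration, and is scheduled by the flow scheduling engine of the experiment module; Figure 19 below shows its composition in detail. Before scheduling an execution flow, the engine starts from the algorithm component at the deepest leaf node of the flow and recursively computes the time-series data sets that must be loaded. It then loads the results of prerequisite training flows according to the parameter configuration of the algorithm components, and finally schedules the algorithm components in order to obtain an anomaly score for every data point in the detection time range. The resulting execution flow orchestration is shown in Figure 18."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/9e\/9e1db9b70530271adbddc0e5fc98c19c.png","alt":"Image","title":"Figure 19 Composition and processing of an execution flow","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A training flow consists of a flow configuration, a training algorithm, a sample loading configuration, the training frequency, and related information, and is also scheduled by the flow scheduling engine of the experiment module; Figure 20 below shows its composition in detail. Training flows fall into two categories: parameter-tuning training and machine learning model training. In parameter-tuning training, each parameter to be tuned is given an iteration range or a set of enumerated values, and a Bayesian tuning algorithm searches for the best parameter combination. In machine learning model training, the feature engineering is designed first, then a classifier and hyperparameter ranges are set, and tuning produces a machine learning model file. The samples used for training come from manually labeled samples or from samples generated by the automatic sample construction feature. Orchestrating a training flow is similar to orchestrating an execution flow, except that a training flow can be run as a scheduled task and its results are stored for later use."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/2e\/2ec1699b1bf4f3e17aae4acf569c4c42.webp","alt":"Image","title":"Figure 20 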
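Composition and processing of a training flow","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"An anomaly detection model contains many hyperparameters that are set by experience, and different metrics may need different values to achieve higher precision and recall. Because metric data changes over time, the parameters need to be retrained and updated regularly. In the experiment module, a training flow can be given a scheduled task; the module then schedules the training flow periodically to generate offline training tasks, and the training results and their effect can be viewed once a task finishes. Figure 21 below shows an example result of a parameter-tuning training flow."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/b7\/b78e892594f753eea2561b4cf8ec8437.webp","alt":"Image","title":"Figure 21 Example of parameter-tuning training results","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"3.4.4 Model Case Study and Result Evaluation"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Based on the results of the exploration on periodic metrics, a classification model training flow was orchestrated on Horae. A total of 28,000 samples were used for training and testing, with 75% used for training and 25% for validation. Figure 22 below shows the training results of the classification model: on the test set, accuracy is 94% and recall is 89%. A corresponding execution flow was orchestrated as well. Besides anomaly classification, its detection flow mainly includes missing-value imputation, pre-detection, feature extraction, classification judgment, off-peak judgment, and offset and fluctuation judgment. This execution flow applies to periodic and stable metrics. In addition, Horae provides flow tuning: each algorithm in a detection flow can expose its hyperparameters, and for a specific metric, a better set of hyperparameters for the flow can be obtained by training on that metric's sample data, which improves the precision and recall of anomaly detection for that metric."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/88\/8804541e376df3c17aa4fcd4bc686004.png","alt":"Image","title":"Figure 22 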
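Training results of the anomaly classification model","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"After this anomaly detection flow was applied to production metrics, representative detection results are shown in Figure 23 below. For periodic metrics, the flow finds anomalies promptly and accurately and reports the anomalous points, with accuracy above 90%. The flow was also compared with deformation-analysis anomaly detection: of the 4 production cases that deformation analysis failed to detect, the periodic-metric detection flow detected 3, outperforming deformation analysis."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/38\/38a683d57a8e2ff7897cd94a89002091.png","alt":"Image","title":"Figure 23 Production detection results of the periodic-metric anomaly detection model","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"4. Summary and Outlook"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Time-series anomaly detection is the core of the fault discovery stage in AIOps. Through exploration and practice, we have made progress on anomaly detection for periodic metrics and have put it into production in the Horae time-series anomaly detection system. Going forward, we will add automatic anomaly detection for stationary metrics and irregularly fluctuating metrics, add metric forecasting capabilities, and improve detection performance, with the goal of automatic anomaly detection for massive numbers of metrics of every type."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In addition, for alert delivery, we are exploring rules and algorithms for alert convergence, noise reduction, and suppression, aiming to deliver concise, useful information and to reduce alert storms and noise. For fault localization, rule-based localization already works fairly well; next we will cover more localization scenarios and explore correlation analysis, root cause analysis, and knowledge graphs, using algorithms and rules to improve the precision and recall of fault localization. Due to space limits, alert delivery (the Alert Center) and fault localization (Radar) will be covered in detail in follow-up articles."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"5. References"}]},{"type":"paragraph","attrs":{"indent":0,"n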
umber":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"[1] Zhou Zhihua (周志華). Machine Learning: Development and Future (機器學習: 發展與未來) [R]. Shenzhen, 2016."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"[2] Meituan real-time monitoring system CAT [EB\/OL]. https:\/\/tech.meituan.com\/CAT_in_Depth_Java_Application_Monitoring.html, 2018-11-01."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"[3] Meituan system metric monitoring Mt-Falcon [EB\/OL]. https:\/\/tech.meituan.com\/Mt-Falcon_Monitoring_System.html, 2017-02-24."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"[4] Ding R, Wang Q, Dang Y, et al. Yading: fast clustering of large-scale time series data[J]. Proceedings of the VLDB Endowment, 2015, 8(5): 473-484."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"[5] Paparrizos J, Gravano L. k-shape: Efficient and accurate clustering of time series[C]. Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. ACM, 2015: 1855-1870."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"[6] H. Ren, Q. Zhang, B. Xu, Y. Wang, C. Yi, C. Huang, X. Kou, T. Xing, M. Yang, and J. Tong, “Time-series anomaly detection service at Microsoft,” in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 3009–3017, ACM, Jun. 2019."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"[7] Tom Brander. Time series classification with Tensorflow[EB\/OL]. 
https:\/\/burakhimmetoglu.com\/2017\/08\/22\/time-series-classification-with-tensorflow, 2017-08-22."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"[8] Liu D, Zhao Y, Xu H, et al. Opprentice: Towards practical and automatic anomaly detection through machine learning[C]\/\/Proceedings of the 2015 Internet Measurement Conference. ACM, 2015: 211-224."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"[9] Metis is a learnware platform in the field of AIOps[EB\/OL]. https:\/\/github.com\/Tencent\/Metis, 2018-10-12."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"[10] Li Hang (李航). Statistical Learning Methods (統計學習方法) [M]. 2nd ed. Beijing: Tsinghua University Press, 2019: 28-29."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"[11] A tool to help label anomalies on time-series data[EB\/OL]. 
https:\/\/github.com\/baidu\/Curve, 2018-08-07."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"6. About the Authors"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Hu Yuan (胡原), Jindong (錦冬), and Junfeng (俊峯) are from the Service Operations Department of the Basic Technology Division; Changwei (長偉) and Yongqiang (永強) are from the Trading System Platform Department of the Home Delivery Business Group."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"horizontalrule"},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Header image: Unsplash"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Authors: Hu Yuan (胡原), Jindong (錦冬), Junfeng (俊峯), et al."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Original link: https:\/\/mp.weixin.qq.com\/s\/AjE7uP7ApVPyL_HdQDkk5g"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Original title: AIOps Exploration and Practice at Meituan: Fault Discovery"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Source: Meituan Tech Team, WeChat official account [ID: meituantech]"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Reprinting: Copyright belongs to the authors. For commercial reprints, contact the authors for authorization; for non-commercial reprints, credit the source."}]}]}