Building a Gradual Rollout System in a Cloud Native Environment

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/f2/f23e75e1749d47492381c8892e9af052.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Author | 墨封, Source | Alibaba Cloud Native official account (阿里巴巴雲原生公衆號)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A week ago we published ","attrs":{}},{"type":"link","attrs":{"href":"http://mp.weixin.qq.com/s?__biz=MzUzNzYxNjAzMg==&mid=2247504516&idx=1&sn=aa88827ebb8831a7db33ddc9774a7248&chksm=fae6d94bcd91505d46de40af4cf3984f9531ff9b491c6055d511297f473b41e3a59196c48f24&scene=21#wechat_redirect","title":"","type":null},"content":[{"type":"text","text":"“How to Find Problems Before Users Do in Large-Scale K8s Clusters”","attrs":{}}]},{"type":"text","text":".","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In this article we continue with how ASI SRE (ASI, Alibaba Serverless infrastructure, is the unified infrastructure Alibaba designed for cloud native applications) explored building gradual rollout capabilities for changes to ASI's own infrastructure on Kubernetes at large cluster scale.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"What We Were Facing","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"ASI 
was born as Alibaba Group moved entirely to the cloud. While it carries the cloud native migration of a large share of the group's infrastructure, its own architecture and form keep evolving.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"ASI mainly uses a Kube-on-Kube architecture. At the bottom it maintains a core Kubernetes meta cluster, in which the master control-plane components of each tenant cluster are deployed: apiserver, controller-manager, scheduler, and etcd. Each business cluster in turn runs addon components such as controllers and webhooks, which together support ASI's capabilities. On the data plane, some ASI components are deployed on nodes as DaemonSets, while others are deployed as RPM packages.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/78/78aa98fd07f68ebfef2b6b3a8c78e472.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"ASI also carries hundreds of clusters and several hundred thousand nodes across the group and sales region scenarios. Even in ASI's early days it already managed tens of thousands of nodes. While ASI's architecture evolved quickly, component and online changes were frequent; early on, ASI saw several hundred component changes in a single day. A faulty change to any one of ASI's core base components, such as the CNI plugin, CSI plugin, etcd, or Pouch, could cause a cluster-wide failure and irrecoverable losses for the businesses running on top.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/51/518732380e88c995c9da1caa7dd82025.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In short, large clusters, many components, frequent changes, and complex business forms are the main challenges in building gradual rollout capabilities and a change system for ASI or any other Kubernetes infrastructure layer. Inside Alibaba at the time, ASI/Sigma 
already had several change systems, but each had its limitations.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"天基 (Tianji): provides general-purpose node release capabilities, but does not include ASI metadata such as clusters and node sets.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"UCP: the release platform of the early sigma 2.0, long unmaintained.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"sigma-deploy: the release platform for sigma 3.x, which updates deployments/daemonsets by patching the image.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"asi-deploy: the early ASI release platform. It manages ASI's own components, supports only image patches, is adapted only to the Aone CI/CD pipeline, and supports gradual rollout across multiple environments, but only at a coarse granularity.","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"We therefore wanted to learn from the history of the earlier sigma/ASI release platforms: start from the moment of change, rely primarily on system capabilities supplemented by process rules, and step by step build a gradual rollout system for ASI and an operations change platform for the Kubernetes stack that keeps thousands of large-scale clusters stable.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"Assumptions and Approach","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"How ASI's architecture and form evolve strongly affects how its gradual rollout system should be built, so early in ASI's development we made several bold assumptions about how ASI 
would look in the future:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":1,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"ACK as the foundation","attrs":{}},{"type":"text","text":": ACK (Alibaba Cloud Container Service) provides a range of cloud capabilities. ASI will reuse them and feed the experience accumulated inside Alibaba Group back into the cloud.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Large clusters","attrs":{}},{"type":"text","text":": to improve resource utilization, ASI will run as large clusters, each offering a shared resource pool that hosts multiple second-party tenants.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Many clusters","attrs":{}},{"type":"text","text":": ASI divides clusters not only by Region but also into separate clusters along dimensions such as business owner.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":4,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Many addons","attrs":{}},{"type":"text","text":": Kubernetes is an open architecture that spawns a great many operators, and these operators work together with ASI's core components to provide capabilities to users.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":5,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Complex change scenarios","attrs":{}},{"type":"text","text":": ASI component changes will not be limited to image releases; Kubernetes 
declarative object lifecycle management makes complex change scenarios unavoidable.","attrs":{}}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"From these assumptions we distilled the problems that most urgently needed solving in ASI's early days:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":1,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"How do we build gradual rollout capabilities for changes within a single large cluster?","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"How do we build gradual rollout capabilities for changes at scale across many clusters?","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"With so many components of so many kinds, how do we manage them and make sure that no component release affects the production environment?","attrs":{}}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/81/8175c7158ff722f910adafe269566359.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Let us change perspective: step away from the cluster dimension and tackle change complexity from the component's point of view. A component's lifecycle can be roughly divided into a requirements and design phase, a development phase, and a release phase. We want to standardize each phase and, taking Kubernetes' own characteristics into account, encode the fixed rules into the system so that system capabilities guarantee the rollout process.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Given how ASI 
is structured and how particular its change scenarios are, we built ASI's gradual rollout system systematically along the following lines:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Requirements and design phase","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":1,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Solution TechReview","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":1,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Joint review of component launch changes","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Component development phase","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":1,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Standardized component development process","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Component release and change phase","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":1,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Provide a component workbench to manage components at scale","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":1,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Build ASI metadata and refine rollout units","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":1,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Build ASI 
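single-cluster and cross-cluster gradual rollout capabilities","attrs":{}}]}]}],"attrs":{}},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"Building the Gradual Rollout System","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"1. Standardizing the Development Process","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The development process for ASI core components can be summarized in the following steps:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For ASI's own core components, we worked with the quality engineering team to build an e2e testing process for ASI components. Beyond each component's unit and integration tests, we set up dedicated e2e clusters used for routine functional verification and e2e testing of ASI as a whole.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/b1/b1060f03bc3db448678380a27149a768.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"From the perspective of a single component: once a new feature has been developed, passed Code Review, and been merged into the develop branch, the e2e flow is triggered immediately. The chorus system (a cloud native testing platform) builds the image, and ASIOps (the ASI operations and control platform) deploys it to the corresponding e2e cluster and runs the standard Kubernetes Conformance test suite to verify that functionality within the scope of Kubernetes works. Only when every test case passes can that component version be marked as eligible for full rollout; otherwise subsequent releases are restricted by the platform.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As mentioned above, however, Kubernetes' open architecture means that it is more than core components such as control and scheduling; a cluster's functionality also depends heavily on the operators layered on top. White-box tests within the scope of Kubernetes therefore cannot cover every ASI use case. A functional change in a low-level component is quite likely to affect the operators above it, so on top of white-box Conformance we added black-box test cases. These verify the operators' own functionality, such as scaling initiated from the upper-layer PaaS and quota 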
validation on the release path, and they run continuously in the clusters.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/fd/fd83571dcdcd9f5337e8d8b8a2fbe6a1.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"2. Managing Components at Scale","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Because ASI has many components and many clusters, we extended the original asi-deploy, took the component as the entry point, and strengthened component management across clusters, evolving from ","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"image management","attrs":{}},{"type":"text","text":" to ","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"YAML management","attrs":{}},{"type":"text","text":".","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/38/38695832916196624cb00112ddcaf3b0.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Building on Helm Template, we split a component's YAML into three parts, template, image, and configuration, which carry the following information:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Template: information in the YAML that stays the same in every environment, such as apiVersion, kind, 
and so on;","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Image: information in the YAML related to the component image, which is expected to stay consistent within a single environment or across all clusters;","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Configuration: information in the YAML bound to a single environment or a single cluster. It is allowed to vary, and different clusters may have different configurations.","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A complete YAML is therefore rendered from the template, image, and configuration together. ASIOps then manages the image and configuration parts of the YAML along both the cluster dimension and the time dimension (multiple versions), computing how a component's current versions are distributed across the many clusters and how consistent its version is within a single cluster.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For image versions, the system pushes toward a unified version so that an outdated version does not cause production problems. For configuration versions, we reduce complexity through management so that misconfigurations are not shipped into clusters.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/a7/a7e020659fa91d8e7cf18473efec6229.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"With this basic component model in place, we wanted a release to be more than simply “replacing the image field in a workload”. We now maintain the entire YAML, including configuration other than the image, so changes other than image updates must be supported. We therefore try to stay as close as possible to kubectl apply when performing YAML 
delivery.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"We record three pieces of YAML specification:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Cluster Spec: the current state of the given resource in the cluster;","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Target Spec: the YAML that is about to be released into the cluster;","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"DB Spec: the YAML from the last successful deployment, serving the same purpose as the last-applied-configuration that kubectl apply stores in an annotation.","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/d6/d65086be4d6ac7db54b7b837d13d0882.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For a YAML built from image, configuration, and template, we collect the three Specs above and diff them to obtain a resource diff patch. We then filter out dangerous fields that must not be changed, and finally send the whole patch to the APIServer as a strategic merge patch or a merge patch. This makes the workload re-enter the reconcile process, changing that workload 
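as it actually runs in the cluster.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In addition, ASI components are strongly interrelated, and many scenarios require releasing several components at once, for example when initializing a cluster or performing a full release on a cluster. On top of single-component deployment we therefore added the concept of an Addon Release: a set of components that represents a release version of ASI as a whole. A deployment flow is generated automatically from each component's dependencies, ensuring that no circular dependency arises during the overall release.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/a5/a52b2feeb8a447d97dfa156e3a11c5d2.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"3. Building Single-Cluster Gradual Rollout Capabilities","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In a cloud native environment we describe how an application is deployed in terms of its desired state. Kubernetes provides the ability to maintain the desired state of all kinds of Workloads: an Operator compares a workload's current state with its desired state and reconciles the difference. This reconciliation, in other words the process of releasing or rolling back a workload, is a “procedural flow inside a desired-state-oriented scenario”, and it can be handled by the release strategy that the Operator defines.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Compared with Kubernetes 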
application workloads in the upper layer, low-level infrastructure components care more during a release about their own gradual rollout strategy and the ability to pause a rollout: ","attrs":{}},{"type":"text","marks":[{"type":"italic","attrs":{}}],"text":"whatever its type, a component must be able to stop a release promptly while it is in progress, leaving more time for functional checks, decisions, and rollback","attrs":{}},{"type":"text","text":". Concretely, these capabilities fall into the following categories:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"updateStrategy: streaming upgrade / rolling upgrade","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"pause/resume: the ability to pause and resume","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"maxUnavailable: the upgrade stops quickly once the number of unavailable replicas reaches a threshold","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"partition: the ability to hold an upgrade, upgrading only a fixed number of replicas at a time and keeping a given number of old-version replicas","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"ASI enhances both the native Kubernetes workload capabilities and node capabilities. Relying on in-cluster operators such as Kruise and KubeNode, working together with the upper-layer control platform ASIOps, we support the rollout capabilities above for Kubernetes infrastructure components. For Deployment / StatefulSet / DaemonSet / Dataplane 
components, the capabilities supported when releasing within a single cluster are as follows:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/c4/c4f49dcd764ada602f41b1767245a341.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The rest of this section briefly describes how we implement gradual rollout for components of different Workload types. For implementation details, see our open source project OpenKruise, and KubeNode, which we plan to open source later.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"1) Operator Platform","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Most Kubernetes operators are deployed as a Deployment or StatefulSet. During an Operator release, once the image field changes, every Operator replica is upgraded. If the new version has a problem, the damage can be irreversible.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For such operators, we separate controller-runtime from the operator and build a centralized component, operator-manager (controller-mesh in the OpenKruise open source implementation). Each operator pod also gets an operator-runtime sidecar container, which uses a gRPC interface to provide the component's main container with the core operator 
capabilities.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/0a/0a2fb980ef947bcc98ef7639fe76a972.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"After an operator establishes a Watch connection to the APIServer, the events it observes are turned into a stream of tasks for the operator to reconcile (the operator's traffic). operator-manager centrally controls the traffic of all operators, shards it according to rules, and dispatches it to the different operator-runtime instances; the workqueue in each runtime then triggers the actual operator's reconcile tasks.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"During a gradual rollout, operator-manager can split operator traffic between two replicas running the new and old versions, either at namespace level or by hash sharding, so that the workloads handled by the two replicas can be compared to verify whether the rollout has problems.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"2) Advanced DaemonSet","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The community's native DaemonSet supports RollingUpdate, but its rolling upgrade offers only maxUnavailable as a control. For ASI, where a single cluster has thousands or even tens of thousands of nodes, this is unacceptable: once the image is updated, all DaemonSet Pods are upgraded, the upgrade cannot be paused, and maxUnavailable is the only protection. If a DaemonSet ships a buggy version whose process still starts normally, maxUnavailable does not help either.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The community also provides the OnDelete strategy, in which a new Pod is created after the old Pod is deleted manually, leaving the central release platform to control release order and rollout. This mode cannot form a closed loop within a single cluster, and all the pressure moves up to the release platform. Having the upper-layer release platform evict Pods is fairly risky. The best approach is for the Workload itself to provide component updates in a closed loop. We therefore enhanced DaemonSet 
in Kruise so that it supports the key rollout capabilities listed above.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Here is a basic example of a Kruise Advanced DaemonSet:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"text"},"content":[{"type":"text","text":"apiVersion: apps.kruise.io/v1alpha1\nkind: DaemonSet\nspec:\n  # ...\n  updateStrategy:\n    type: RollingUpdate\n    rollingUpdate:\n      maxUnavailable: 5\n      partition: 100\n      paused: false\n","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Here partition is the number of Pod replicas kept on the old image version. During a rolling upgrade, once the specified number of Pods has been upgraded, no further Pods have their image upgraded. In the upper-layer ASIOps we control the value of partition to roll the DaemonSet forward, combine it with the other UpdateStrategy parameters to keep the rollout on track, and run targeted verification on the newly created Pods.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/79/79717dcd5de5cfc0097cae2e412d00be.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"3) MachineComponentSet","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"MachineComponentSet is a Workload in the KubeNode system. In ASI, node components that live outside Kubernetes (components that cannot be released with Kubernetes' own Workloads), such as Pouch, Containerd, and Kubelet, are all released through this Workload.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A node component is represented by the 
Kubernetes custom resource MachineComponent, which contains the install script, install-time environment variables, and other information for a specific version of a node component (for example pouch-1.0.0.81). A MachineComponentSet maps a node component to a set of nodes, indicating that this batch of machines needs that version of the component installed. The central Machine-Operator reconciles this mapping in a desired-state fashion: it compares the component version on each node with the target version and tries to install the specified version.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/78/789e02212ccfa9ef38d636ce7eb1b42a.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For gradual rollout, MachineComponentSet is designed much like Advanced DaemonSet and offers RollingUpdate features including partition and maxUnavailable. Here is an example MachineComponentSet:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"text"},"content":[{"type":"text","text":"apiVersion: kubenode.alibabacloud.com/v1\nkind: MachineComponentSet\nmetadata:\n  labels:\n    alibabacloud.com/akubelet-component-version: 1.18.6.238-20201116190105-cluster-202011241059-d380368.conf\n    component: akubelet\n  name: akubelet-machine-component-set\nspec:\n  componentName: akubelet\n  selector: {}\n  updateStrategy:\n    maxUnavailable: 20%\n    partition: 55\n    pause: false\n","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Likewise, when the upper-layer ASIOps drives a gradual upgrade of node components, it interacts with the cluster-side Machine-Operator and modifies partition 
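and other fields of the target MachineComponentSet to perform the rolling upgrade.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Compared with the traditional way of releasing node components, the KubeNode system also closes the loop on the node component lifecycle inside the Kubernetes cluster and pushes rollout control down to the cluster side, easing the central side's burden of managing node metadata.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"4. Building Cross-Cluster Gradual Rollout Capabilities","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Inside Alibaba, Change Red Line 3.0 was defined for cloud products and base products. It sets requirements for changes to control-plane and data-plane components: batched gradual rollout, controlled intervals, observability, the ability to pause, and the ability to roll back. But rolling out with the region as the unit does not fit ASI's complex scenarios, so we tried to refine the types of change unit to which ASI control-plane and data-plane changes belong.","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Taking the cluster as the base unit, we abstracted both upward and downward and arrived at the following basic units:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Cluster group","attrs":{}},{"type":"text","text":": clusters that share the same business owner (a second-party user served by ASI), network domain (sales region/OXS/group), and environment (e2e/test/pre-release/canary/small-traffic/production), and therefore share configuration for monitoring, alerting, inspection, release, and so on.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Cluster","attrs":{}},{"type":"text","text":": the ASI notion of a cluster, corresponding to one Kubernetes 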
cluster plan.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Node set","attrs":{}},{"type":"text","text":": a set of nodes with common characteristics, including information such as resource pool and sub-business pool.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Namespace","attrs":{}},{"type":"text","text":": a single Namespace in a single cluster. In ASI, one upper-layer business usually corresponds to one Namespace.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Node","attrs":{}},{"type":"text","text":": a single host node, corresponding to one Kubernetes Node.","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/ba/ba59c2b889e61efad07e3f5e40459b78.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For each release mode (control-plane components, node components), we follow the principle of minimal blast radius and chain the corresponding rollout units together through orchestration, so that the rollout process is fixed in the system. Component developers must follow this process during a release and deploy unit by unit. In the orchestration we mainly consider the following factors:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Business attributes","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":
{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"環境(測試、預發、小流量、生產)","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"網絡域(集團 V、售賣區、OXS)","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"集羣規模(Pod/Node 數)","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"用戶屬性(承載用戶的 GC 等級)","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"單元/中心","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"組件特性","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"同時我們對每個單元進行權重打分,並對單元間的依賴關係進行編排。例如以下是一條 ASI 監控組件的發佈流水線,由於該監控組件在所有 ASI 場景都會使用同一套方案,它將推平至所有 ASI 集羣。並且在推平過程中,它首先會經過泛電商交易集羣的驗證,再進行集團 VPC 內二方的發佈,最後進行售賣區集羣的發佈。而在每個集羣中,該組件則會按照上一節中我們討論的單集羣內的灰度方式進行 1/5/10 
批次的分批,逐批進行發佈。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/2d/2d0c030630c8269c7c9675fe9874274f.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"進行了灰度單元編排之後,我們則可以獲得到一次組件推平流水線的基礎骨架。而對於骨架上的每個灰度單元,我們嘗試去豐富它的前置檢查和後置校驗,從而能夠在每次發佈後確認灰度的成功性,並進行有效的變更阻斷。同時對於單個批次我們設置一定的靜默期去使得後置校驗能夠有足夠的時間運行完,並且提供給組件開發足夠的時間進行驗證。目前單批次前置後置校驗內容包括:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"全局風險規則(封網、熔斷等)","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"發佈時間窗口(ASI 試行週末禁止發佈的規則)","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"KubeProbe 集羣黑盒探測","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"金絲雀任務(由諾曼底發起的 ASI 
全鏈路的擴縮容任務)","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"核心監控指標大盤","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"組件日誌(組件 panic 告警等)","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"主動診斷任務(主動查詢對應的監控信息是否在發佈過程中有大幅變化)","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/47/47064b7dcdfbdd924004882240e65677.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"將整個多集羣發佈的流程串聯在一起,我們可以得到一個組件從研發,測試至上線發佈,整個流程經歷的事件如下圖:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/02/020ea355542ed94a65270d451351c139.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在流水線編排的實現方面,我們對社區已有的 tekton 和 argo 進行了選型調研,但考慮到我們在發佈流程中較多的邏輯不適合單獨放在容器中執行,同時我們在發佈過程中的需求不僅僅是 CI/CD,以及在設計初期這兩個項目在社區中並不穩定。因而我們參考了 tekton 的基礎設計(task / taskrun / pipeline / 
pipelinerun)進行了實現,並且保持着和社區共同的設計方向,在未來會調整與社區更接近,更雲原生的方式。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"成果","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"經過近一年半的建設,ASIOps 目前承載了近百個管控集羣,近千個業務集羣(包括 ASI 集羣、Virtual Cluster 多租虛擬集羣,Sigma 2.0 虛擬集羣等),400 多個組件(包括 ASI 核心組件、二方組件等)。同時 ASIOps 上包含了近 30 餘條推平流水線,適用於 ASI 自身以及 ASI 承載的業務方的不同發佈場景。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"同時每天有近 400 次的組件變更(包括鏡像變更和配置變更),通過流水線推平的此時達 7900+。同時爲了提高發布效率,我們在前後置檢查完善的條件下開啓了單集羣內自動灰度的能力,目前該能力被大多數 ASI 數據面的組件所使用。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如下是一個組件通過 ASIOps 進行版本推平的示例:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/f8/f8560e85910d65deb99cca9f57d76aa8.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"同時我們在 ASIOps 上的分批灰度以及後置檢查變更阻斷,也幫助我們攔住了一定由於組件變更引起的故障。例如 Pouch 
組件在進行灰度時,由於版本不兼容導致了集羣不可用,通過發佈後觸發的後置巡檢發現了這一現象,並阻斷了灰度進程。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/f7/f75dc298836985b2eb0de399ba3c8066.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"ASIOps 上的組件大多數都是 ASI/Kubernetes 底層的基礎設施組件,近一年半以來沒有因爲由組件變更所引起的故障。我們努力將指定的規範通過系統能力固化下來,以減少和杜絕違反變更紅線的變更,從而將故障的發生逐步右移,從變更引發的低級故障逐步轉變至代碼 Bug 自身引起的複雜故障。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"展望","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"隨着 ASI 的覆蓋的場景逐步擴大,ASIOps 作爲其中的管控平臺需要迎接更復雜的場景,規模更大的集羣數、組件數的挑戰。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"首先我們亟待解決穩定性和效率這一權衡問題,當 ASIOps 納管的集羣數量到達一定量級後,進行一次組件推平的耗時將相當大。我們希望在建設了足夠的前後置校驗能力後,提供變更全託管的能力,由平臺自動進行發佈範圍內的組件推平,並執行有效的變更阻斷,在 Kubernetes 基礎設施這一層真正做到 CI/CD 自動化。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"同時目前我們需要手動對灰度單元進行編排,確定灰度順序,在未來我們希望建設完全整個 ASI 的元數據,並自動對每次發佈範圍內的所有單元進行過濾、打分和編排。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"最後,目前 ASIOps 暫時只做到針對組件相關的變更進行灰度的能力,而 ASI 
範圍內的變更遠不止組件這一點。灰度體系應該是一個通用的範疇,灰度流水線需要被賦能到注入資源運維、預案執行的其他的場景中。此外,整個管控平臺的灰度能力沒有與阿里巴巴有任何緊耦合,完全基於 Kruise / KubeNode 等 Workload 進行打造,未來我們會探索開源整套能力輸出到社區中。","attrs":{}}]}]}
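The orchestration described in this article (score each gray unit, order the units by dependency, then release each one in batches guarded by pre- and post-checks) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration, not the ASIOps implementation: the names GrayUnit, order_units and run_pipeline, the weight field, the example unit names, and the reading of 1/5/10 as batch sizes.

```python
import time
from dataclasses import dataclass, field
from graphlib import TopologicalSorter
from typing import Callable

@dataclass
class GrayUnit:
    name: str
    weight: int  # hypothetical risk score: lower-weight units are released first
    depends_on: list[str] = field(default_factory=list)

# A check takes a unit name and returns True when it passes (hypothetical signature).
Check = Callable[[str], bool]

def order_units(units: list[GrayUnit]) -> list[GrayUnit]:
    """Order units so dependencies come first; among ready units, lower weight first."""
    by_name = {u.name: u for u in units}
    sorter = TopologicalSorter({u.name: u.depends_on for u in units})
    sorter.prepare()
    ordered = []
    while sorter.is_active():
        for name in sorted(sorter.get_ready(), key=lambda n: by_name[n].weight):
            ordered.append(by_name[name])
            sorter.done(name)
    return ordered

def run_pipeline(units, pre_checks: list[Check], post_checks: list[Check],
                 deploy: Callable[[str, int], None],
                 batches=(1, 5, 10), silent_seconds: float = 0.0):
    """Release unit by unit and stop at the first failed check.

    Returns (names of fully released units, status message).
    """
    released = []
    for unit in order_units(units):
        if not all(check(unit.name) for check in pre_checks):
            return released, f"blocked before {unit.name}"
        for batch in batches:
            deploy(unit.name, batch)
            time.sleep(silent_seconds)  # silent period before post-checks run
            if not all(check(unit.name) for check in post_checks):
                return released, f"blocked at {unit.name} batch {batch}"
        released.append(unit.name)
    return released, "completed"

# Usage: three hypothetical units; the post-check fails on the second one.
units = [
    GrayUnit("sales-region", weight=30, depends_on=["group-vpc"]),
    GrayUnit("ecommerce-trade", weight=10),
    GrayUnit("group-vpc", weight=20, depends_on=["ecommerce-trade"]),
]
deployed = []
result = run_pipeline(
    units,
    pre_checks=[lambda name: True],
    post_checks=[lambda name: name != "group-vpc"],  # simulated failing post-check
    deploy=lambda name, batch: deployed.append((name, batch)),
)
print(result)
```

In this run the first unit is released in full, and the rollout stops right after the first batch of the second unit, so the third unit is never touched. That is the change-blocking behavior the pre- and post-checks are meant to provide.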