Service Mesh Traffic Governance Behind the Douyin Spring Festival Gala Campaign

{"type":"doc","content":[{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Background and Challenges"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The 2021 CCTV Spring Festival Gala red-envelope project left the business engineers very little time: they had to develop, test, and ship all the related code within a tight window."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The project spanned several engineering teams and therefore a large number of microservices. These services are written in different language stacks, including Go, C++, Java, Python, and Node, and run in very heterogeneous environments such as containers, virtual machines, and physical machines. At different stages of the Douyin Spring Festival Gala campaign, they may also need different traffic governance policies to stay stable."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The infrastructure team therefore had to provide unified traffic governance for microservices that come from different teams and are written in different languages."}]},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"How a Traditional Microservice Architecture Copes"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Let us first look at how a traditional microservice architecture addresses these problems. As organizations grow, product business logic becomes more complex, and to iterate faster the backend architecture of internet software has gradually evolved from large monolithic services into distributed microservices. Compared with a monolith, a distributed architecture is weaker in stability and observability."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To make up for this, the microservice framework has to implement many capabilities. For example:"}]},
{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Microservices call one another to deliver what the monolith used to do on its own, which involves "},{"type":"text","marks":[{"type":"strong"}],"text":"network communication"},{"type":"text","text":", along with the "},{"type":"text","marks":[{"type":"strong"}],"text":"request serialization and response deserialization"},{"type":"text","text":" that comes with it."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Calls between services require "},{"type":"text","marks":[{"type":"strong"}],"text":"service discovery"},{"type":"text","text":"."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A distributed architecture may need different "},{"type":"text","marks":[{"type":"strong"}],"text":"traffic governance policies"},{"type":"text","text":" to keep calls between services stable."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A microservice architecture also needs better observability, including logging, monitoring, and tracing."}]}]}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"With these capabilities in place, a microservice architecture can solve some of the problems above. But the microservice framework approach has problems of its own:"}]},
{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Implementing many capabilities in a microservice framework for every language carries very high "},{"type":"text","marks":[{"type":"strong"}],"text":"development and operations costs"},{"type":"text","text":";"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Shipping a new feature or rolling back a version of the framework requires the business engineers to make changes and redeploy, so "},{"type":"text","marks":[{"type":"strong"}],"text":"framework versions stay fragmented and out of control for long periods"},{"type":"text","text":"."}]}]}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"How do we solve these problems? There is a saying in software engineering: any problem can be solved by adding another middle layer. For the problems above the industry already has an answer, and that middle layer is the Service Mesh."}]},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"In-house Service Mesh Implementation"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Next, here is how Volcano Engine's in-house Service Mesh is implemented. Start with the architecture diagram below."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/c0\/c0f9297a04e8d7482c5b8510ba98ed48.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The blue Proxy rectangles in the diagram are the Service Mesh data plane. The data plane is a separate process deployed in the same runtime environment (the same container or the same machine) as the Service process that runs the business logic. This Proxy process proxies all traffic flowing through the Service process, so the service discovery, traffic governance policies, and other capabilities that would otherwise live in the microservice framework can all be handled by the data-plane process."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The green rectangle is the Service Mesh control plane. It decides which traffic routing and governance policies are applied. It is a remotely deployed service that pushes traffic governance rules down to the data-plane processes, which then enforce them."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The data plane and control plane are also independent of the business code, so they can be released and upgraded on their own schedule without involving the business engineers."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This architecture solves the problems mentioned earlier:"}]},
{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"We no longer need to implement the framework's many capabilities once per language; implementing them in the Service Mesh data-plane process is enough;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The data-plane process also hides the complexity of the various runtime environments, so the Service process only needs to talk to the data-plane process;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Flexible, changing traffic governance policies can all be customized through the Service Mesh control-plane service."}]}]}]},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Service Mesh Traffic Governance Techniques"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This section covers the traffic governance techniques our Service Mesh provides to keep microservices reasonably stable under the traffic peaks of the Douyin Spring Festival Gala campaign."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"First, the core aspects of traffic governance:"}]},
{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Routing"},{"type":"text","text":": traffic leaves one microservice instance and reaches the next via service discovery or a set of rules. Many traffic governance capabilities derive from this step."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Security"},{"type":"text","text":": as traffic moves between microservices, authentication, authorization, and encryption ensure that its content is secure, authentic, and trustworthy."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Control"},{"type":"text","text":": governance policies are adjusted dynamically for different scenarios to keep microservices stable."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[
{"type":"text","marks":[{"type":"strong"}],"text":"Observability"},{"type":"text","text":": this one matters a great deal. Traffic state has to be recorded and traced, and combined with an alerting system so that problems are found and resolved promptly."}]}]}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/12\/12d1fad56f847bdf9bd558f8c7224abf.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Combined with concrete traffic governance policies, these four core aspects improve microservice stability, keep traffic content secure, raise the productivity of business engineers, and strengthen overall disaster recovery when a black swan event hits."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The following sections look at the specific traffic governance policies the Service Mesh provides to keep microservices stable."}]},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Stability Policy: Circuit Breaking"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"First, circuit breaking. In a microservice architecture, single-node failures are routine. The problem circuit breaking addresses is how to protect the overall success rate when one occurs."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Circuit breaking works from the client's point of view: it records the success rate of requests sent from the service to each downstream node. When the success rate for a node falls below a threshold, that node is tripped so that requests no longer reach the faulty node."}]},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/2d\/2dcb5f7da46fbfe3191a29cd46a2c70c.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"When a faulty node recovers, we also need a strategy for restoring it after the trip. For example, we can send a small amount of probe traffic to the node within a time window: if the node still cannot serve, it stays tripped; if it can, traffic is increased gradually until it is back to normal. Circuit breaking lets a microservice architecture tolerate a few unavailable nodes and prevents the avalanche effect that further deterioration would cause."}]},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Stability Policy: Rate Limiting"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Another governance policy is rate limiting. It rests on one fact: when a Server is overloaded, the success rate of its request handling drops. For example, a Server node that normally handles 2000 QPS may, once overloaded (say at 3000 QPS), only manage 1000 QPS or fewer. Rate limiting proactively drops some traffic so that the Server itself never becomes overloaded, which prevents an avalanche."}]},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Stability Policy: Degradation"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"When a Server node becomes further overloaded, a degradation policy is needed. Degradation generally takes two forms:"}]},
{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Dropping traffic by ratio. For example, a fixed share (20% or more) of the traffic from service A to service B can be dropped."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Degrading a bypass (non-critical) dependency. Suppose service A depends on three services B, C, and D, and D is a bypass dependency. Traffic to D can be cut off so that the freed resources go to the core path, preventing further overload."}]}]}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/12\/12b61049b8deeb90a0244513039c59a8.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Stability Policy: Dynamic Overload Protection"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Circuit breaking, rate limiting, and degradation all respond after errors have occurred. The best policy is to head problems off in advance, which is what dynamic overload protection does."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A rate-limiting threshold is hard to pin down. It is usually obtained by load testing how many QPS a node can sustain, but that ceiling varies with the runtime environment and so differs from node to node. Dynamic overload protection rests on this fact: service nodes with identical resource specifications do not necessarily have identical processing capacity."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"How is dynamic overload protection implemented? It has three parts: overload detection, overload handling, and overload recovery. The key question is how to tell whether a Server node is overloaded."}]},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/39\/3975433e599b29a95e8e0c480c353add.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The Ingress Proxy in the figure above is the Service Mesh data-plane process; it proxies traffic and forwards it to the Server process. T3 in the figure is the time from the Proxy receiving a request to the Server returning after it has finished processing. Can this time be used to detect overload? No, because the Server may depend on other nodes. If those nodes slow down, the Server's processing time grows too, so a long T3 does not show that the Server itself is overloaded."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"T2 in the figure is the interval between the data-plane process forwarding a request to the Server and the Server actually starting to process it. Can T2 reflect overload? Yes. Take an example: suppose the Server runs on an instance with 4 cores and 8 GB of memory, which means it can process at most 4 requests at the same time. If 100 requests hit that Server, the remaining 96 sit in a pending state. When the pending time grows too long, we can consider the Server overloaded."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"What should happen once overload is detected? There are many possible strategies. Ours is to proactively drop low-priority requests, based on request priority, to relieve the overloaded Server. Once some traffic has been dropped and the Server is back to normal, overload recovery follows so that QPS returns to its normal level."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Where does the dynamic part come in? Overload detection runs continuously on a fixed time period. In each period in which the Server is found to be overloaded, low-priority requests are gradually dropped at a certain ratio. In the next period, if the Server is found to have recovered, the drop ratio is gradually reduced so that the Server recovers step by step."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The effect of dynamic overload protection is very noticeable: it keeps a service from collapsing under heavy traffic, and the policy was widely applied to several large services in the Douyin Spring Festival Gala red-envelope project."}]},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Stability Policy: Load Balancing"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Next, load balancing. Suppose traffic from service A has to reach downstream service B, and A and B each have 10,000 nodes. How do we make sure the traffic from A is spread evenly across B? There are many approaches; the common ones are random, round robin, weighted random, and weighted round robin, and their names are largely self-explanatory."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Another common policy is consistent hashing. Hashing here means using some feature of a request to route it to the same downstream node every time, which establishes a mapping between requests and nodes. Consistent hashing is mainly used for cache-sensitive services: it greatly improves the cache hit rate, improves Server performance, and lowers the timeout error rate. When nodes join the service or become unavailable, the consistency property disturbs the established mappings as little as possible."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"There are many other load-balancing policies, but they are not widely used in production, so they are not covered here."}]},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Stability Policy: Node Sharding"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For ultra-high-traffic scenarios such as the Douyin Spring Festival Gala red envelopes, node sharding is another useful policy. It rests on this fact: microservices with many nodes have a very low reuse rate for long-lived connections. Microservices generally communicate over TCP, so a TCP connection must be established first and traffic then flows over it. We reuse one connection as much as possible to send requests and receive responses, avoiding the extra cost of frequently opening and closing connections."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"When the node count is very large, say Service A and Service B each have 10,000 nodes, they would have to maintain a huge number of long-lived connections. To avoid that, an idle timeout is usually configured: a connection that carries no traffic for a given interval is closed. With very large node counts, long-lived connections then degrade into short-lived ones, so that every request has to establish a connection before it can communicate. The consequences are:"}]},
{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Errors caused by connection timeouts."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Lower performance."}]}]}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Node sharding solves this problem, and we used it very widely in the Douyin Spring Festival Gala red-envelope scenario. The policy splits services with many nodes into shards and then establishes a mapping so that, as shown in the figure below, traffic from shard 0 of service A always reaches shard 0 of service B."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/77\/7792539cf75f38d53fb522e883a0f077.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This greatly improves the reuse rate of long-lived connections. The original 10000 × 10000 connection relationship becomes a much smaller, fixed-size one, for example 100 × 100. With node sharding we greatly improved long-lived connection reuse, reduced connection-timeout errors, and improved microservice performance."}]},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Efficiency Policies"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Rate limiting, circuit breaking, degradation, dynamic overload protection, and node sharding are all policies for improving microservice stability. There are also policies aimed at efficiency."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"We start with the concepts of swimlanes and traffic coloring."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/b9\/b9619cfb93b777d0b2320a4078d5150a.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A feature like the one in the figure above may involve six microservices: a, b, c, d, e, and f. Swimlanes isolate this traffic: each lane contains a complete set of these six microservices, which together can deliver the whole feature."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Traffic coloring means routing traffic to different swimlanes according to rules, which enables capabilities such as:"}]},
{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Feature debugging"},{"type":"text","text":": during online development and testing, an engineer's own requests can be routed to a personal swimlane for feature debugging."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Fault drills"},{"type":"text","text":": after the services for the Douyin Spring Festival Gala campaign were developed, drills were needed to prepare for different failures. Load-test traffic can be steered by rules into a fault-drill swimlane."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Traffic recording and replay"},{"type":"text","text":": traffic matching a rule is recorded and then replayed, mainly for bug debugging or for uncovering problems in certain black-market abuse scenarios."}]}]}]},
{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Security Policies"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Security policies are also an important part of traffic governance. We mainly provide three:"}]},
{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Authorization"},{"type":"text","text":": restricting which services may call a given service."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Authentication"},{"type":"text","text":": when a service receives traffic, verifying that the traffic source is genuine."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Mutual TLS (mTLS)"},{"type":"text","text":": mutual encryption is needed to keep traffic content from being snooped on, tampered with, or attacked."}]}]}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Together these policies provide reliable identity authentication and secure, encrypted transport, and they keep traffic content from being tampered with or attacked."}]},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Rollout in the Spring Festival Gala Red-Envelope Scenario"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The policies above greatly improve microservice stability and business development efficiency. Rolling this architecture out, however, brings challenges of its own, and the main one is performance. Adding a middle layer improves extensibility and flexibility, but it inevitably adds overhead, and that overhead is paid in performance. Without a Service Mesh, the main costs of a microservice framework are serialization and deserialization, network communication, service discovery, and traffic governance policies. With a Service Mesh, two more costs appear:"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Protocol parsing"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For traffic proxied by the data-plane process, the protocol has to be parsed to some degree to learn where the traffic comes from and where it is going. Protocol parsing itself is very expensive, so we add a header (a set of key-value pairs) that carries service metadata such as the traffic source. Routing then only requires parsing one or two hundred bytes."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Inter-process communication"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The data-plane process proxies the business process's traffic, which is usually done with iptables. That approach has very high overhead, so we use inter-process communication instead: the data plane and the microservice framework agree on a unix domain socket address or a local port, and traffic is intercepted through it. Although this performs better than iptables, it still carries some extra overhead of its own."}]},
{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/66\/660faae50a91c10d6d0bb14f7160da7a.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"How did we reduce the cost of inter-process communication? Traditional inter-process communication, such as a unix domain socket or a local port, copies the transferred content between user space and kernel space. Forwarding a request to the data-plane process copies it from user space into the kernel, and the data-plane process reading it copies it from the kernel back to user space, so a round trip involves as many as 4 memory copies."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Our solution uses "},{"type":"text","marks":[{"type":"strong"}],"text":"shared memory"},{"type":"text","text":". Shared memory is the highest-performance form of inter-process communication on Linux, but it has no notification mechanism: after a request is placed in shared memory, the other process does not know it is there. We therefore need an event notification mechanism to tell the data-plane process, and we implemented it over a unix domain socket. The result is lower memory-copy overhead. We also introduced a queue in shared memory that allows IO to be harvested in batches, which reduces system calls. The effect is significant: in some risk-control scenarios of the Douyin Spring Festival Gala campaign, performance can improve by 24%."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"With these optimizations done, there was much less resistance to rolling the mesh out."}]},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Summary"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This talk covered the traffic governance capabilities a Service Mesh provides to keep microservices stable and secure. There are three key points:"}]},
{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Stability"},{"type":"text","text":": facing instantaneous traffic peaks on the order of hundreds of millions of QPS, the traffic governance techniques of the Service Mesh keep microservices stable."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Security"},{"type":"text","text":": the security policies of the Service Mesh ensure that traffic between services is secure and trustworthy."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Efficiency"},{"type":"text","text":": the Spring Festival Gala campaign involved many microservices written in different programming languages, and the Service Mesh naturally gave them unified traffic governance, which improved developer productivity."}]}]}]},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Q&A"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Q"},{"type":"text","text":": Why does shared-memory IPC reduce system calls?"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"A"},{"type":"text","text":": After the client process places a request in shared memory, the Server process has to be notified to handle it, which requires a wake-up, and each wake-up is a system call. If the next request arrives while the Server has not yet been woken or is still processing requests, the wake-up does not need to be repeated. In request-intensive scenarios this avoids frequent wake-ups and so reduces the number of system calls."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Q"},{"type":"text","text":": Is the in-house Service Mesh built entirely from scratch or based on community projects such as Istio? If it is in-house, is it written in Go or Java? Is the data plane Envoy? Is traffic interception done with iptables?"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"A"},{"type":"text","text":":"}]},
{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"The data plane is a secondary development on top of Envoy and is written in C++."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"Traffic interception uses a uds or local port agreed with the microservice framework, not iptables."}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"The Ingress Proxy is deployed in the same runtime environment as the business process, and releasing or upgrading it does not require restarting the container."}]}]}]},
{"type":"paragraph","attrs":{"indent":0,"number":4,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":4,"align":null,"origin":null},"content":[{"type":"text","text":"Reprinted from: ByteDance Technical Team (字節跳動技術團隊, ID: toutiaotechblog)"}]},
{"type":"paragraph","attrs":{"indent":0,"number":4,"align":null,"origin":null},"content":[{"type":"text","text":"Original link: "},{"type":"link","attrs":{"href":"https:\/\/mp.weixin.qq.com\/s\/dA9bTQm0llLyy-AbE3gO-w","title":"xxx","type":null},"content":[{"type":"text","text":"Service Mesh Traffic Governance Behind the Douyin Spring Festival Gala Campaign"}]}]}]}
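As a supplement to the dynamic overload protection section above, here is a minimal sketch of the detect, drop, and recover loop it describes. It is not the authors' implementation: the class name `OverloadGuard`, the 20 ms queue-wait threshold, the 10-point step, and the 90% cap are all hypothetical values chosen only for illustration. The one idea taken from the article is that overload is judged from T2 (how long a request waits before the Server starts processing it) once per detection period, and that only low-priority requests are dropped.

```python
import random
from dataclasses import dataclass


@dataclass
class OverloadGuard:
    """Adjusts a drop percentage once per detection period from queue-wait (T2) samples."""

    wait_threshold_ms: float = 20.0  # hypothetical: mean T2 above this counts as overload
    step_percent: int = 10           # hypothetical: how far the drop ratio moves per period
    max_drop_percent: int = 90       # hypothetical cap, so some low-priority traffic still passes
    drop_percent: int = 0            # current share of low-priority requests being dropped

    def end_period(self, queue_waits_ms):
        """Call at the end of each period with the T2 samples observed in it."""
        overloaded = bool(queue_waits_ms) and (
            sum(queue_waits_ms) / len(queue_waits_ms) > self.wait_threshold_ms
        )
        if overloaded:
            # Overload detected: raise the drop ratio gradually.
            self.drop_percent = min(self.max_drop_percent, self.drop_percent + self.step_percent)
        else:
            # Server looks healthy: lower the drop ratio gradually (overload recovery).
            self.drop_percent = max(0, self.drop_percent - self.step_percent)
        return self.drop_percent

    def admit(self, high_priority, rng=random.random):
        """High-priority requests always pass; low-priority ones are dropped by ratio."""
        if high_priority:
            return True
        return rng() * 100 >= self.drop_percent
```

A proxy would call `admit` for every incoming request and `end_period` on a timer. Moving the ratio in small steps rather than jumping straight to a target mirrors the gradual drop and gradual recovery described in the article, and the `rng` parameter exists only so the behaviour can be tested deterministically.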