How to implement trillion-parameter model algorithms on MindSpore?

> Abstract: Recently, increasing model scale has become the main way to improve model performance. This is especially true of self-supervised pre-trained language models in NLP, which keep growing in size, from the 175 billion parameters of GPT-3 to the 1.6 trillion of Switch Transformer, another order-of-magnitude jump.

This article is shared from the Huawei Cloud community post ["One article to understand the key technologies behind the trillion-parameter models supported by MindSpore!"](https://bbs.huaweicloud.com/blogs/280735?utm_source=infoq&utm_medium=bbs-ex&utm_campaign=ei&utm_content=content), original author: HWCloudAI.

## Preface

While these order-of-magnitude increases in model scale have delivered real performance gains, and even some unexpected "magic" effects (as with GPT-3), the computational cost behind them has become the biggest problem: training GPT-3 used on the order of ten thousand GPUs and weeks of time. How to use ultra-large parameter counts to improve model expressiveness and performance while keeping the growth in computation small has become one of the main challenges. Dynamic neural network techniques, represented by MoE, have been brought in for exactly this reason: the brain is the canonical low-power, high-efficiency computing system, and sparse activation is its most important characteristic. Beyond the compute-efficiency challenge of training (and inference) for giant models, an even bigger challenge lies in the training optimization algorithm (not discussed here): backpropagation is the most practical optimizer for deep networks today, but a more ideal optimizer would be highly parallel, asymmetric in its optimization process, and able to achieve global optimization through continuous local optimization across space and time.

1. In a traditional neural network, during the forward pass, every sample in the input batch activates every parameter of the network in its computation.

2. Conditional computation, under the loosest definition, refers to a class of algorithms that activate only some of the different parts in a network. In a concrete implementation, the condition-selection pattern may activate different parts of the network independently per sample in the input batch; per spatial part of the input (e.g., different regions or channels of an image); per temporal part of the input (e.g., different sliding windows of a time series, or different frames of a video); per target task; or by a fixed, non-learnable random assignment of inputs to independently computed subnetworks.

3. For different inputs (raw, or from preceding layers), selectively execute the computation of subsequent parts of the network according to some condition. Under this umbrella there are several approximate or related techniques: dynamic neural network(s), conditional computing, conditional activation, sparse activating, selective execution, mixture of experts (MoE), dynamic routing, ...; strongly related models include Switch Transformer and others.

## Classification of conditional computation (broad sense)

1. By whether routing is learnable: learnable routing conditional computation vs. unlearnable routing conditional computation.
2. By whether non-activated computation is actually skipped: hard conditional computation vs. soft conditional computation. In hard-mode conditional computation, via tensor selection and splitting, data that should not be activated take no part at all in the computation of the inactive parts of the network, whatever the condition-selection pattern. In soft-mode conditional computation, the computational effect may be suppressed merely by zeroing the relevant data, while the nominally inactive parts of the network still actually execute.

## Main advantages of conditional computation

1. Compute-efficient, lower energy: only part of the network is activated and computed. Taking per-sample conditional activation as an example, a single sample passes through only part of the whole SuperNet.

2. Larger network, stronger expressiveness: with routing at one or more points, the input at each point (layer) is routed to different subnetworks computed independently, so the representations of different inputs at each layer are relatively independent of one another. Expressive capacity is stronger and the network can be made larger, though per-parameter expressive efficiency decreases.

## Network and computation forms of conditional computation

The network and computation forms of conditional computation are quite flexible. Some construction forms (specific models and paper citations omitted; see: intellabs.github.io/dis):

1. Following the characteristics of CV and similar tasks, use several independent CNNs as expert networks, route independently per task, and combine them at the tail into a large network.
2. Use more complex forms, such as cascading, to combine different expert networks at different levels.

3. Implement routing through data transformations such as decision trees.

4. Select routes with a learnable network. The loss for policy learning can be constructed in several ways: use the main loss of the classification (or other) task directly, build auxiliary losses on the importance and load of the different experts, and so on.

## Routing strategies for conditional computation

1. non-learnable/hard-mode: compute the route with some deterministic strategy, such as LSH.

2. learnable-mode: compute the route with a learnable network, which can be large or small. The simplest learnable router is a single layer of weights: G(x) = P(X*W), where G(x) is the routing gate function, X the input, W the learnable routing weights trained through the loss function, and P some selection function (such as topk or sort). In practical implementations, the result of X*W may also serve as part of the input information to the subsequent network, rather than being used only to select the route via G(x); in that case the result of X*W must be normalized, and the more typical form is G(x) = P(N(X*W)), where N is a normalization function such as Softmax.
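The gate form G(x) = P(N(X*W)) can be sketched in a few lines of NumPy. This is a minimal illustration, not MindSpore API: the function name `gate_forward`, the toy shapes, and the top-1 selection are assumptions for demonstration.

```python
import numpy as np

def gate_forward(x, w, k=1):
    # G(x) = P(N(x @ w)): N is softmax over experts, P keeps the top-k
    # expert indices and their normalized scores.
    logits = x @ w                                   # (batch, num_experts)
    z = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs = z / z.sum(axis=-1, keepdims=True)        # N: softmax normalization
    topk_idx = np.argsort(-probs, axis=-1)[:, :k]    # P: top-k selection
    topk_p = np.take_along_axis(probs, topk_idx, axis=-1)
    return topk_idx, topk_p

# 4 samples, 2 features, 3 experts; weights chosen so routing is obvious.
x = np.array([[1., 0.], [0., 1.], [3., 1.], [0., 2.]])
w = np.array([[1., 0., 0.],
              [0., 1., 0.]])
idx, p = gate_forward(x, w, k=1)   # samples routed to experts 0, 1, 0, 1
```

With k=1 this is the non-redundant case described below; setting k>=2 in the same P gives the redundant variant.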
## Redundancy strategies for conditional computation

The redundancy strategies of conditional computation fall into non-redundant and redundant conditional computation:

1. Non-redundant conditional computation can be realized through the implementation of P(·), e.g., topk with k=1.

2. Redundant conditional computation has several possible forms: through P(·) implemented as topk with k=n, n>=2, or through a hard redundancy mode in which the whole network supports replicating the input and computing it along multiple paths.

## Challenges of conditional computation

1. Impact of the routing algorithm on model quality

Whether the product of the input and the routing weights (X*W) is used only for route selection (and then fed to the chosen network unit), or directly forms part of the input to the subsequent network unit, the routing algorithm decides where input information flows, and has a large impact on the overall quality of the model.
2. Stability of the routing/gate

The randomly initialized routing/gate weights are themselves continuously adjusted by training, while the layers before and after them also keep changing; the same sample is therefore dispatched to different subsequent network units at different stages of training. If this dynamic shifting is too drastic, it severely harms the stability and convergence speed of training for the whole network.

3. Balancing per-expert sample importance and load

During training, the importance of each expert to the samples of a batch, and the balance of the load when each batch's samples are dispatched across experts, are two metrics that are both correlated and in conflict. Separate loss terms must be constructed as auxiliary losses to optimize these two metrics. This is discussed in arXiv:1701.06538, "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer".

## On conditional computation / dynamic neural networks

More on conditional computation and dynamic neural networks can be found in "Dynamic Neural Networks: A Survey", arXiv:2102.04906, in which the authors classify the techniques of generalized dynamic networks into instance-wise, spatial-wise, and temporal-wise:

1. Instance-wise Dynamic NN: per-instance dynamics; each sample independently activates different networks and parameters (MoE belongs to this direction). Dynamic Architecture: Dynamic Depth, Dynamic Width, Dynamic Routing/MoE; Dynamic Parameter: Parameter Adjustment, Parameter Prediction, Dynamic Feature(s).
2. Spatial-wise Dynamic NN: spatial dynamics; different spatial positions of images and the like activate different subsequent networks and parameters (CNNs etc.): Pixel Level, Region Level, Resolution Level.

3. Temporal-wise Dynamic NN: temporal dynamics; sequential data, split along the time dimension, activate different subsequent networks and parameters (video frames, text sequences, time series, streams, ...).

The above is the survey's overall taxonomy of dynamic NNs.

From the angle of supporting ultra-large-scale networks, with high expressiveness at low compute cost as the priority, dynamic network techniques can be classified along two dimensions:

1. By whether the forward pass is partially activated:

Hard-Dynamic: during the forward pass, parts of the network are absolutely not activated and take no part in computation.

Soft-Dynamic: during the forward pass, parts of the network lose their expressive effect, e.g., by zeroing tensor elements after a softmax-style gate/route, but still participate in computation.
2. By the input to the dynamic-activation decision:

Per-sample: (at the input layer) subsequent activation of the dynamic network is decided per sample instance.

Sub-sample: (at the input layer) different subsequent network units are activated at the temporal/spatial level within a sample. In typical deep networks, selective activation and execution occur not only at the input layer but similarly at intermediate layers.

Among these, a platform that supports hard-dynamic, per-sample dynamic neural networks obtains coarse-grained sparse activation of the network structure quite naturally, and can achieve high energy efficiency for both training and inference of very large models.

Compared with statically structured networks, dynamic neural networks have been compared extensively in the literature in terms of efficiency, expressiveness, generalization, robustness, and interpretability. From the platform's angle of supporting ultra-large networks at the lowest possible compute cost to improve model performance, Efficiency and Representation matter most:

1. Efficiency: with a static network, "pulling one hair moves the whole body": every input sample makes the entire network, all parameters, respond. For ultra-large networks chasing leading results, this energy cost is too challenging.
2. Representation: more parameters mean larger representational capacity; but in MoE-style structures, the reuse of features at each layer of the deep network decreases, so the expressive efficiency of each parameter is lower.

## Implementation strategy

Implementing ultra-large-parameter versions of the various models with dynamically routed sparse activation requires per-model study and implementation.

Take Switch Transformer as an example: its parameter expansion lies mainly in the FFN part of the Transformer. Its MoE extension is shown below:

![](https://static001.geekbang.org/infoq/34/34b2a6712a21f1263b95b376e2ea7dbb.jpeg)
(Image source: the Switch Transformer paper)

As shown, MoE-ification mainly adds MoE logic before and after the Expert subnetworks. This article focuses on the implementation on the platform. Dynamically routed conditional computation involves four main steps: route computation, data dispatch, independent computation, and result combination.

![](https://static001.geekbang.org/infoq/48/48b63cf984394ee9a9850d4b2ebe13be.jpeg)

1. Route computation (Gate): based on the input (the input of the whole network, or the output of preceding units/layers), the routing unit computes, for sample-wise routing within a batch, the subsequent route for each sample (the experts in Mixture-of-Experts/MoE).

2. Data dispatch (Dispatch): from the overall input tensor, following the sample-expert relation produced by route computation, gather and merge the tensor each expert needs to process. In a fixed expert-batch design, the number of samples dispatched to each expert in each training batch must be balanced against each expert's per-step maximum capacity; because the arrival of samples is random, a reasonably even dispatch is hard to guarantee. Batches below maximum capacity must be padded to the fixed batch size, and samples beyond the capacity can be handled by deferred resampling or similar. To maintain the correct input-output correspondence (Input/X and Label/Y) and the derivative relations of backpropagation during training, the implementation must maintain the index relation from the original batch to each expert's sub-batch, for use later in differentiation and in result combination.

3. Independent computation (Expert): invoke each expert concurrently (logically they may also run one after another) on its corresponding sub-batch. This is one of the concurrency APIs an AI platform needs to support.
4. Result combination (Combine): merge each expert's result tensor into the whole batch's tensor and, following the dispatch indices, permute it back into the order of the original input.

On mainstream deep-learning platforms, two main implementation strategies are available:

**Tensor zeroing**: for each subsequent network unit (expert subnetwork, etc.) to dispatch to, make a copy of the tensor and zero the data dimensions that should not be processed by that expert. As long as the zeroing logic is correct, this is simple to implement, uses whole-tensor operations, and makes no special demands on the platform; it suits algorithm research, exhibiting only the effect of dynamically routing the preceding data to different subsequent network units so the algorithm can be analyzed. With zeroing, however, the tensor each expert processes is still full-batch-sized along the batch dimension, so no computation or memory is saved.

**Tensor gathering**: for each subsequent network unit (expert subnetwork, etc.) to dispatch to, copy only the data that the expert should process, keeping none of the dimensions it should not, and maintain the sample-level index correspondence before and after the transformation. In a distribution-friendly implementation where expert subnetworks are partitioned across different compute nodes, the expert network is best implemented by inheriting from the platform's subnetwork-level class, e.g., mindspore.nn.Cell in MindSpore. See the implementation details in the later sections.
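The contrast between the two strategies can be sketched on a toy batch with NumPy (an illustrative sketch; the variable names and the 4x2 batch are assumptions, not the article's code):

```python
import numpy as np

# Toy batch of 4 samples and a routing decision: sample -> expert id.
batch = np.arange(8, dtype=np.float32).reshape(4, 2)
expert_ids = np.array([0, 1, 0, 1])

# Strategy 1: tensor zeroing -- every expert still sees a full-batch
# tensor, with the rows that do not belong to it zeroed out.  Simple,
# but saves neither computation nor memory along the batch dimension.
zeroed = [np.where((expert_ids == e)[:, None], batch, 0.0) for e in (0, 1)]

# Strategy 2: tensor gathering -- every expert sees only its own rows;
# the index mapping is kept so results can be restored to input order.
indices = [np.nonzero(expert_ids == e)[0] for e in (0, 1)]
gathered = [batch[ix] for ix in indices]
```

Note that `zeroed[0]` keeps the full (4, 2) shape while `gathered[0]` shrinks to (2, 2), which is exactly the saving the gathering strategy is after.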
## Core code

**Core code: route computation, data dispatch, independent computation, result combination**

The reference code below is a MindSpore sketch. (Note: import mindspore as ms)

The core logic of Mixture of Experts: for input I, run routing_network (the simplest is X*W), then topk (if a variant algorithm needs the gate weights, apply softmax; otherwise it can be skipped), then use tensor operations (per batch) to select the tensor of each subnetwork/expert.

For ease of debugging, the input and routing weights are built from tiny, non-random, deterministic values, and the routing network is a simple X*W.

1. Routing

![](https://static001.geekbang.org/infoq/2c/2cbb104e6fd38f665e969a27f1668295.jpeg)

When the 5 input samples above (only 3 classes, meant to be dispatched to 3 experts) are matrix-multiplied with the gate weights, the expert for each sample can be computed unambiguously. You can use matmul, or equivalently gates_weighted = einsum('bd,de->be', [data_inputs, gate_weights]). The result of this first matrix multiplication is:

![](https://static001.geekbang.org/infoq/be/be3454230851d48c1bd125f1ed2e5884.jpeg)

For multiplying input by weights, Python offers @, matmul, or the Einstein-summation function einsum. For a plain matrix product, einsum is actually split into several operators during graph compilation and does not perform well; but when the input and weights exceed two dimensions and routing must be computed with the batch dimension fixed, einsum makes the implementation very simple.
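The equivalence of the matmul and einsum forms of the routing step can be sketched with NumPy (illustrative values, not the article's exact inputs; NumPy's `einsum` takes the operands directly rather than in a list):

```python
import numpy as np

# 5 samples over 3 features; gate weights route to 3 experts.  One-hot
# inputs make the intended expert of each sample obvious.
data_inputs = np.array([[1., 0., 0.],
                        [0., 1., 0.],
                        [0., 0., 1.],
                        [1., 0., 0.],
                        [0., 1., 0.]])
gate_weights = np.eye(3)

# The two forms compute the same (batch, experts) score matrix.
gates_matmul = data_inputs @ gate_weights
gates_einsum = np.einsum('bd,de->be', data_inputs, gate_weights)
```

Taking the argmax of each row of `gates_matmul` recovers the expert assignment [0, 1, 2, 0, 1] directly, which is the point of the deterministic toy values.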
2. Dispatch

The main logic of dispatch is to compute, from the output of the routing network, the top-k experts for each sample, which can be done with a topk function. Since the top-selection scores may serve as input information for subsequent network units (carrying the routing information), the routing output is generally normalized with softmax.

![](https://static001.geekbang.org/infoq/c2/c2854e90ebacc19cf2327bddd0a76856.jpeg)

On-demand computation 1: normalized weights across all N experts (please refer to #2); this is just gates_weighted normalized along dim=-1. Its output is:

![](https://static001.geekbang.org/infoq/6a/6a0e7130fbf990a002cda91a2472c8b4.jpeg)

Select the top-k experts for each sample in the batch. The per-sample expert weights can be taken top-k from the softmax-ed scores, or directly top-k from gates_weighted; since softmax may be skipped or deferred here, gates_weighted can be used. This gives the expert indices for each sample in the batch.

![](https://static001.geekbang.org/infoq/64/64e647bb08cdef3ffb0f7735a6b35bd0.jpeg)

Its output is:

![](https://static001.geekbang.org/infoq/3b/3b7a1e17f50be904084645e2cc1d63a2.jpeg)

Then:

![](https://static001.geekbang.org/infoq/ae/ae4906f6e00313a36f9c3251d367b184.jpeg)

On-demand computation 2: normalized weights among the top-n experts.

As for extracting from the original input, according to the dispatch indices, the tensor that each expert should process: current mainstream platforms have no dedicated operator for this, but a similar effect can be composed from other operators. In MindSpore, one can implement an operator at the bottom layer in C++, or inherit Cell in Python and implement bprop, then organize the original gate tensor into the target output by index. Here we implement a Dispatch class.

![](https://static001.geekbang.org/infoq/53/534d217fdcfcef9ad9a0c1f6447bc941.jpeg)

![](https://static001.geekbang.org/infoq/3f/3ff81c92793c42099e9f42d3743bb56a.jpeg)
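The forward logic of such a Dispatch class can be sketched with NumPy. This is only the gather-by-index idea, with the function name `dispatch` assumed; a real MindSpore version would subclass ms.nn.Cell and implement bprop so that gradients flow back through the same index mapping:

```python
import numpy as np

def dispatch(inputs, expert_idx, num_experts):
    # Split a batch into per-expert sub-batches, remembering for each
    # expert which original rows it received (needed later by Combine
    # and, in a real Cell implementation, by backpropagation).
    sub_batches, index_map = [], []
    for e in range(num_experts):
        rows = np.nonzero(expert_idx == e)[0]
        sub_batches.append(inputs[rows])
        index_map.append(rows)
    return sub_batches, index_map

x = np.array([[0., 0.], [1., 1.], [2., 2.], [3., 3.]])
subs, idx_map = dispatch(x, np.array([1, 0, 1, 0]), num_experts=2)
```

Each expert now receives a compact sub-batch, and `idx_map` preserves the original-batch positions that Combine will need.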
3. Independent computation

Directly invoke the subsequent expert networks in parallel. The parallel part can be supported by the platform, marked with special functions or annotations, or optimized into parallel execution by the platform at compile time. (In network models without dynamically routed conditional computation, such an optimization generally does not arise.)

4. Combination

The combine logic is relatively simple: first concatenate along the batch dimension with cat, then construct a properly shaped zeros tensor and use index_add with the dispatch indices to merge the results of the expert networks back together in the order of the input, as the output of this MoE module.

![](https://static001.geekbang.org/infoq/b2/b2f0018dd9146326df29c9a707196b1e.jpeg)

This completes the full MoE computation.
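The combine step just described (cat, then index_add back into input order) can be sketched with NumPy, where `np.add.at` plays the role of index_add (the function name `combine` and the toy values are assumptions):

```python
import numpy as np

def combine(expert_outputs, index_map, batch_size):
    # Concatenate per-expert results along the batch dimension, then add
    # them into a zeros tensor at their original batch positions
    # (mirroring index_add semantics).
    merged = np.concatenate(expert_outputs, axis=0)   # cat along batch dim
    order = np.concatenate(index_map)
    out = np.zeros((batch_size,) + merged.shape[1:], dtype=merged.dtype)
    np.add.at(out, order, merged)                     # index_add
    return out

# Two experts processed interleaved halves of a 4-sample batch.
outs = [np.array([[10.], [30.]]), np.array([[20.], [40.]])]
idx_map = [np.array([0, 2]), np.array([1, 3])]
restored = combine(outs, idx_map, batch_size=4)
```

The restored tensor is back in the original input order, so downstream layers and the loss see the batch exactly as it entered the MoE module.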
## Code framework

Extending the tensor-operation logic of basic dynamic routing above into a complete training code framework:

- class Dispatch(ms.nn.Cell): the dispatch logic of routing
- class Combine(ms.nn.Cell): the assembly logic of routing
- class Route(ms.nn.Cell): the whole dynamic-routing logic; can be implemented as a relatively generic class
- class Expert(ms.nn.Cell): the platform user's own expert network
- class Network(ms.nn.Cell): the platform user's own large network
- class MSELoss(ms.nn.Cell): MSE loss, plus the auxiliary-loss logic
- class OutputLossGraph(ms.nn.Cell): outputs infer and loss; single step in PyNative mode
- class Dataset: a dataset class that merely provides valid input shapes and a reasonable X-Y correspondence, for illustration only
- def train(…): the training entry point

The complete code is on the MindSpore site: [https://gitee.com/mindspore_ci/mindspore](https://gitee.com/mindspore_ci/mindspore)
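The shape of this framework can be mirrored in a plain-Python skeleton. Here ms.nn.Cell is replaced by a bare stand-in `Cell` class so the sketch runs without MindSpore; the class responsibilities follow the list above, and everything else (the `scale` parameter, the toy routing) is an assumption for illustration:

```python
import numpy as np

class Cell:                        # stand-in for ms.nn.Cell
    def __call__(self, *args):
        return self.construct(*args)

class Expert(Cell):                # user-defined expert subnetwork
    def __init__(self, scale):
        self.scale = scale
    def construct(self, x):
        return x * self.scale

class Route(Cell):                 # dispatch -> experts -> combine
    def __init__(self, experts):
        self.experts = experts
    def construct(self, x, expert_idx):
        out = np.zeros_like(x)
        for e, expert in enumerate(self.experts):
            rows = np.nonzero(expert_idx == e)[0]    # Dispatch
            out[rows] = expert(x[rows])              # independent compute + Combine
        return out

route = Route([Expert(1.0), Expert(10.0)])
y = route(np.array([[1.], [2.], [3.]]), np.array([0, 1, 0]))
```

In the real framework, Route would hold Dispatch and Combine Cells, and Network would stack Route modules between its other layers.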
## Implementation notes on conditional computation

1. Dynamic routing

- Non-learnable routing

For example, routing with LSH (locality sensitive hashing): placing LSH at the front of the learnable network to dispatch samples avoids the problem of differentiating through the LSH part; if an LSH module is added in the middle of the network, gradient estimation is needed to pass gradients through the deterministic-algorithm part.

- Learnable routing

The simple approach is to define gate_weights as a learnable Parameter; for two-dimensional tensors, compute the routing with Python's @ or matmul; for higher-dimensional tensors where the batch dimension must stay fixed, use the einsum('bd*,*de->b*e') form.

2. The order of topk and softmax

Between the two gate implementations G_1(x) = softmax(topk(X*W)) and G_2(x) = topk(softmax(X*W)), placing softmax before or after topk does not change which top-k experts are selected. But when G is to be part of the input to the subsequent network, i.e., the routing-weight information feeds the following network units, consider: if normalized weights across all N experts are needed, place softmax before top-k; otherwise, place softmax after top-k to compute normalized weights among the top-N experts.

3. How to balance experts within each batch

Sum the per-sample routing weights, i.e., for each sample in the batch, sum the importance weights of the one or more experts it is assigned to, giving importance; sum the non-zero entries of the per-sample routing weights to find the loaded experts, giving load. Use coefficient_of_variation(importance) + coefficient_of_variation(load) as an auxiliary_loss in the optimization, to balance importance and load. The coefficient of variation is a dimensionless measure of the dispersion of data; here, more dispersion means worse balance, so the loss is optimized downward.

In models with MoE at multiple layers (places), such as Transformer, combine the several auxiliary_loss terms into one auxiliary_loss and add it to the dominated_loss.

[Click to follow and be the first to learn about Huawei Cloud's fresh technologies~](https://bbs.huaweicloud.com/blogs?utm_source=infoq&utm_medium=bbs-ex&utm_campaign=ei&utm_content=content)