Extreme Optimizations by Baidu's C++ Engineers (Memory Edition)

**Foreword:** Behind Baidu's seemingly simple interface, a huge fleet of C++ services runs in data centers across the country. Improving performance and reducing latency and cost are required coursework for Baidu's C++ engineers. As optimization work deepened, a series of performance-optimization theories and schemes were born and accumulated, among them some obscure but ingenious and practical techniques. This article collects and summarizes, from the angle of memory access, a number of typical cases of general applicability, and shares them for study and exchange.

# 1 Background

Behind Baidu's seemingly simple interface, a huge fleet of C++ services runs in data centers across the country. Heavy use of C++ is a double-edged sword for Baidu: the learning curve is steep, and pointer bugs are hard to localize and spread widely, which deters many developers. On the other hand, the low overhead added at the language level and the strong control over low-level machinery provide an excellent environment for pursuing extreme performance.

Therefore, for Baidu's C++ engineers, mastering low-level characteristics and using them to guide application performance optimization is a necessary and indispensable skill. Over time, this pursuit of extreme performance settled into a habit for Baidu engineers, even a kind of technical conviction. Below we review and share some of the theory and practice accumulated by Baidu's C++ engineers on this journey, together with the "dark arts" uncovered in pursuit of the extreme.

# 2 Rethinking performance optimization

As programmers, we all deal with performance to some degree, especially backend service engineers working mainly in C++, yet each engineer's understanding of performance optimization differs in the details. We start with a few optimization cases to build an intuitive feel, and then describe, from the underlying principles, the angle of attack and the methodology of the performance optimization discussed in this article.

## 2.1 Starting from string processing

### 2.1.1 string as a buffer

To call low-level interfaces and integrate capabilities of third-party libraries, the calling boundary often needs to exchange data between C++ strings and C-style strings. A typical pattern looks like this:

```cpp
size_t some_c_style_api(char* buffer, size_t size);
void some_cxx_style_function(std::string& result) {
    // first expand to a sufficiently large size
    result.resize(estimate_size);
    // since C++17, std::string::data() also returns a non-const pointer
    auto actual_size = some_c_style_api(result.data(), result.size());
    // finally shrink to the actual size
    result.resize(actual_size);
}
```

The problem with this approach: the first resize zero-initializes the entire storage region up to estimate_size. In this scenario, however, the actual payload is written inside some_c_style_api, so the initialization performed by resize is redundant. When the exchanged buffer is large, for example in typical transcoding and compression operations, the overhead of this redundant initialization is considerable.

To solve this, people have been pushing for a standard improvement for about three years:

http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2021/p1072r7.html

Note: those using clang + libc++ are lucky on this point; newer versions of libc++ already provide a non-standard resize_default_init that you can try today.

Before the standard lands, so that this could be used inside Baidu (where the gcc 8 and gcc 10 compilers are widely deployed), we built a resize_uninitialized for gcc with similar functionality. At Baidu, one can write:

```cpp
size_t some_c_style_api(char* buffer, size_t size);
void some_cxx_style_function(std::string& result) {
    auto* buffer = babylon::resize_uninitialized(result, estimate_size);
    auto actual_size = some_c_style_api(buffer, result.size());
    result.resize(actual_size);
}
```
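The proposal referenced above was eventually standardized as `std::string::resize_and_overwrite` in C++23. Until that is available, one portable way to avoid the redundant zero-fill is to let the C API write into a separately allocated uninitialized buffer and construct the string from the actual size afterwards. Below is a minimal sketch of that workaround; `some_c_style_api` here is a stand-in defined purely for illustration, not a real API:

```cpp
#include <cstring>
#include <memory>
#include <string>

// Stand-in for some C-style API that writes up to `size` bytes into `buffer`
// and returns the number of bytes actually written (hypothetical example).
size_t some_c_style_api(char* buffer, size_t size) {
    static const char payload[] = "hello wire format";
    size_t n = sizeof(payload) - 1 < size ? sizeof(payload) - 1 : size;
    std::memcpy(buffer, payload, n);
    return n;
}

// Portable workaround: write into an uninitialized heap buffer first, then
// construct the string from the actual size, so no byte is zero-filled up
// to `estimate`.
std::string call_without_zero_fill(size_t estimate) {
    std::unique_ptr<char[]> buffer(new char[estimate]);  // uninitialized
    size_t actual = some_c_style_api(buffer.get(), estimate);
    return std::string(buffer.get(), actual);  // one copy of the real payload
}
```

This trades the zero-fill for one extra copy of the actual payload, so it only pays off when the estimate is much larger than the typical result; the resize_uninitialized approach above avoids both.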
### 2.1.2 split string

In real business code, a typical scenario is parsing lightly-schematized data: column data separated by standard delimiters, typically '_' or '\t' (especially common in coarse processing of logs and similar information). Because the scenario is so simple, the algorithm-level optimization space is generally considered small, and in practice code like the following is widespread:

```cpp
std::vector<std::string> tokens;
// boost::split
boost::split(tokens, str, [](char c) { return c == '\t'; });
// absl::StrSplit
for (std::string_view sv : absl::StrSplit(str, '\t')) {
    tokens.emplace_back(sv);
}
// absl::StrSplit, no copy
for (std::string_view sv : absl::StrSplit(str, '\t')) {
    direct_work_on_segment(sv);
}
```

The boost version appears widely in new engineers' code; its interface is flexible and it circulates widely, but its efficiency on real workloads is not good. Compared with Google's optimized absl, the gap is a multiple, and if the engineer missed the single-character optimization (directly using is_any_of from the official example), it even reaches an order of magnitude. Going further and thinking jointly about the business shape, typical post-split processing can be made zero-copy, which further removes redundant copies and the creation of many temporary objects.

![image](https://static001.geekbang.org/infoq/f9/f98de618fd5a414f1eff15a806b1f1d3.png)

Finally, considering that Baidu's internal hardware fleet contains several generations of CPU models, reworking split to use SIMD explicitly, self-adapting across generations of vector instruction sets, yields a further performance gain. In particular, with BMI-instruction acceleration, detecting a run of consecutive delimiters within one SIMD stride, e.g. in dense short-string scenarios, can even achieve an order-of-magnitude improvement.

In the end, at Baidu we can write:

```cpp
babylon::split([](std::string_view sv) {
    direct_work_on_segment(sv);
}, str, '\t');
```
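For reference, the zero-copy callback shape used above can be sketched in portable C++17 without any SIMD. This is only an illustration of the interface contract (each segment is handed out as a view into the original buffer, no temporary strings), not Baidu's optimized implementation:

```cpp
#include <cstddef>
#include <string_view>

// Minimal zero-copy split: every delimited segment is passed to the callback
// as a std::string_view pointing into `str`, so nothing is copied. Empty
// segments (adjacent delimiters) are reported, matching common semantics.
template <typename Callback>
void split_zero_copy(std::string_view str, char delim, Callback&& on_segment) {
    size_t begin = 0;
    while (begin <= str.size()) {
        size_t end = str.find(delim, begin);
        if (end == std::string_view::npos) {
            end = str.size();
        }
        on_segment(str.substr(begin, end - begin));
        begin = end + 1;
    }
}
```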
### 2.1.3 Magic of protobuf

With brpc widely adopted inside Baidu, protobuf has become the mainstream way to exchange data internally; code that parses, rewrites, and assembles protobuf takes up a noticeable share of nearly every service. In recent years especially, with the microservice trend layered on top, this data-exchange boundary has become even more prominent.

In some scenarios, e.g. forwarding data while adding one field, or fetching column-partitioned data from several backend stores and merging it before returning, the extra cost of deserializing, modifying, and re-serializing via the standard C++ API becomes clearly visible relative to the business logic actually executed.

For example, suppose we define messages like these:

```protobuf
message Field {
  bytes column = 1;
  bytes value = 2;
}
message Record {
  bytes key = 1;
  repeated Field field = 2;
}
message Response {
  repeated Record record = 1;
  bytes error_message = 2;
}
```

Imagine a scenario where one logical record is spread over multiple subsystems, so we introduce a proxy layer to merge multiple partial records. Conventionally, the merge looks like this:

```cpp
one_sub_service.query(&one_controller, &request, &one_sub_response, nullptr);
another_sub_service.query(&another_controller, &request, &another_sub_response, nullptr);
...
for (size_t i = 0; i < record_size; ++i) {
    final_response.mutable_record(i)->MergeFrom(one_sub_response.record(i));
    final_response.mutable_record(i)->MergeFrom(another_sub_response.record(i));
    ...
}
```

For a lightweight proxy, the cost of this layer's repeated parsing, merging, and re-serialization of backend payloads stands out. To eliminate it further, we must start from the essence of protobuf.

![image](https://static001.geekbang.org/infoq/c8/c8c6c3eb735c17aceef0f65a53df9a2d.jpeg)

The foundation of protobuf is a public, standardized wire format; only on top of it sit the SDKs for construction and parsing in many languages. Therefore, to optimize parsing and merging further, bypassing the C++ API and going down to the wire-format level is a feasible and effective approach. Let's first look at a few properties of the wire format.

![image](https://static001.geekbang.org/infoq/6b/6bfe897f51f55e3d62a6273eb7d235c0.jpeg)

A serialized message is simply the concatenation of the serialized results of the fields it contains: there are no separators between fields, and no framing around the message as a whole. Thanks to this property, some merge and modify operations can be performed cheaply and safely directly on the serialized bytes. On the other hand, the wire format of a message field is exactly the same as that of a string field: defining a message field, or defining a string field and storing the serialized submessage into it, give equivalent results (and both properties are officially guaranteed):

https://developers.google.com/protocol-buffers/docs/encoding#optional

Combining these properties, we reimplemented the earlier merge as:

```protobuf
// change the proxy-side message definition to avoid deep parsing
message ProxyResponse {
  repeated string record = 1;
  bytes error_message = 2;
}
```

```cpp
one_sub_service.query(&one_controller, &request, &one_sub_response, nullptr);
another_sub_service.query(&another_controller, &request, &another_sub_response, nullptr);
...
for (size_t i = 0; i < record_size; ++i) {
    // string append replaces message merge
    final_response.mutable_record(i)->append(one_sub_response.record(i));
    final_response.mutable_record(i)->append(another_sub_response.record(i));
    ...
}
```

In a microservice environment, operations like this keep the growth of extra cost well under control.
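To make the two wire-format guarantees above concrete, here is a hand-rolled sketch of how a length-delimited field (wire type 2) is encoded. string, bytes, and embedded-message fields all use exactly this shape, which is why appending serialized submessages into a string field parses identically to merging the submessages:

```cpp
#include <cstdint>
#include <string>

// Append a protobuf varint: 7 payload bits per byte, MSB set on all but the
// last byte.
void append_varint(std::string& out, uint64_t value) {
    while (value >= 0x80) {
        out.push_back(static_cast<char>((value & 0x7f) | 0x80));
        value >>= 7;
    }
    out.push_back(static_cast<char>(value));
}

// Append a length-delimited field: key = (field_number << 3) | wire_type,
// where wire type 2 means "length-delimited", followed by the payload length
// and the raw payload bytes.
void append_length_delimited(std::string& out, uint32_t field_number,
                             const std::string& payload) {
    append_varint(out, (static_cast<uint64_t>(field_number) << 3) | 2);
    append_varint(out, payload.size());
    out.append(payload);
}
```

Since a message is just the concatenation of such field encodings, concatenating two serialized partial Records into one buffer yields bytes that any protobuf parser treats as the merge of the two.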
## 2.2 Another look at performance optimization

Generally speaking, a program's performance is shaped by roughly three components: algorithmic complexity, I/O overhead, and concurrency capability.

![image](https://static001.geekbang.org/infoq/39/39daa7e83da1f12ca295cdcd34d61e2a.png)

The primary factor is the algorithmic complexity everyone is familiar with. Optimizing or replacing one core algorithm can affect application performance at a generational level, e.g. the LSM tree's improvement of NoSQL throughput, or event triggering's improvement of epoll's high-concurrency capability. Precisely because attention there is so high, however, at the level of real engineering implementation both the probability of mistakes and the remaining optimization headroom drop considerably. In the extreme case, an engineer outside research-driven roles may rarely encounter a chance to improve a core algorithm; analyzing a problem and choosing a suitable algorithm has some share of the work, but its scope is limited.

More often, the performance problems engineers must solve are better described, borrowing a term from algorithm contests, as "constant grinding": a correct and suitable algorithm was chosen, but because the implementation is not the better one for the underlying architecture, an oversized constant is attached to the O(X), causing insufficient overall performance. In contests, constant-factor traps and constant optimization are interference that neither problem setters nor solvers want much of (the core algorithm is the point), but transplanted to real projects, constant optimization is often a key instrument of performance work.

Now let's revisit the three optimization cases above. None of them involves improving the algorithmic logic itself, yet by deeply exploiting low-level properties (of dependent libraries or of the instruction set), they still achieve multiple-fold or even order-of-magnitude effects. This has much to do with how much more complex computer architecture has become in recent years, and the typical places where those changes surface are I/O and concurrency. Concurrency optimization should already be familiar to engineers with solid project experience, but I/O, especially "memory I/O", needs special explanation.

Unlike the read/write/socket and other device I/O operations written out explicitly in code, the I/O of the memory system is easily overlooked, because it happens transparently behind ordinary CPU instructions. First, a page of numbers from Jeff Dean's classic 2009 talk:

https://www.cs.cornell.edu/projects/ladis2009/talks/dean-keynote-ladis2009.pdf

![image](https://static001.geekbang.org/infoq/15/15c234841247d44835c856eefade975d.png)

Although the data is more than a decade old, it still shows that the fastest L1 cache hit and an access to main memory are separated by multiple orders of magnitude. These operations are not explicitly controlled in code; the CPU performs them for us transparently, which simplifies programming but also introduces a problem: if a program does not adapt well to the architecture's characteristics, a seemingly identical algorithm can differ by orders of magnitude in its constant term. And because this difference is hidden, it is exactly what newer engineers overlook most easily. Below, around the theme of memory access, we survey some of the "constant optimizations" of Baidu's C++ engineers.
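The gap in the table above can be felt with a toy experiment: walk the same array once in address order and once in shuffled order. The harness below is illustrative only (names, sizes, and the seed are arbitrary); on typical hardware the shuffled walk, which defeats the prefetcher, is several times slower once the data outgrows the caches, while both walks of course compute the same sum:

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

struct WalkResult {
    long long sum;
    double seconds;
};

// Visit data[] in the given order, timing the walk. Both orders yield the
// same sum; only the access pattern, and therefore the cost, differs.
WalkResult timed_walk(const std::vector<int>& data,
                      const std::vector<size_t>& order) {
    auto start = std::chrono::steady_clock::now();
    long long sum = 0;
    for (size_t idx : order) {
        sum += data[idx];
    }
    std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return {sum, elapsed.count()};
}

std::vector<size_t> sequential_order(size_t n) {
    std::vector<size_t> order(n);
    std::iota(order.begin(), order.end(), size_t(0));
    return order;  // address order: friendly to the hardware prefetcher
}

std::vector<size_t> shuffled_order(size_t n) {
    auto order = sequential_order(n);
    std::mt19937_64 engine(42);  // fixed seed for reproducibility
    std::shuffle(order.begin(), order.end(), engine);
    return order;  // random order: defeats the prefetcher
}
```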
# 3 Starting with memory allocation

To use memory, one must first allocate it. In the C++ era, with convenient lifetime management and the various handy containers built on class encapsulation, runtime allocation and deallocation have become more and more frequent. But since the address space is a resource shared by the entire process, contention has to be considered on multi-core systems. Engineers with relevant experience will quickly think of two famous allocators, tcmalloc and jemalloc, from Google and Facebook respectively. Let's first compare their rough principles.

## 3.1 First, a look at tcmalloc and jemalloc

![image](https://static001.geekbang.org/infoq/8a/8a71c495dbaadeda358e401344ca57da.png)

Against multi-thread contention, tcmalloc and jemalloc share the idea of a thread-cache mechanism: fetch a large block from the backend once, place it in a cache serving many thread-local allocations, and thereby lower the real contention intensity on the backend. The typical difference lies in what happens when the thread cache is pierced: tcmalloc uses a single page heap to absorb it (simplifying away the transfer cache and central cache, which are likewise globally unique), whereas jemalloc uses multiple arenas (by default even exceeding the server's core count). So, consistent with the reasoning behind mainstream public benchmarks, with fewer threads or lower free pressure, the leaner tcmalloc slightly outperforms jemalloc; with many cores and high allocate/free pressure, jemalloc shows a strong performance advantage because its lock contention is far below tcmalloc's.

A typical evaluation roughly ends here, but we can think one step further: if we are willing to spend more memory on the cache layer to push backend contention down, can tcmalloc return to the superior position? The analysis above suggests that inference, and since the bottleneck of modern server programs is often not memory footprint, it looks like a viable plan.

After actually making the adjustment, however, engineers find that in most cases it does not reach the expected effect. Even when the perf profile already shows that contention overhead and the share of allocate/free actions have become tiny, overall program behavior remains unsatisfactory.

This is in fact the effect of memory-allocation continuity on performance: the thread cache's core acceleration lies in batching allocations, not merely in lowering backend pressure. Once the cache grows too large, sustained cycles of allocation and release are all absorbed by the cache, with the result that the address-space distribution of the cached blocks becomes more and more scattered, exhibiting a shuffling effect.

![image](https://static001.geekbang.org/infoq/83/83732fd8e4dc888497990dd3900b91c0.jpeg)

Architectural cache optimization generally takes locality as its criterion: memory near what a program has recently accessed is, with high probability, the next hotspot. Consequently, if the memory a program consecutively allocates and accesses keeps jumping around, the lower layers cannot perform their cache optimizations correctly. Reflected in program performance: although allocation and release have both become very cheap, the program as a whole has not improved (because the constant term of the real algorithm's memory accesses has grown).
## 3.2 Then what is the ideal malloc model?

From the preceding analysis we roughly obtained the two core factors of malloc: contention and continuity. Has jemalloc then pushed them to the limit? To answer that, we again begin with an analysis of the actual memory-usage model.

![image](https://static001.geekbang.org/infoq/30/3029fe9f5b6da7623c858e8dafa215bd.jpeg)

This is a very typical program: at its core, a continuously running thread pool; when tasks arrive, each thread performs its own duty, completing task after task. In malloc's view, this is a number of long-lived threads randomly firing malloc and free requests at arbitrary times. Based only on this view, malloc cannot assume anything more; it can only follow the basic principle of locality, giving several adjacent mallocs from one thread address space that is as contiguous as possible, while using the thread concept to partition memory and reduce contention. That is exactly what tcmalloc and jemalloc do.

But must the boundary between memory allocation and the program stop here? Without more business-level input, can malloc do no better? Let's look at memory allocation once more, from the business perspective.

![image](https://static001.geekbang.org/infoq/b0/b02948d2d62c52c78b12196f12cdbad2.png)

Microservices, streaming computation, caching: these business models cover nearly all mainstream backend-service scenarios, and they share an important characteristic in how they use memory: clearly bounded lifetimes. Going back to the early years of program design, handling each request in its own newly started thread, destroyed wholesale when done, was a typical server scheme; even with thread pools, having one thread follow a request from beginning to end remained a mature design for quite a long time.

For that early business model, malloc could in fact exploit more business information, e.g. the memory a thread dynamically requests will, with high probability, all be returned at some later point; tcmalloc's thread-cache adjustment algorithm visibly contains special optimization for exactly this kind of extra information.

But with the wide adoption of the newer subtask-level thread-pool concurrency technique, i.e. a request subdivided into multiple subtasks to fully exploit multi-core parallelism, at the interface malloc can see, business characteristics have almost vanished. All that remains visible is continuously running threads randomly calling malloc and free, with large amounts of memory whose malloc and free drift between threads.

So, against this job-oriented background, what allocation and release strategy can work better in terms of contention and locality? Below we present two schemes.
### 3.2.1 Job arena

![image](https://static001.geekbang.org/infoq/6e/6e4fe7e28fbbcecdcf603994059a9a63.png)

The first is the basic job arena scheme: each job has an independent memory allocator, and dynamic memory used inside the job is registered to the job's arena. Because the job's lifetime is clearly bounded, dynamic memory released midway is deemed not to require immediate reclamation, and this does not notably increase memory footprint. With reclamation out of consideration, allocation no longer needs block alignment, and within each thread allocations can be completely contiguous. When the job finally ends, the whole block of memory is released in one shot, greatly reducing the contention that actually occurs.

Obviously, since the business lifetime must be perceived, the malloc interface has no way to obtain this information and support it; in practice this depends on the containers used at runtime exposing their memory-allocation interface separately. Fortunately, driven by the STL, mainstream container libraries generally implement the allocator concept, even though the details are not uniform.

For example, protobuf, one of the most heavily used containers, introduced the Arena concept starting with protobuf 3.x, allowing a Message to allocate its memory structures through an Arena. Unfortunately, up to the latest 3.15, arena allocation of string fields is still not officially supported:

https://github.com/protocolbuffers/protobuf/issues/4327

But because string/bytes are types business code uses pervasively, leaving them outside the Arena would greatly dilute the actual continuity gains. Therefore, inside Baidu we maintain an ArenaString patch that reproduces what the issue and its comments describe: allocating, on the Arena, a structure that "looks like" a string. For read interfaces, since the memory representation matches string, it can be presented directly as a const std::string&. For mutable interfaces, a substitute ArenaString wrapper type is returned; where auto is used, the switch is nearly seamless.

The other heavily used family is the STL containers, including the STL's own implementations and the comparable advanced containers in boost/tbb/absl following the same interfaces. Starting from C++17, the STL tries to split the two roles previously mixed into allocator, memory allocation and instance construction; the result is the concept of PMR (Polymorphic Memory Resource). With constructor and allocator decoupled, a program no longer has to modify the types in template parameters to adapt its own memory-allocation method. PMR itself also offers an allocate-contiguously, release-wholesale allocator implementation, monotonic_buffer_resource, but it is not thread-safe.

Combining the two allocator concepts above, in practice we used a thread cache plus a lock-free circular queue (lowering contention) and whole-page acquisition with piecemeal parceling-out (improving continuity) to implement a SwissMemoryResource, which supports the STL and protobuf allocator interfaces uniformly through interface adaptation. It is finally integrated into brpc through a protocol plugin; at Baidu, we can use it like this:

```cpp
babylon::ReusableRPCProtocol::register_protocol();
::baidu::rpc::ServerOptions options;
options.enabled_protocols = "baidu_std_reuse";

class SomeServiceImpl : public SomeService {
public:
    // request and response are themselves allocated from the
    // request-level memory resource
    virtual void some_method(google::protobuf::RpcController* controller,
                             const SomeRequest* request,
                             SomeResponse* response,
                             google::protobuf::Closure* done) {
        baidu::rpc::ClosureGuard guard(done);
        // obtain the MemoryResource by converting to the concrete implementation
        auto* closure = static_cast<babylon::ReusableRPCProtocol::Closure*>(done);
        auto& resource = closure->memory_resource();
        // build some request-scoped dynamic containers
        std::pmr::vector<std::pmr::string> tmp_vector(&resource);
        google::protobuf::Arena::CreateMessage<SomeMessage>(
            &(google::protobuf::Arena&)resource);
        ...
        // when done->Run() executes, the request-level memory is
        // released wholesale
    }
};
```
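Outside Baidu's codebase, the standard library alone can already express the request-scoped, release-wholesale idea. The sketch below uses only C++17 `std::pmr` facilities; it approximates, but is not, the SwissMemoryResource described above, and it inherits monotonic_buffer_resource's lack of thread safety, so a real multi-threaded job would need one resource per thread or an internally synchronized variant:

```cpp
#include <cstddef>
#include <memory_resource>
#include <string>
#include <vector>

// One monotonic_buffer_resource per job hands out memory bump-pointer style
// (contiguous, no per-allocation bookkeeping) and frees everything wholesale
// when it goes out of scope; no individual deallocation ever happens.
size_t run_one_job() {
    char stack_buffer[4096];  // initial arena chunk; overflows go to the heap
    std::pmr::monotonic_buffer_resource resource(stack_buffer,
                                                 sizeof(stack_buffer));
    // all allocations below draw from `resource`
    std::pmr::vector<std::pmr::string> tokens(&resource);
    for (int i = 0; i < 8; ++i) {
        tokens.emplace_back("some request scoped payload");
    }
    return tokens.size();
    // leaving scope releases the whole arena at once
}
```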
### 3.2.2 Job reserve

![image](https://static001.geekbang.org/infoq/d3/d332da5800828f70c5c3c4c2c1a7f31a.jpeg)

More complex is the job reserve scheme: on top of the job arena, when a job ends, intermediate structures are not destructed and memory is not released; instead, compaction passes are performed periodically. This further requires that intermediate structures can complete a reset while retaining their memory, and that they support capacity extraction as well as reconstruction with a given capacity. Let's explain using vector as the example:

![image](https://static001.geekbang.org/infoq/a3/a33edbf8d9fbc8645a2a8d23d5fc34f1.png)

The main difference from typical vector handling: after operations such as clear or pop_back shrink the size, the contained objects are not actually destructed, only emptied and reset. Therefore, the next time a slot is used, an already constructed element can be obtained directly, and the memory within its capacity is still held. One can see that, reusing the same instance repeatedly, the container's memory and each element's own capacity gradually approach a saturation value, and repeated allocation and construction demands are both reduced. Engineers familiar with protobuf's internals will recognize this: the instance-retaining clear action is also the method protobuf's messages employ.

However, engineers who noticed the earlier locality discussion may find that although allocation demand has dropped, the memory layout of the final saturated state is still unsatisfactory in continuity; after all, the mid-flight dynamic allocations were performed on demand, without reference to locality. So the container needs to support one more action, namely rebuilding.

![image](https://static001.geekbang.org/infoq/67/670b93c4de53be6d2506bab247c44861.png)

That is, after many reuses have caused a lot of incidental allocation, it must be possible to extract the current capacity schema and perform a faithful reconstruction in a new contiguous space, letting the memory blocks return to contiguity.

## 3.3 Summarizing memory allocation

By analyzing malloc's performance principles and introducing these two fine-grained allocation and management schemes, we can obtain better memory continuity under smaller contention.

In measurements, simply reaching the job arena level generally captures most of the performance gain, typically a multiple-fold improvement, with observable savings at the whole-service level as well. Job reserve can obtain a further step of improvement, but on the one hand, if non-protobuf containers are involved, custom schema-extraction and rebuild interfaces must be implemented, and on the other hand, the capacities trending toward saturation also enlarge memory usage somewhat. Since its adoption cost is considerably higher, in practice it is used only in the links where performance matters most.
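As a minimal illustration of the clear-without-destruction idea from §3.2.2 (a sketch only; the real containers additionally support the schema extraction and rebuild described above):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// A vector-like holder of std::string in which clear() only resets a logical
// size: element instances and their heap buffers (capacity) are kept, so a
// later emplace_back() reuses an already constructed, already sized string.
class ReusableStringVector {
public:
    std::string& emplace_back() {
        if (_size < _storage.size()) {
            _storage[_size].clear();  // reuse the slot, keep its buffer
        } else {
            _storage.emplace_back();
        }
        return _storage[_size++];
    }
    void clear() { _size = 0; }  // nothing is destructed
    size_t size() const { return _size; }
    std::string& operator[](size_t i) { return _storage[i]; }

private:
    std::vector<std::string> _storage;
    size_t _size = 0;
};
```

Reused across jobs, both the container's storage and each element's capacity converge toward a saturation value, as the text above describes.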
# 4 A closer look at memory access

Once memory has been allocated, the stage of actually accessing it begins. Access requirements can generally be decomposed along two dimensions: sequential access from a single thread, and shared access from multiple threads. The following two parts discuss the optimization methods for each dimension.

## 4.1 Optimizing sequential access

Generally speaking, when we perform a large batch of memory accesses, efficiency improves if the addresses are contiguous. A typical example is container traversal: data organized in an array usually has a clear performance advantage over data organized in a linked list. In fact, the allocation-side techniques introduced earlier, which strengthen the address contiguity of consecutively allocated (and thus usually consecutively accessed) space, serve this same goal.

Let us first look at why contiguous access produces a performance difference.

![figure](https://static001.geekbang.org/infoq/87/87db9eed783c4f857e4b4b477775f517.png)

Here we take Intel CPUs as an example to briefly describe the prefetch process (for details see https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf). When the hardware detects a contiguous address access pattern, it activates prefetchers at multiple levels, which, taking current load and other factors into account, bring the data predicted to be accessed into a suitable cache level. When the actual access arrives later, the data can be served from a nearer cache level, speeding up the access. Because L1 capacity is limited, the L1 hardware prefetcher uses a short stride and mainly accelerates L2-to-L1 movement, while the L2 and LLC prefetchers use longer strides to move data from main memory into the cache.

Here the concept of locality acts as a kind of contract between software and hardware. Since programs naturally exhibit some locality in their access patterns, hardware vendors accelerate such patterns by predicting the locality of program designs, striving for a general-purpose speedup. Software designers, in turn, try to express more locality in their designs to activate the optimization paths the hardware vendors built, pushing the performance of concrete programs further. In a sense, this is a cycle in which the two sides reinforce each other.

The following sample demonstrates how to respect locality, and what effect locality has on a program.

```cpp
// A large storage area holds fixed-dimension float vectors;
// vecs stores the starting address of each vector.
std::vector<float> large_memory_buffer;
std::vector<float*> vecs;

std::shuffle(vecs.begin(), vecs.end(), random_engine);

for (size_t i = 0; i < vecs.size(); ++i) {
    if (i + step < vecs.size()) {
        // step is a tuned prefetch distance
        __builtin_prefetch(vecs[i + step]);
    }
    dot_product(vecs[i], input);
}
```

This is a common inner-product scoring scenario in recommendation/search systems, namely large-scale scoring through vector computation. Depending on whether shuffle and prefetch are present, the same code performs roughly as follows:

1. shuffle & no prefetch: 44ms
2. shuffle & prefetch: 36ms
3. no shuffle & no prefetch: 13ms
4. no shuffle & prefetch: 12ms

The difference between 1 and 3 shows that, with completely identical instructions, the performance gap between different memory access orders can reach a multiple-fold level. The difference between 1 and 2 shows that manually adding prefetch operations improves performance somewhat, and presumably more precise control of the prefetch stride and of the L1/L2 distribution still has room to help. However, instruction timing and hardware behavior are hard to match perfectly; manual prefetching is generally used in scenarios that cannot be restructured into physically contiguous layouts, and tuning it is often a dark art. Finally, 3 and 4 show that even under contiguous access, prefetch still yields some limited gain, presumably related to repeated prediction cold starts caused by hardware prefetchers being unable to cross page boundaries; the effect, though, is already quite weak.

The most instructive case may be exactly this inner-product scoring scenario. Sometimes, to save space, engineers design the program to collect vector pointers from scattered locations into an array or linked list for management. Naturally this avoids memory redundancy, with everything referencing a single copy of the data. But if some redundancy is introduced, copying the needed vector data together into a contiguous region brings a clear performance improvement to the traversal computation at query time.

## 4.2 Optimizing concurrent access

Concurrent access is best introduced starting from a concept: the cache line.

To avoid frequent interaction with main memory, the cache system actually adopts an approach similar to malloc: it defines a minimal unit, called the cache line (generally 64B on mainstream CPUs), and all transfers between memory and cache are performed in whole cache lines.

For example, under contiguous access, touching the first byte brings all 64B into L1, and accesses to the following 63 bytes are then served directly by L1. The first issue in concurrent access is therefore cache line isolation. That is, as a general rule, data located in two different cache lines can be loaded, evicted, and transferred truly independently (since a cache line is the minimal unit moved between caches).
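The cache-line isolation just described can be made concrete with a small sketch (our example, assuming the common 64-byte line size): two counters hammered by two threads, laid out either adjacently in one cache line or separated onto their own lines with `alignas(64)`.

```cpp
#include <atomic>
#include <thread>

// Two counters updated by two different threads. In SharedLayout they sit in
// the same 64-byte cache line, so each write invalidates the other core's
// copy of the line; in IsolatedLayout, alignas(64) places each counter on its
// own cache line, so the two cores do not disturb each other.
struct SharedLayout {
    std::atomic<long> a{0};
    std::atomic<long> b{0};  // same cache line as a
};

struct IsolatedLayout {
    alignas(64) std::atomic<long> a{0};
    alignas(64) std::atomic<long> b{0};  // own cache line
};

template <typename Layout>
void hammer(Layout& layout, long iterations) {
    std::thread t1([&] {
        for (long i = 0; i < iterations; ++i) {
            layout.a.fetch_add(1, std::memory_order_relaxed);
        }
    });
    std::thread t2([&] {
        for (long i = 0; i < iterations; ++i) {
            layout.b.fetch_add(1, std::memory_order_relaxed);
        }
    });
    t1.join();
    t2.join();
}
```

Timing `hammer` on the two layouts typically shows the isolated layout running markedly faster, which is exactly the cache-line transfer effect described above.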
The typical problem here is known as false sharing: two pieces of data with no real contention are inadvertently placed in the same cache line, so that the architecture introduces "contention that should not exist". This is well covered online, for example the design documents of brpc and disruptor both explain it in detail, so it is not expanded further here.

## 4.3 First, a few words on cache coherence

With false sharing ruled out, what remains is genuine sharing: scenarios where data within the same cache line (often the very same datum) really must be modified by multiple cores. Since the cache exists in multiple copies in a multicore system, the consistency of these copies must be considered. This consistency is generally guaranteed by a state-machine protocol (MESI and its variants).

![figure](https://static001.geekbang.org/infoq/ac/ac52bf244a6a367dceaa734ab2533241.png)

Roughly, when contending writes occur, the cores compete for ownership; a core that has not obtained ownership can only continue its own modification after synchronizing to the latest modified result. It is worth mentioning a widely circulated claim here: that the introduction of the cache system brought inconsistency and thereby caused all the multithreaded visibility problems.

That is rather unfair. MESI is in essence a "coherence" protocol: a cache system that follows the protocol actually presents sequential consistency to the CPU cores above it. A comparison shows that the way caches behave under contention is in fact the same as it would be with main memory alone.

![figure](https://static001.geekbang.org/infoq/75/75606e58425be3db7007024f701af510.png)

The blocking point merely moved from contending writes to a single physical main-memory cell to protocol-based competition for exclusive ownership among multiple physical cache units. And precisely because the contention blocking was not alleviated, the introduction of caches came paired with another component: the store buffer.

Note: the introduction of the store buffer was also driven by out-of-order execution; it will come up again in the concurrency installment.

![figure](https://static001.geekbang.org/infoq/52/52e357e531e6ae6e8720f0cf652f7f4c.png)

It is the introduction of the store buffer that truly begins to bring visibility problems.

Taking x86 as an example: when a multicore write contention occurs, a write that has not obtained ownership cannot take effect in the cache layer, but it can instead wait in the store buffer. The CPU can usually avoid stalling and begin executing subsequent instructions, so although from the CPU's viewpoint the write instruction came first and the read instruction second, from the cache's viewpoint the read happens first, while the write is blocked until the cache resynchronizes. Note that if we focus on the cache interface, sequential consistency is still preserved overall; but at the instruction interface, the order has been inverted. This is the classic StoreLoad reordering (a program-order Store followed by a Load taking effect as Load then Store), and it is the only reordering scenario on x86. For typical RISC systems (ARM/POWER), to achieve higher pipeline parallelism, the store buffer is generally not promised to be FIFO; when one write is stuck in the store buffer, later writes may be processed first, additionally causing StoreStore reordering.

The store buffer lets the instruction stream avoid stalling the moment contention appears, tolerating it until the buffer fills. But because completing a cached write requires every L1 to finish its invalidation, as core counts grow, long-tail L1 invalidations can stall the store buffer. Some RISC systems therefore introduced a further buffering mechanism.

![figure](https://static001.geekbang.org/infoq/e7/e703c1bfd98ade0c4c09d1b3b5287358.png)

This further mechanism is generally called the invalidate queue: a write is considered complete as soon as its invalidation messages have been posted to each L1's invalidate queue, so invalidation long tails no longer delay writes. This step even genuinely breaks cache coherence in part: unless a core waits for its pending invalidation messages to drain, it may read stale data.

By this point it should be clear that, in order to optimize the vast majority of ordinary operations, modern architectures introduced multiple mechanisms that affect consistency. Yet to build correct cross-thread synchronization, consistency at certain key points remains indispensable.

Hence the companion instructions: for example, mfence on x86 directs subsequent loads to wait until the store buffer has fully taken effect, and lda on ARMv8 ensures subsequent loads wait for invalidations to complete. Because this layer is strongly tied to machine types and instruction design, and combinations of these instructions produce many different memory visibility effects, the cost for engineers to write correct consistent programs rose sharply, and cross-platform portability became hard to guarantee. This is where standardization comes in, and the standardized specification in the field of memory consistency is the memory order.
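In C++, these machine-specific fences are wrapped by the portable `std::atomic_thread_fence` facility; here is a minimal message-passing sketch under that interface (our example, not from the original article). The release fence keeps the payload write from sinking below the flag store, and the acquire fence keeps the payload read from rising above the flag load; the compiler lowers them to the appropriate platform instructions.

```cpp
#include <atomic>

int payload = 0;
std::atomic<int> flag{0};

void fence_writer(int value) {
    payload = value;
    // Keep the payload write from being reordered after the flag store.
    std::atomic_thread_fence(std::memory_order_release);
    flag.store(1, std::memory_order_relaxed);
}

int fence_reader() {
    while (flag.load(std::memory_order_relaxed) == 0) {
        // spin until the writer publishes
    }
    // Keep the payload read from being reordered before the flag load.
    std::atomic_thread_fence(std::memory_order_acquire);
    return payload;
}
```

Per the standard's fence rules, a release fence sequenced before a relaxed store synchronizes with an acquire fence sequenced after a relaxed load that reads that store's value, so the reader is guaranteed to see the payload.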
## 4.4 A further look at memory order

As a protocol mechanism, memory order plays the same role as other protocols: it clearly defines the functionality of an interface layer. Architecture experts abstracted, from the physical-level optimization techniques, several logical consistency levels of different strengths to describe them. Once this abstraction became a shared boundary standard, hardware and software developers could carry out their optimization work independently, ultimately producing cross-platform, general-purpose solutions.

For hardware designers, as long as some specific instruction or instruction combination can ultimately implement the functions these memory order specifications require, any design extension is in principle feasible, with no software compatibility risk to consider. Likewise, for software developers, as long as they understand consistency at the standard logical level and use the correct memory orders, they can write portable multithreaded programs without attending to underlying platform details.

The official definition of memory order is a lengthy read. To make it easier to grasp, the following pulls out some concise (if not theoretically complete) concepts from the perspective of what a developer needs to know, to aid memory and understanding.

First, let us see what actually happens behind memory order.

```cpp
int payload = 0;
int flag = 0;
void normal_writer(int i) {
    payload = flag + i;
    flag = 1;
}
int normal_reader() {
    while (flag == 0) {
    }
    return payload;
}
```

![figure](https://static001.geekbang.org/infoq/7d/7d2fa6e4220491bc4109338bc6949c80.png)

In this sample one can see that at the compilation layer, by default, unrelated instructions may be reordered to some extent (provided correctness is not affected). Moreover, the compiler may by default assume there is no interference from other threads, so multiple consecutive memory accesses to the same datum can be elided.

Next, the most basic memory order level: relaxed.

```cpp
int payload = 0;
std::atomic<int> flag{0};
void relaxed_writer(int i) {
    payload = flag.load(std::memory_order_relaxed) + i;
    flag.store(1, std::memory_order_relaxed);
}
int relaxed_reader() {
    while (flag.load(std::memory_order_relaxed) == 0) {
    }
    return payload;
}
```

![figure](https://static001.geekbang.org/infoq/44/44aa696faf8c5cd404407c4050b2616d.png)

With the basic relaxed level in use, the compiler no longer assumes freedom from interference by other threads, and flag is reloaded in every loop iteration. One can also observe that the reordering of flag and payload has disappeared, although in principle relaxed does not guarantee ordering, so this ordering is not a promise the compiler commits to. Overall, the relaxed level differs little from plain reads and writes; it only guarantees that the corresponding memory accesses cannot be elided.

One level further is consume-release, which currently has no real implementations and is generally promoted to the next level, the first genuinely meaningful one: acquire-release. In principle it can be understood in a simplified requirement/promise form:

- Requirement: a write (release) A and a read (acquire) B are performed on the same variable M, and B reads the value written by A.
- Promise: all other writes before A are visible to reads after B.
- Practical effects: the operations involved are not reordered across A/B;
  1. x86: no extra instructions;
  2. ARMv8: drain the store buffer before A, drain the invalidate queue after B, keep A/B in order;
  3. ARMv7 & POWER: full barrier before A, full barrier after B.

```cpp
int payload = 0;
std::atomic<int> flag{0};
void release_writer(int i) {
    payload = flag.load(std::memory_order_relaxed) + i;
    flag.store(1, std::memory_order_release);
}
int acquire_reader() {
    while (flag.load(std::memory_order_acquire) == 0) {
    }
    return payload;
}
```

![figure](https://static001.geekbang.org/infoq/c6/c61f90a185006324be0984da71db3f84.png)

![figure](https://static001.geekbang.org/infoq/9b/9bd43ba15b1b6f0ec2c698463167564c.png)

Since x86's default memory order is no weaker than acquire-release, ARMv8 assembly is used here to demonstrate the effect. One can see the corresponding instructions were substituted, from str/ldr to stlr/ldar, using the ARMv8 architecture to implement the corresponding memory order semantics.

One level further is the strongest: sequentially-consistent, which in effect restores the guarantee level of MESI, namely sequential consistency. It can likewise be understood in a requirement/promise form:

- Requirement: (sequentially consistent) writes Am and An are performed on two variables M and N, and in some thread, Am is observed before An through (sequentially consistent) reads.
- Promise: any other thread observing through (sequentially consistent) reads B will also observe Am before An.
- Practical effects:
  1. x86: drain the store buffer after Am and An; the read B needs no extra instructions;
  2. ARMv8: drain the store buffer before Am and An, drain the invalidate queue after B, keep ordering;
  3. ARMv7: full barriers before and after Am and An, full barrier after B;
  4. POWER: full barrier before Am and An, full barriers before and after B.

It is worth noting that starting with ARMv8, the sequentially-consistent level was deliberately optimized, omitting the full-barrier cost. Presumably this is because sequential consistency is provided as the default level in the std::atomic implementation, so it received dedicated optimization to improve performance in the general case.
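The sequentially-consistent promise above can be observed with the classic store-buffering litmus test (our example, not from the original article): under seq_cst there is a single total order over all four operations, so at least one of the two loads must see the other thread's store; with relaxed or even acquire-release operations, both loads may read 0, because each store can linger in its core's store buffer.

```cpp
#include <atomic>

std::atomic<int> x{0};
std::atomic<int> y{0};
int r1 = 0;
int r2 = 0;

// Thread 1: store x, then load y.
void sb_thread1() {
    x.store(1, std::memory_order_seq_cst);
    r1 = y.load(std::memory_order_seq_cst);
}

// Thread 2: store y, then load x.
void sb_thread2() {
    y.store(1, std::memory_order_seq_cst);
    r2 = x.load(std::memory_order_seq_cst);
}
// Under seq_cst the outcome r1 == 0 && r2 == 0 is forbidden: whichever store
// comes first in the total order is visible to the other thread's load.
```

On x86 this is exactly where the compiler must emit an mfence (or use a locked instruction) after each store, matching the "drain the store buffer after Am and An" rule listed above.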
## 4.5 How understanding memory order helps us

First, the conclusion of a basic benchmark, as a set of comparison figures:

1. multithreaded contending writes to adjacent addresses, sequentially-consistent: 0.71 time units
2. multithreaded contending writes to adjacent addresses, release: 0.006 time units
3. multithreaded contending writes to cache-line-isolated addresses, sequentially-consistent: 0.38 time units
4. multithreaded contending writes to cache-line-isolated addresses, release: 0.02 time units

As the numbers show, cache line isolation brings a certain gain under the sequentially-consistent memory order, but actually has a negative effect under the release order. This is because under release, with no strong memory barrier, the store buffer already plays a contention-mitigating role; and once contention is sufficiently mitigated, cache line isolation means more cache lines must be transferred for the same throughput, so the overhead instead grows.

Guided by this information, when implementing our lock-free queue we adopted a circular array plus per-slot version numbers. Since queue operations only need the acquire-release level, the per-slot version numbers need no cache-line-isolated layout, and the whole achieves rather high concurrent performance.

![figure](https://static001.geekbang.org/infoq/0b/0ba9464a16aac908835daa3000c5244c.png)
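A minimal sketch of a circular-array queue with per-slot version numbers, in the spirit described above (modeled on D. Vyukov's well-known bounded MPMC queue, not Baidu's actual implementation): each slot carries a version ("sequence") number, all slot hand-offs use only acquire-release operations, and the slots are deliberately not cache-line isolated.

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

// Bounded MPMC queue: a circular array where each slot's version number
// encodes whether the slot is free or filled for a given ticket.
template <typename T>
class SlotVersionQueue {
public:
    // capacity must be a power of two so that (pos & mask_) wraps correctly.
    explicit SlotVersionQueue(size_t capacity)
        : slots_(capacity), mask_(capacity - 1) {
        for (size_t i = 0; i < slots_.size(); ++i) {
            slots_[i].version.store(i, std::memory_order_relaxed);
        }
    }

    bool push(T value) {
        size_t pos = tail_.load(std::memory_order_relaxed);
        for (;;) {
            Slot& slot = slots_[pos & mask_];
            size_t ver = slot.version.load(std::memory_order_acquire);
            long dif = static_cast<long>(ver) - static_cast<long>(pos);
            if (dif == 0) {  // slot free for this ticket
                if (tail_.compare_exchange_weak(pos, pos + 1,
                                                std::memory_order_relaxed)) {
                    slot.value = std::move(value);
                    // publish: release pairs with the consumer's acquire load
                    slot.version.store(pos + 1, std::memory_order_release);
                    return true;
                }
            } else if (dif < 0) {
                return false;  // queue full
            } else {
                pos = tail_.load(std::memory_order_relaxed);
            }
        }
    }

    bool pop(T& out) {
        size_t pos = head_.load(std::memory_order_relaxed);
        for (;;) {
            Slot& slot = slots_[pos & mask_];
            size_t ver = slot.version.load(std::memory_order_acquire);
            long dif = static_cast<long>(ver) - static_cast<long>(pos + 1);
            if (dif == 0) {  // slot filled for this ticket
                if (head_.compare_exchange_weak(pos, pos + 1,
                                                std::memory_order_relaxed)) {
                    out = std::move(slot.value);
                    // mark the slot free for the next lap of producers
                    slot.version.store(pos + slots_.size(),
                                       std::memory_order_release);
                    return true;
                }
            } else if (dif < 0) {
                return false;  // queue empty
            } else {
                pos = head_.load(std::memory_order_relaxed);
            }
        }
    }

private:
    struct Slot {
        std::atomic<size_t> version{0};
        T value{};
    };
    std::vector<Slot> slots_;
    const size_t mask_;
    std::atomic<size_t> head_{0};
    std::atomic<size_t> tail_{0};
};
```

Note the design choice mirrored from the article: the slot versions are packed densely rather than alignas(64)-padded, since their hand-offs only use acquire-release and the store buffer absorbs the contention.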