Ray Distributed Computing Framework: A Detailed Walkthrough

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Ray is a distributed framework for machine learning from UC Berkeley's RISELab. UC Berkeley professor Ion Stoica wrote an article, "},{"type":"link","attrs":{"href":"https://anyscale.io/blog/the-future-of-computing-is-distributed/","title":null},"content":[{"type":"text","text":"The Future of Computing is Distributed"}]},{"type":"text","text":", that explains in detail why Ray was created. In short, the rapid growth of AI and big data places much higher demands on applications and hardware, so a better-suited software architecture is needed to meet large-scale, real-time computing needs."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Ion Stoica is also a founder of Databricks, the company behind Spark, and a project lead for Apache Mesos, Alluxio, and Clipper."}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Ray's Features"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"It took us about a year and a half to go from first learning about Ray to running it in production. We came to Ray mainly because we needed high-performance computing. Our main use case is real-time attribution analysis for investments; the data volumes are large, so the computational demands are high. There was one more key hurdle: financial models rely heavily on Pandas and NumPy for matrix computation, so we needed good support for both. Based on our experience, Ray has the following features:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Distributed asynchronous calls"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Memory scheduling"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Pandas/NumPy 
distributed support"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Python support"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Excellent overall performance"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"There are many frameworks similar to Ray, but if we narrow the scope to Python users, the main ones are Dask, Mars, and Celery."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Comparison with Dask"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Dask is an Anaconda product, and its main contributor is Matthew Rocklin. The Dask "},{"type":"link","attrs":{"href":"https://blog.dask.org/","title":null},"content":[{"type":"text","text":"blog"}]},{"type":"text","text":" and Matthew Rocklin's "},{"type":"link","attrs":{"href":"https://matthewrocklin.com/blog/","title":null},"content":[{"type":"text","text":"blog"}]},{"type":"text","text":" are worth checking regularly; both are updated frequently and the content is good. Matthew Rocklin recently joined Nvidia and started working on Dask_cudf, a GPU-based Dask that builds on "},{"type":"link","attrs":{"href":"https://github.com/rapidsai/cudf","title":null},"content":[{"type":"text","text":"RAPIDS cuDF"}]},{"type":"text","text":" (a GPU-based Pandas). His latest blog post shows that he is preparing to "},{"type":"link","attrs":{"href":"http://matthewrocklin.com/blog/work/2020/01/08/founding-1","title":null},"content":[{"type":"text","text":"found a Dask company"}]},{"type":"text","text":" to advance the Python 
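distributed data platform."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The core goal of Dask is to make up for Python's shortcomings in data science, mainly in performance. A single Python machine cannot support fast computation over the large datasets found in data science."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Dask provides basic data structures on top of a distributed computing architecture. The data structures include Array and DataFrame. Array is compatible with NumPy's ndarray, and DataFrame is compatible with the Pandas DataFrame. \"Compatible\" is a relative term: Pandas and NumPy have been developed for many years, are still evolving, and expose a very large number of interfaces. Dask does compatibility very well, but there are still differences from Pandas/NumPy that you need to keep in mind."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As mentioned above, Dask aims to make up for Python's shortcomings in data science, whereas Ray set out to speed up machine learning tuning and training. Beyond the basic computing platform, Ray also includes Tune (hyperparameter tuning) and RLlib (reinforcement learning)."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Python is the foundation and core of both data science and machine learning, so distributed execution and support for Pandas/NumPy are fundamental to both frameworks. There is one difference, though, and it is the main reason we decided to adopt Ray: Ray's underlying in-memory data structures are based on Apache Arrow, while Dask's are based on xarray/xray. xray is an open-source, NumPy-like data structure sponsored by NumFOCUS. Apache Arrow has a better ecosystem and is accepted by most data processing systems, which makes integration with other systems much easier."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Apache Arrow and Plasma"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Apache Arrow is a columnar in-memory data structure that has become the most widely used data structure in data processing. Arrow 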
stands out most for its rich ecosystem and excellent performance."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"One of the committers and co-creators behind Arrow is "},{"type":"link","attrs":{"href":"https://wesmckinney.com/pages/about.html","title":null},"content":[{"type":"text","text":"Wes McKinney"}]},{"type":"text","text":". Wes McKinney is the author of Pandas, so Arrow's support for Pandas is also very good."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The Ray team built an in-memory data service on top of Arrow called Plasma. Plasma runs as a separate process and creates Arrow-wrapped objects in Linux shared memory. Other processes access the Arrow storage in that shared memory through the Plasma Client Library. The Ray team developed this component and contributed it to Arrow as part of the Arrow ecosystem."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Besides Plasma, the Ray team also contributed Modin, an implementation of Pandas built on Ray's distributed capabilities, similar to Dask's DataFrame. Modin has since been split out of Ray and is now an independent project."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Software Architecture"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For Ray's basic architecture, see the paper "},{"type":"link","attrs":{"href":"https://arxiv.org/abs/1712.05889","title":null},"content":[{"type":"text","text":"https://arxiv.org/abs/1712.05889"}]},{"type":"text","text":" 
"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/b7/b7d6089e468c590996a26f9b0afa1657.png","alt":null,"title":"","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"After several releases some parts have been optimized, but the main structure is still as shown above. The GCS (Global Control Store) acts as a centralized server and is the link through which Workers exchange messages. Each Server has a shared Object Store, which is the in-memory data store built with Apache Arrow/Plasma. The Local Scheduler handles scheduling inside a Server and communicates with Workers on other Servers through the GCS. The Object Stores also communicate with one another in order to transfer data between Workers."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The paper describes a typical remote call flow:"}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/30/30aa6b38404a79b18873d4f3e0e8fe73.png","alt":null,"title":"","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As shown, the GCS stores the code, input arguments, and return values. Workers communicate with the GCS through the Local Scheduler. The Local Scheduler is the Raylet, 
the basic scheduling service on a single machine."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/82/82f63f3ccd5e3ebd83371a9bfd3b9e95.png","alt":null,"title":"","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Objects larger than 100 KB are transferred through parallel RPCs between Object Stores rather than through the task-scheduling RPC. Since version 0.15, Apache Arrow has provided an RPC framework called Apache Arrow Flight, which was further strengthened in 0.16. I do not know whether Ray's parallel object transfer uses Arrow Flight. The figure below shows an example of a task-scheduling RPC:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/42/426716b6549561210fa4a61c7e23b32f.png","alt":null,"title":"","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The two diagrams above are from "},{"type":"link","attrs":{"href":"https://medium.com/distributed-computing-with-ray/how-ray-uses-grpc-and-arrow-to-outperforsm-grpc-43ec368cb385","title":null},"content":[{"type":"text","text":"How Ray Uses gRPC and Arrow to outperform gRPC"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Next, let's take a detailed look at 
Raylet."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Raylet"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"I redrew a simpler diagram of the relationship between Workers and the GCS:"}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/9c/9c9bc7b89bad7232b4ee8995f48a9b5b.png","alt":null,"title":"","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The Raylet plays a critical role in the middle. It contains several important components:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Node Manager"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Object Manager"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"gcs_client or gcs server"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The Node Manager is an asynchronous communication module based on boost::asio that mainly manages connections and message handling. The Object Manager manages the Object Store. gcs_client is the client that connects to the GCS. If RAY_GCS_SERVICE_ENABLED is set to True, this Server acts as the GCS 
at startup."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Let's first look at the Raylet startup process:"}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/b2/b2bc6e7533a4b8a005ac3f1277d87442.png","alt":null,"title":"","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"First, the Raylet is initialized. This involves many parameters and includes initializing the Node Manager and the gcs client. After the Raylet starts, it registers with the GCS and gets ready to receive messages. Whenever a message arrives, it is handled by the Node Manager's ProcessClientMessage. Before explaining what ProcessClientMessage does, we need to understand the process/thread and communication model of Ray Workers and the Raylet."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Communication Model"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Ray uses the boost::asio asynchronous communication model. A rich and comprehensive write-up is available in this "},{"type":"link","attrs":{"href":"http://spiritsaway.info/asio-implementation.html","title":null},"content":[{"type":"text","text":"introduction to asio"}]}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/e3/e3298e1b5839889fb8ff05a82f11e846.png","alt":null,"title":"","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Asio uses the Proactor model. After passing through the Initiator, an operation is broken down into an Asynchronous Operation Processor (AOP), an Asynchronous Operation (AO), and a Completion Handler (CH). The AOP 
does the actual work of executing the asynchronous operation. When the operation completes, it puts the result into the Completion Event Queue (CEQ). The Asynchronous Event Demultiplexer (AED) waits on the CEQ and, when a completion event appears, hands that event to the CH."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The Raylet starts a main_service, which is a "},{"type":"codeinline","content":[{"type":"text","text":"boost::asio::io_service"}]},{"type":"text","text":". io_service is also the core component that drives asio: the AOP, AED, and Proactor described above are all tied together by io_service. Internally, io_service implements a task queue whose tasks are void(void) functions."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The io_service interface includes run, run_one, poll, poll_one, stop, reset, dispatch, and post. run loops over and executes all tasks in the queue; when there is nothing to execute, it blocks in epoll_wait."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"cpp"},"content":[{"type":"text","text":"\n// Initialize the node manager.\nboost::asio::io_service main_service;\nmain_service.run();"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"During initialization, the Node Manager creates a worker pool sized by num_initial_workers. It then uses asio's asynchronous mechanism to assign tasks to the processes in that worker pool."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Next, let's look at the Raylet, Worker, and GCS 
message passing and scheduling mechanisms."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Message Passing and Scheduling"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The blog of Anyscale.io, the company behind Ray, has an article called "},{"type":"link","attrs":{"href":"http://anyscale.io/wp-content/uploads/2020/02/Edward-Oakes-Anyscale-RayMeetup-20200130-3.pdf","title":null},"content":[{"type":"text","text":"Fast Scheduling in Ray 0.8"}]},{"type":"text","text":", which explains how scheduling was optimized in Ray 0.8."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/e3/e3a29e267bdc3e94cdc535a7d90ab021.png","alt":null,"title":"","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A Worker submits a task to the Raylet, and the Raylet assigns the task to another Worker. The Raylet also has to report the task and the related Worker information to the GCS. The task's arguments and return values are all obtained through the Object Store."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Next, let's look at the message passing in detail, along with the corresponding execution steps."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Start with the Submit Task operation: when a Worker submits a Task, it sends a SubmitTask request to the Raylet. Inside the Raylet, Tasks are tracked by a Lineage mechanism, which is the \"task lineage\" in the Anyscale diagram above."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Let's first get to know the Task Lineage 
mechanism."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Task Lineage"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Task Lineage involves several concepts: Lineage Cache, Lineage Entry, and Lineage. Lineage manages the DAG (directed acyclic graph) of Task execution; Lineage Entry manages the state of a Task; Lineage Cache manages the cache of Tasks executed on the local machine. The "},{"type":"link","attrs":{"href":"http://anyscale.io/wp-content/uploads/2020/02/Edward-Oakes-Anyscale-RayMeetup-20200130-3.pdf","title":null},"content":[{"type":"text","text":"Fast Scheduling in Ray 0.8"}]},{"type":"text","text":" article mentioned above is mainly about improving Ray 0.8 scheduling performance by optimizing Lineage."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/a7/a7ee922366e197c1e58b3081daeb865f.png","alt":null,"title":"","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In Ray 0.8, the flow for calling another Worker changed: instead of going from the Raylet to the GCS and then to the Worker, the Lineage Cache is queried directly, and if that Worker has been called before, the request goes straight to it. This shortens the call path and improves efficiency."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Back to the Lineage 
analysis."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A Task has several states in the GCS: None, Uncommitted, Committing, and Committed. None means the Task does not exist in the Lineage Cache. After a Worker submits the Task, it is Uncommitted. When the Task changes, through some operation or a resubmission, it becomes Committing, meaning a commit is in progress and a status reply is awaited. Once the commit is acknowledged, the Task is Committed. One difference is that the Task is not deleted: when the next task is scheduled onto the same Worker, it can reuse this Task Entry directly. This is the optimization described above."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A TaskEntry stores a Task's state and its relationships. It mainly contains:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"GcsStatus: the Task state described above"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"parent_task_ids_: a Set holding the IDs of the Task's parent tasks"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"forwarded_to_: a Set holding the IDs of the Node Managers that the task has been explicitly forwarded to"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Lineage maintains two maps: one from Task to LineageEntry, and one from TaskID to a Set of TaskIDs. The second maps each Task to its set of child Tasks."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"LineageCache is the cache table for Tasks and holds each Task's information and state. The Lineage Cache policy is to hold all tasks that are in the 
Uncommitted state. For safety, a child task can be removed only after all of its parent tasks have been removed."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Lineage has many more details and is still being optimized. Let's walk through a Task submission to see how Lineage works in practice."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Submitting a Task"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"After Submit Task, the Raylet first records that a Task has been added. It then gets the Task Spec to submit, which holds the Task's details, and submits it. A Task can be in one of several states:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Placeable: the Task is ready to be placed on a Node, either local or remote. Placement depends on resource availability, for example local memory and whether the maximum number of Tasks would be exceeded. If local resources are insufficient, the Task is forwarded to another Node (that is, another server). If that Node also lacks resources, the Task is forwarded again."}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"WaitForActorCreation: this state applies to Actor Tasks and means that an Actor method is waiting for the Actor's creation to complete."}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Waiting: the Task is waiting for its argument dependencies to be satisfied, that is, for its arguments to be placed in the local object store."}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Ready: the Task can run; all of its arguments have been transferred to the local object store."}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Running: the Task has been assigned to and is running on a 
worker."}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Blocked: the Task is paused, possibly because it has launched other Tasks and is waiting for their results."}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Infeasible: no machine can satisfy the resources the Task requires."}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Swap: a transitional state between two other states. For example, a Ready Task that has been sent to a worker and is awaiting a reply is in the Swap state. If the Worker accepts the Task, its state becomes Running; otherwise it returns to Ready."}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The design_docs/task_states.rst document includes a diagram of the Task state transitions:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/6a/6ab60baa874e1f371243278540e7229b.png","alt":null,"title":"","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"At the end of SubmitTask:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"cpp"},"content":[{"type":"text","text":"// if the task was forwarded.\n\nif (forwarded) {\n  // Check for local dependencies and enqueue as waiting or ready for dispatch.\n} else {\n  // (See design_docs/task_states.rst for the state transition diagram.)\n  local_queues_.QueueTasks({task}, TaskState::PLACEABLE);\n  
ScheduleTasks(cluster_resource_map_);\n}"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"If the submitted Task was forwarded (that is, it arrived through HandleForwardTask), it is enqueued locally. On enqueue, if all of its arguments are ready locally, the Task is queued in the READY state; otherwise it is queued in the WAITING state. In both cases the Task is also marked as pending."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"cpp"},"content":[{"type":"text","text":"if (args_ready) {\n  local_queues_.QueueTasks({task}, TaskState::READY);\n  DispatchTasks(MakeTasksByClass({task}));\n} else {\n  local_queues_.QueueTasks({task}, TaskState::WAITING);\n}\n\ntask_dependency_manager_.TaskPending(task);"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Scheduling Policy"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"After SubmitTask, if the Task was not forwarded, two operations are performed:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"cpp"},"content":[{"type":"text","text":"local_queues_.QueueTasks({task}, TaskState::PLACEABLE);\nScheduleTasks(cluster_resource_map_);"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The first queues the Task locally and sets its state to Placeable; the second schedules the Task across the cluster."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Ray has two queues for Tasks: ReadyQueue and SchedulingQueue. ReadyQueue holds Tasks that are ready to run; SchedulingQueue holds Tasks that have been submitted. Together they store Tasks in different states and implement the previously described Task 
state transitions."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Tasks are scheduled as follows:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"numberedlist","attrs":{"start":"","normalizeStart":1},"content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"First, try to place the Tasks on the Local Node"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"If the Local Node has resources"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","text":"If no suitable compute resources are available, fall back to hard allocation and assign compute resources to the Client"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As we can see, scheduling is resource-based allocation. Resources include compute and memory/data, which in Ray take the form of Workers, Tasks, and the Object Store (Arrow). So we need to understand resources before we can understand scheduling well."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Cluster Architecture"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As described above, a Ray cluster consists of modules such as the Worker, GCS, and Raylet. A Worker is an execution unit, and work is submitted to it remotely over gRPC. The overall architecture is somewhat like Istio's service mesh 
structure."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/05/05268cbb17a457ea5a50d37848a2b0af.png","alt":null,"title":"","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Broken down further, the coarse-grained components above look like this:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/41/41c5a296e3936a477def0dc472a0dc3a.png","alt":null,"title":"","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"There are several key components here. The Raylet is the key connection point between Workers and the GCS, and it also handles scheduling among local Workers. The Raylet contains the Node Manager, the basic module for message passing and scheduling, and the Object Manager, which handles reads from the local Arrow memory and is relatively easy to understand. The Core Worker component supports the Python Driver and is mainly responsible for submitting Tasks for scheduling; it is what sits behind the remote decorator you add when using Ray from Python. This is the core of Ray. The Python Driver provides Python support; Ray also has a Java Driver, which is not shown here."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Let's look at the Raylet 
first."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Raylet"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"When the Raylet is initialized, it creates a main_service, which is a "},{"type":"codeinline","content":[{"type":"text","text":"boost::asio::io_service"}]},{"type":"text","text":" instance. The asio mechanism was briefly described in the communication model section above. main_service is started in main.cc ("},{"type":"codeinline","content":[{"type":"text","text":"main_service.run()"}]},{"type":"text","text":"); a reference to main_service is passed to the Raylet, which in turn passes it to the Node Manager."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The Node Manager (NM below) is the Raylet module responsible for communication. It handles the Raylet's communication with other distributed nodes (servers), Workers, and the GCS, as well as Task assignment."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The entry point from the Raylet into the Node Manager is in HandleAccept:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"cpp"},"content":[{"type":"text","text":"ClientHandler client_handler =\n    [this](LocalClientConnection &client) { node_manager_.ProcessNewClient(client); };\nMessageHandler message_handler =\n    [this](std::shared_ptr<LocalClientConnection> client, int64_t message_type,\n           const uint8_t *message) {\n      node_manager_.ProcessClientMessage(client, message_type, message);\n    };\n// Accept a new local client and dispatch it to the node manager.\nauto new_connection = LocalClientConnection::Create(\n    client_handler, message_handler, std::move(socket_), \"worker\",\n    node_manager_message_enum,\n    
static_cast<int64_t>(protocol::MessageType::DisconnectClient));"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"client_handler handles connection requests, and message_handler handles the messages sent by that client. LocalClientConnection is an abstraction over a client's connection to the server; it mainly wraps synchronous and asynchronous reads and writes on top of asio."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"ProcessNewClient mainly records some information about the client. ProcessClientMessage is the message handling shown in the architecture diagram above, dispatching each message type to its own handling flow. See the appendix, \"List of messages handled by Node Manager\"."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The most important of these messages is SubmitTask, which handles tasks. The Task is the basic unit of scheduling and runs through the whole of Ray's distributed task scheduling, so it is worth tracing a Task from submission to completion. Along the way we can also see how a call travels from the Python Driver to the Core Worker and then to the Raylet."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"Submit Task"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A Task represents a unit of work together with information such as the resources it needs to run. A Task originates in the Python Driver:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"python"},"content":[{"type":"text","text":"@ray.remote\ndef borrower(inner_ids):\n    inner_id = inner_ids[0]\n    ray.get(foo.remote(inner_id))\n\ninner_id = ray.put(1)\nouter_id = ray.put([inner_id])\nres = 
borrower.remote(outer_id)"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In the code above, the function under the "},{"type":"codeinline","content":[{"type":"text","text":"@ray.remote"}]},{"type":"text","text":" decorator is the unit of execution. It corresponds to the RemoteFunction class. In its _remote method:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"python"},"content":[{"type":"text","text":"self._pickled_function = pickle.dumps(self._function)"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"_function is serialized into _pickled_function, which is then hashed into pickled_function_hash:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"python"},"content":[{"type":"text","text":"self._function_descriptor = PythonFunctionDescriptor.from_function(\n    self._function, self._pickled_function)\n\ndef from_function(cls, function, pickled_function):\n    pickled_function_hash = hashlib.sha1(pickled_function).hexdigest()"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"It then calls the Core Worker's SubmitTask:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"cpp"},"content":[{"type":"text","text":"Status SubmitTask(const RayFunction &function, const std::vector<TaskArg> &args, const TaskOptions &task_options, std::vector<ObjectID> *return_ids, int max_retries);"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Inside SubmitTask, a task ID is generated, and BuildCommonTaskSpec packs all of the Task's information into a TaskSpecification instance. The TaskSpecification is then submitted to the task queue:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"cpp"},"content":[{"type":"text","text":"if 
(task_options.is_direct_call) {\n task_manager_->AddPendingTask(GetCallerId(), rpc_address_, task_spec, max_retries);\n return direct_task_submitter_->SubmitTask(task_spec);\n } else {\n return local_raylet_client_->SubmitTask(task_spec);\n }"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Here Task Manager is a wrapper around task management. It holds the in-memory store "},{"type":"codeinline","content":[{"type":"text","text":"in_memory_store_"}]},{"type":"text","text":", the reference counter "},{"type":"codeinline","content":[{"type":"text","text":"reference_counter_"}]},{"type":"text","text":" (mainly used to manage ObjectIDs for GC), task state, retry counts, and so on. "},{"type":"codeinline","content":[{"type":"text","text":"task_manager_->AddPendingTask"}]},{"type":"text","text":" records the Task ID before the Task is submitted, so that reference_counter_ can use it later for GC."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"codeinline","content":[{"type":"text","text":"is_direct_call"}]},{"type":"text","text":" marks a direct call to an Actor worker. "},{"type":"codeinline","content":[{"type":"text","text":"local_raylet_client_"}]},{"type":"text","text":" is the Raylet Client mentioned above: the Core Worker submits the remote calls it receives to the Raylet, and the Raylet does the scheduling. This is the part marked in red in the figure below:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/af/aff454c3848af774021378bc44ff9be1.png","alt":null,"title":"","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"When the Raylet's Node Manager receives a SubmitTask message, it submits Tasks in dependency order: if Task B depends on Task A, Task A is submitted first."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"If the task is submitted to another Node (this depends on Lineage scheduling; forwarded is True, where forwarded is the last parameter of SubmitTask), an UncommittedLineage is added to the Lineage Cache:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"cpp"},"content":[{"type":"text","text":"lineage_cache_.AddUncommittedLineage(task_id, uncommitted_lineage)"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The second argument is a Lineage instance generated when SubmitTask is called."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"If the task is submitted locally (forwarded is False, the default), the task is committed to the GCS asynchronously:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"cpp"},"content":[{"type":"text","text":"lineage_cache_.CommitTask(task)"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"This calls the Lineage's CommitTask, which calls the Lineage's FlushTask, which in turn calls "},{"type":"codeinline","content":[{"type":"text","text":"gcs_client_->Tasks().AsyncAdd"}]},{"type":"text","text":" to write the Task state to the GCS. Based on the returned status, the Task's state in the local Lineage is updated to GcsStatus::COMMITTED. The Task is then evicted (Evict Task) and unsubscribed (UnSubscribeTask)."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Once the Lineage Cache has been handled, SubmitTask adds the Task to the local Task Queue and then calls ScheduleTasks to schedule it."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The flow of the whole process is shown below:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/c6/c609c08c84626ec2a603c2736deff8dd.png","alt":null,"title":"","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"In DispatchTasks, tasks are dispatched in class order. This avoids the case where one task gets stuck while running and causes Ray to start multiple workers to execute it. The problem is described in "},{"type":"link","attrs":{"href":"https://github.com/ray-project/ray/issues/3644","title":null},"content":[{"type":"text","text":"#3644"}]},{"type":"text","text":"."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"References"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https://arxiv.org/abs/1712.05889","title":null},"content":[{"type":"text","text":"Ray Paper"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/b6/b6d914f927f1a87a419fa2dfd27100a1.png","alt":null,"title":"","style":[{"key":"width","value":"100%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}}]}