[Spark Error Troubleshooting] ERROR YarnScheduler: Lost executor 2 on url: Container container_1

Original

2020-06-16 10:03

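I recently ran into a particularly frustrating ERROR at work. After many experiments I finally solved it, so I am writing it down here.

The error log was as follows: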

20/06/10 11:19:41 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Requesting driver to remove executor 1 for reason Container container_1587136360493_239335_01_4194305 on host: xxxxxxxx.com was preempted.
20/06/10 11:19:41 ERROR YarnScheduler: Lost executor 1 on xxxxxxxxxxx.com: Container container_1587136360493_239335_01_4194305 on host: xxxxxxxxx.com was preempted.
20/06/10 11:19:41 INFO TaskSetManager: Task 8 failed because while it was being computed, its executor exited for a reason unrelated to the task. Not counting this failure towards the maximum number of failures for the task.
20/06/10 11:19:41 INFO BlockManagerMasterEndpoint: Trying to remove executor 1 from BlockManagerMaster.
20/06/10 11:19:41 INFO BlockManagerMaster: Removal of executor 1 requested
20/06/10 11:19:41 INFO YarnSchedulerBackend$YarnDriverEndpoint: Asked to remove non-existent executor 1

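I searched online for a long time without finding anything useful. Eventually I traced the problem to executor 1 using more memory than it had been allocated, which caused an OOM. The real underlying error was: java.lang.OutOfMemoryError: Java heap space

1. Problem

We were reading the files with sc.wholeTextFiles(data_path). This method is mainly intended for reading many small files, because it returns each file as a single (path, content) record. That is a fatal weakness for large files: if there is only one file, it is read in its entirety by a single executor. When that file is very large, the executor reading it easily runs into the OOM above, after which the executor was lost and reported as preempted.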
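The difference in record granularity can be illustrated without a cluster. The sketch below is a plain-Python analogy, not Spark code: `read_whole`, `iter_lines` and `demo` are hypothetical helpers that only mimic the shape of the records (one whole-file record versus one record per line), using a small temporary file.

```python
import os
import tempfile


def read_whole(path):
    """Mimic one wholeTextFiles-style record: (path, entire file content)."""
    with open(path, encoding="utf-8") as f:
        return path, f.read()


def iter_lines(path):
    """Mimic textFile-style records: one line at a time, never the whole file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield line.rstrip("\n")


def demo(num_rows=1000):
    """Return (size of the single whole-file record, size of the longest line record)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "big.txt")
        with open(path, "w", encoding="utf-8") as f:
            for i in range(num_rows):
                f.write(f"row-{i:04d}\n")
        _, content = read_whole(path)
        longest_line = max(len(line) for line in iter_lines(path))
        return len(content), longest_line
```

Calling `demo()` shows that the whole-file record grows with the file, while the largest line-based record stays the size of one line. The same asymmetry is what makes a single huge file risky when it is read as one record.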

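2. Solution

The best fix is to switch from wholeTextFiles to textFile.
First, let's look at the parameters of textFile in pyspark (https://www.cnblogs.com/wenBlog/p/6323678.html):
textFile(name, minPartitions=None, use_unicode=True)
As you can see, besides the path there is a minPartitions parameter, the minimum number of partitions. We can set minPartitions to 1000, which means at least 1000 tasks will read the files under that path. Even if there is a large file, it is split into many chunks across those tasks, so each chunk takes relatively little memory to read.
The actual command: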

sc.textFile(data_path, 1000)
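To get a rough feel for why more partitions help, the arithmetic can be sketched as follows. This is a back-of-the-envelope estimate, not something Spark reports: `approx_bytes_per_partition` is a hypothetical helper, the 10 GiB file size is an invented example, and the even split assumes the file is splittable (for instance, a gzip-compressed file is not, and minPartitions is only a lower-bound hint).

```python
def approx_bytes_per_partition(file_bytes, num_partitions):
    """Rough bytes each task reads, assuming the file splits evenly (rounded up)."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    # Integer ceiling division avoids floating-point rounding on large sizes.
    return (file_bytes + num_partitions - 1) // num_partitions


# Hypothetical 10 GiB input file.
file_bytes = 10 * 1024**3

single_task = approx_bytes_per_partition(file_bytes, 1)        # whole file in one task
thousand_tasks = approx_bytes_per_partition(file_bytes, 1000)  # about 10 MiB per task
```

Under these assumptions a single task would have to handle the full 10 GiB, while with 1000 partitions each task reads roughly 10 MiB, which is far easier to fit in executor memory.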

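Here 1000 tasks are used to read the large file. Be careful to keep the terms apart: each executor can run many tasks, and using 1000 tasks to read the file does not mean that 1000 tasks read it in parallel. The number of tasks an executor runs concurrently is set by the executor-cores parameter.

For more detailed Spark tuning guidance, see: Spark Performance Tuning Guide: Basics (https://tech.meituan.com/2016/04/29/spark-tuning-basic.html) and Spark Performance Tuning Guide: Advanced (https://tech.meituan.com/2016/05/12/spark-tuning-pro.html)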
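The distinction between the number of tasks and the number of tasks running at once can be made concrete with a small calculation. This is a simplified sketch: `concurrent_task_slots` and `task_waves` are hypothetical helpers, the figures of 10 executors with 4 cores each are invented for illustration, and it assumes every task needs one core (the default for spark.task.cpus) and all tasks take about the same time.

```python
def concurrent_task_slots(num_executors, executor_cores, task_cpus=1):
    """Tasks that can run at the same time across all executors."""
    return num_executors * (executor_cores // task_cpus)


def task_waves(num_tasks, num_executors, executor_cores, task_cpus=1):
    """Number of rounds needed to run all tasks, rounded up."""
    slots = concurrent_task_slots(num_executors, executor_cores, task_cpus)
    if slots < 1:
        raise ValueError("no task slots available")
    return (num_tasks + slots - 1) // slots


# Hypothetical cluster: 10 executors, each with executor-cores = 4.
slots = concurrent_task_slots(10, 4)  # 40 tasks run at once
waves = task_waves(1000, 10, 4)       # 1000 tasks finish in 25 rounds
```

With these example numbers, only 40 of the 1000 read tasks are in flight at any moment, so memory pressure per executor depends on the partition size and on executor-cores, not on the total task count.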
