Container killed by YARN for exceeding memory limits. 10.4 GB of 10.4 GB physical memory used

I hit this error when running a Spark job. I first set --executor-memory to 10G, then raised it to 20G and then 30G, but the same error kept coming back.

1. One solution

Most of the advice online is to increase spark.yarn.executor.memoryOverhead. I first set it to 2048, then 4096, and finally pushed it all the way to 15G (while lowering executor-memory to 20G), after which the error stopped.

But I was still puzzled: why does this actually work?

The first thing that is certain is that increasing spark.yarn.executor.memoryOverhead is effective.

The spark.yarn.XXX.memoryOverhead properties (XXX being executor, driver, or am) determine how much extra off-heap memory is requested from YARN, on top of the heap, for each executor, driver, or ApplicationMaster; the default is max(384, 0.07 * spark.executor.memory).
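A minimal sketch of raising the overhead when building the application, assuming the spark.yarn.executor.memoryOverhead property name used above (newer Spark releases call it spark.executor.memoryOverhead); the 20g and 4096 values simply mirror the numbers tried above and are not a recommendation. The same setting can also be passed to spark-submit with --conf.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MemoryOverheadSketch {
  def main(args: Array[String]): Unit = {
    // Ask YARN for extra off-heap headroom per executor so the container limit
    // (executor heap + overhead) is not breached.
    val conf = new SparkConf()
      .setAppName("memory-overhead-sketch")
      .set("spark.executor.memory", "20g")
      // without this, the default is max(384, 0.07 * spark.executor.memory) per the text above
      .set("spark.yarn.executor.memoryOverhead", "4096") // in MB

    val sc = new SparkContext(conf)
    // ... job logic ...
    sc.stop()
  }
}
```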

2. Another solution

I then found another blog post describing the same problem: "We figured out that the physical memory usage was quite low on the VMs but the virtual memory usage was extremely high. We set yarn.nodemanager.vmem-check-enabled in yarn-site.xml to false and our containers were no longer killed, and the application appeared to work as expected."

In other words, the executors' physical memory usage was actually low while their virtual memory usage was extremely high; after setting yarn.nodemanager.vmem-check-enabled to false in yarn-site.xml, the problem went away.

Since on CentOS/RHEL 6 there is aggressive allocation of virtual memory due to OS behavior, you should disable the virtual memory checker or increase yarn.nodemanager.vmem-pmem-ratio to a relatively larger value.

That is, this is really an OS-level issue: the operating system allocates virtual memory quite liberally, so you should either turn off the virtual memory check or raise yarn.nodemanager.vmem-pmem-ratio.
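For reference, the corresponding yarn-site.xml entries on each NodeManager would look roughly like this; only one of the two changes is needed, the ratio of 4 is just an illustrative value, and NodeManagers need to be restarted to pick up the change.

```xml
<!-- yarn-site.xml on every NodeManager node -->

<!-- Option A: disable the virtual memory check entirely -->
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
</property>

<!-- Option B: keep the check, but allow more virtual memory per MB of physical memory -->
<property>
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>4</value>
</property>
```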

3. The cause

In yet another blog post:

Virtual/physical memory checker

NodeManager can monitor the memory usage (virtual and physical) of the container. If its virtual memory exceeds "yarn.nodemanager.vmem-pmem-ratio" times "mapreduce.reduce.memory.mb" or "mapreduce.map.memory.mb", then the container will be killed if "yarn.nodemanager.vmem-check-enabled" is true;

If its physical memory exceeds "mapreduce.reduce.memory.mb" or "mapreduce.map.memory.mb", the container will be killed if “yarn.nodemanager.pmem-check-enabled” is true.

These checks are controlled by the parameters yarn.nodemanager.vmem-check-enabled, yarn.nodemanager.pmem-check-enabled, and yarn.nodemanager.vmem-pmem-ratio, which can be set in yarn-site.xml on each NodeManager node to override the default behavior.

This is a sample error for a container killed by the virtual memory checker:

Current usage: 347.3 MB of 1 GB physical memory used;
2.2 GB of 2.1 GB virtual memory used. Killing container.

And this is a sample error from the physical memory checker:

Current usage: 2.1gb of 2.0gb physical memory used;
1.1gb of 3.15gb virtual memory used. Killing container.
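The numbers in the first sample line up with YARN's default yarn.nodemanager.vmem-pmem-ratio of 2.1: the virtual memory cap is simply the container's physical memory request times that ratio. A tiny sketch of the arithmetic (the helper name is made up for illustration):

```scala
// Hypothetical helper illustrating how the virtual memory cap is derived.
object VmemLimit {
  // cap = physical container size (MB) * yarn.nodemanager.vmem-pmem-ratio (default 2.1)
  def vmemLimitMB(containerMemMB: Long, vmemPmemRatio: Double = 2.1): Long =
    (containerMemMB * vmemPmemRatio).toLong

  def main(args: Array[String]): Unit = {
    // A 1 GB container gets roughly a 2.1 GB virtual memory cap, which is why the
    // first sample reports "2.2 GB of 2.1 GB virtual memory used".
    println(vmemLimitMB(1024)) // 2150 MB, about 2.1 GB
  }
}
```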

In the Hadoop 2.5.1 that ships with MapR 4.1.0, the virtual memory checker is disabled while the physical memory checker is enabled by default.

Since on CentOS/RHEL 6 there is aggressive allocation of virtual memory due to OS behavior, you should disable the virtual memory checker or increase yarn.nodemanager.vmem-pmem-ratio to a relatively larger value.

If the above errors occur, it is also possible that the MapReduce job has a memory leak or that the memory for each container is simply not enough. Try to check the application logic and also tune the container memory request ("mapreduce.reduce.memory.mb" or "mapreduce.map.memory.mb").

4. Summary

1. The root cause is OS-level virtual memory allocation: the container is not actually using much physical memory, but the virtual memory check still sees an overrun, so the problem can be solved by disabling that check (yarn.nodemanager.vmem-check-enabled=false).

2. Increasing spark.yarn.executor.memoryOverhead also makes the error go away (more of a stopgap).

3. My job failed in a groupByKey stage; after adding repartition(2000) (the default shuffle has 200 partitions), the error no longer appeared, which shows that using more partitions can also resolve this error.

4. groupByKey performs far worse than reduceByKey, so consider replacing groupByKey with reduceByKey; a minimal sketch of points 3 and 4 follows after this list.
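To make points 3 and 4 concrete, here is a minimal sketch; the input/output paths, the (word, 1) pairs, and the object name are made up for illustration, and only the shuffle pattern matters:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object GroupByKeyVsReduceByKey {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("shuffle-sketch"))

    // Hypothetical input: (word, 1) pairs.
    val pairs = sc.textFile("hdfs:///tmp/input")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))

    // Original shape of the failing stage: groupByKey ships every value across
    // the network and materializes each key's whole group in one executor.
    val groupedCounts = pairs.groupByKey().mapValues(_.sum)

    // Summary point 3: more partitions, so each shuffle task holds less data.
    val repartitionedCounts = pairs.repartition(2000).groupByKey().mapValues(_.sum)

    // Summary point 4: reduceByKey combines values map-side before the shuffle,
    // moving far less data and keeping per-key state small.
    val reducedCounts = pairs.reduceByKey(_ + _, 2000)

    reducedCounts.saveAsTextFile("hdfs:///tmp/output")
    sc.stop()
  }
}
```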

 

 
