A spark-submit error worth recording
When a Spark job fails because it runs out of memory:

1. In most cases the first instinct is that the shuffle phase used too much memory and the executors died, so you increase executor-memory and the number of cores.
2. But keep in mind that even if you request a lot of memory, the cluster may not have that much to give: the total container memory of your submission (number of executors multiplied by memory per executor) can exceed the total container size configured for YARN in Ambari.
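As a quick sanity check before submitting, the requested memory can be compared against the cluster budget. A minimal sketch; the 1 GB per-executor overhead and the 200 GB budget are assumptions for illustration (on YARN, Spark actually reserves an overhead of roughly max(384 MB, 10% of executor memory) per container):

```shell
# Rough sanity check: requested container memory vs the YARN budget.
# All numbers here are illustrative assumptions, not values from this post.
NUM_EXECUTORS=26
EXECUTOR_MEMORY_GB=8
OVERHEAD_GB=1          # assumed per-executor overhead (Spark uses max(384MB, 10%))
YARN_TOTAL_GB=200      # assumed cluster-wide container budget from Ambari

REQUESTED=$(( NUM_EXECUTORS * (EXECUTOR_MEMORY_GB + OVERHEAD_GB) ))
echo "requested ${REQUESTED} GB, budget ${YARN_TOTAL_GB} GB"
if [ "$REQUESTED" -gt "$YARN_TOTAL_GB" ]; then
  echo "over budget: reduce --num-executors or --executor-memory"
fi
```

If the requested total comes out above the budget, containers will be killed no matter how much memory you asked for.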
一、The command-line log shows:
- WARN : scheduler.ShuffleMapStage: ShuffleMapStage 0 is now unavailable on executor 8
- WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_e143_1591884070030_22633_01_000338 on host: hd123. Exit status: 134. Diagnostics: Exception from container-launch.
- ERROR cluster.YarnScheduler: Lost executor 2 on hd030.corp.yodao.com: Container marked as failed:
- WARN scheduler.TaskSetManager: Lost task 3.1 in stage 1.0 (TID 156, hd030): FetchFailed(null, shuffleId=0, mapId=-1, reduceId=3, message=org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0
As shown in the screenshot:
二、The YARN log shows (how to view YARN logs):
- Exception in thread "main" java.io.IOException: Failed to connect to /IP:port
20/06/17 19:21:02 ERROR shuffle.OneForOneBlockFetcher: Failed while starting block fetches
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:313)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:242)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:745)
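The stack trace above comes from the aggregated container logs. Assuming log aggregation is enabled on the cluster, they can be fetched with the `yarn logs` CLI; the sketch below only echoes the commands so it runs without a cluster, and the container id is the one from the WARN line earlier:

```shell
# Pull the aggregated YARN logs for the failed application.
# Commands are echoed here so this sketch runs without a cluster.
APP_ID="application_1591884070030_22633"
CMD="yarn logs -applicationId ${APP_ID}"

echo "${CMD}"   # dumps every container's stdout/stderr
# Narrow down to the container that exited with status 134:
echo "${CMD} -containerId container_e143_1591884070030_22633_01_000338"
```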
三、The application web UI

Here you can view the logs on each executor. Refresh often while the job is running, and inspect the executor logs while it is still executing.
Solution:

(Note: the total container memory of the submission (number of executors multiplied by memory per executor) must not exceed the total container size configured for YARN in Ambari.)

The original spark-submit parameters: --num-executors 26 --driver-memory 15g --executor-memory 8g --executor-cores 2

- First reduce --num-executors; if that is not enough, then also reduce --executor-memory.
- You can start from the minimum and, once the job runs successfully, gradually scale back up to a suitable configuration.
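Put together, a scaled-down first attempt might look like the following (the class name, jar path, and the reduced numbers are placeholders to adapt, not values from this post):

```shell
# Start small; scale back up gradually once the job succeeds.
# com.example.MyJob and my-job.jar are placeholders.
spark-submit \
  --master yarn \
  --num-executors 8 \
  --driver-memory 8g \
  --executor-memory 4g \
  --executor-cores 2 \
  --class com.example.MyJob \
  my-job.jar
```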
Another option, when the cluster has enough resources, is to increase some parameters instead:

- --conf spark.sql.shuffle.partitions=2048 (the default is 200)
- Adjust the YARN configuration in yarn-site.xml:
<property>
<name>yarn.nodemanager.vmem-pmem-ratio</name>
<value>10</value>
</property>
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
- And so on; I will add more as I learn about them. Feel free to share what you know.
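One note on the yarn-site.xml snippet above: YARN kills a container when its virtual memory exceeds the physical allocation times `yarn.nodemanager.vmem-pmem-ratio` (default 2.1), so raising the ratio to 10, or disabling the check entirely with `yarn.nodemanager.vmem-check-enabled=false`, prevents those kills. The arithmetic, with an assumed 8 GB container:

```shell
# YARN's per-container virtual-memory ceiling:
#   limit = physical allocation * yarn.nodemanager.vmem-pmem-ratio
PMEM_MB=8192      # assumed container allocation (8 GB)
RATIO_X10=21      # default ratio 2.1, scaled by 10 for integer math
VMEM_LIMIT_MB=$(( PMEM_MB * RATIO_X10 / 10 ))
echo "default vmem ceiling: ${VMEM_LIMIT_MB} MB"
```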