Hadoop job hangs

When running a MapReduce job, Hadoop hangs at the line mapreduce.Job: Running job: job_1477030467429_0002 and makes no further progress.

Approach 1: the hang may be caused by a misconfigured ResourceManager or NodeManager.

Check the yarn-site.xml file (yarn.resourcemanager.hostname sets where the ResourceManager runs) and the slaves file (which lists the NodeManager hosts).

Possible errors:

  yarn.resourcemanager.hostname is set to the wrong host;

  the slaves file does not list the worker host (here the namenode host itself, since everything runs on one node);

  /etc/hosts is misconfigured (e.g. the hostname resolves to the wrong IP).
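A minimal sketch of what correct settings look like; hadoop-senior01 is the hostname taken from the df -h output further down, and the IP below is a placeholder for your own:

yarn-site.xml:

<property>
    <name>yarn.resourcemanager.hostname</name>
    <value>hadoop-senior01</value>
</property>

slaves (one NodeManager host per line; in pseudo-distributed mode, the same single host):

hadoop-senior01

/etc/hosts (the hostname should map to the node's real IP, not 127.0.0.1):

192.168.1.101   hadoop-senior01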

 

Approach 2: the "insufficient memory" problem (which turns out to be a disk-space problem):

Symptom: Memory Total in the ResourceManager web UI shows 0;
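A quick way to confirm this from the command line, assuming the ResourceManager web UI is on its default port 8088 (hostname taken from the df -h output below):

# totalMB / availableMB in the response mirror the Memory Total /
# Memory Available figures shown in the web UI
curl http://hadoop-senior01:8088/ws/v1/cluster/metrics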

NodeManager log:

2016-10-29 10:28:30,433 WARN org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection: Directory /opt/cdh/hadoop-2.5.0-cdh5.3.6/data/tmp/nm-local-dir error, used space above threshold of 90.0%, removing from list of valid directories

2016-10-29 10:28:30,433 WARN org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection: Directory /opt/cdh/hadoop-2.5.0-cdh5.3.6/logs/userlogs error, used space above threshold of 90.0%, removing from list of valid directories

2016-10-29 10:28:30,433 INFO org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService: Disk(s) failed: 1/1 local-dirs are bad: /opt/cdh/hadoop-2.5.0-cdh5.3.6/data/tmp/nm-local-dir; 1/1 log-dirs are bad: /opt/cdh/hadoop-2.5.0-cdh5.3.6/logs/userlogs

2016-10-29 10:28:30,435 ERROR org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService: Most of the disks failed. 1/1 local-dirs are bad: /opt/cdh/hadoop-2.5.0-cdh5.3.6/data/tmp/nm-local-dir; 1/1 log-dirs are bad: /opt/cdh/hadoop-2.5.0-cdh5.3.6/logs/userlogs

2016-10-29 10:28:31,204 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Starting resource-monitoring for container_1477706304072_0002_01_000001

2016-10-29 10:28:31,558 WARN org.apache.hadoop.yarn.util.ProcfsBasedProcessTree: Unexpected: procfs stat file is not in the expected format for process with pid 3202

2016-10-29 10:28:31,581 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 11809 for container-id container_1477706304072_0002_01_000001: 33.0 MB of 2 GB physical memory used; 1.6 GB of 4.2 GB virtual memory used

2016-10-29 10:28:31,910 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1477706304072_0002_01_000001 transitioned from RUNNING to KILLING

2016-10-29 10:28:31,910 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Cleaning up container container_1477706304072_0002_01_000001

2016-10-29 10:28:31,954 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code from container container_1477706304072_0002_01_000001 is : 143

2016-10-29 10:28:31,989 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1477706304072_0002_01_000001 transitioned from KILLING to CONTAINER_CLEANEDUP_AFTER_KILL

2016-10-29 10:28:31,989 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : /opt/cdh/hadoop-2.5.0-cdh5.3.6/data/tmp/nm-local-dir/usercache/kequan/appcache/application_1477706304072_0002/container_1477706304072_0002_01_000001

2016-10-29 10:28:31,991 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=kequan       OPERATION=Container Finished - Killed   TARGET=ContainerImpl    RESULT=SUCCESS      APPID=application_1477706304072_0002    CONTAINERID=container_1477706304072_0002_01_000001

2016-10-29 10:28:31,991 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1477706304072_0002_01_000001 transitioned from CONTAINER_CLEANEDUP_AFTER_KILL to DONE

2016-10-29 10:28:31,992 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Removing container_1477706304072_0002_01_000001 from application application_1477706304072_0002

2016-10-29 10:28:31,992 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: Considering container container_1477706304072_0002_01_000001 for log-aggregation

2016-10-29 10:28:31,992 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got event CONTAINER_STOP for appId application_1477706304072_0002

2016-10-29 10:28:32,915 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Removed completed containers from NM context: [container_1477706304072_0002_01_000001]

2016-10-29 10:28:34,582 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Stopping resource-monitoring for container_1477706304072_0002_01_000001

 

Analysis: "used space above threshold of 90.0%" means disk usage has exceeded 90%. MapReduce jobs consume a lot of local disk; once usage crosses the threshold, the NodeManager marks its local-dirs and log-dirs as bad and kills the running containers (hence the exit code 143 above).
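To confirm which filesystem is over the threshold, point df at the directories named in the log:

df -h /opt/cdh/hadoop-2.5.0-cdh5.3.6/data/tmp/nm-local-dir \
      /opt/cdh/hadoop-2.5.0-cdh5.3.6/logs/userlogs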

  Method 1: raise the maximum allowed disk utilization to 95%. Add the following property to yarn-site.xml (this treats the symptom, not the cause: while an MR job runs, disk usage may still climb past 95%):

<property>
    <name>yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage</name>
    <value>95.0</value>
</property>
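For the new threshold to take effect, restart the NodeManager; a sketch assuming Hadoop lives at the install path shown in the logs:

cd /opt/cdh/hadoop-2.5.0-cdh5.3.6
sbin/yarn-daemon.sh stop nodemanager
sbin/yarn-daemon.sh start nodemanager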

 

Method 2: delete unneeded files to free disk space.

        Run df -h on the command line to see how much space each filesystem is using:

[root@hadoop-senior01 modules]# df -h

Filesystem            Size  Used Avail Use% Mounted on

/dev/sda2              18G   14G  3.4G  80% /

tmpfs                 1.9G  372K  1.9G   1% /dev/shm

/dev/sda1             291M   37M  240M  14% /boot

     Here /dev/sda2 is the root filesystem (mounted on /), at 80% usage; delete unused files or uninstall unneeded software on it to free space.
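To find what is eating the space, standard GNU tools are enough (nothing Hadoop-specific assumed):

# largest top-level directories on the root filesystem;
# -x stays on one filesystem, sort -rh orders by human-readable size
du -xh --max-depth=1 / 2>/dev/null | sort -rh | head -n 15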

 

For the "Most of the disks failed. 1/1 local-dirs are bad ..." error above, see also: http://www.cnblogs.com/tnsay/p/5917459.html

Method 3: if this is a virtual machine, grow its virtual disk; on a physical machine, add a disk.

    Expanding a virtual machine's disk: http://www.2cto.com/os/201405/301879.html
