背景
在使用 hadoop fs -ls /xxx
或 hadoop fs -du -h /xxx
時,出現特別卡頓的情況。懷疑namenode機器大CPU進程佔用。由此開始排查之旅。
過程
登錄nn機器(active和standby都要看),查看top數。
關於top的使用
通常使用top H
,H表示顯示進程和線程,默認只看進程。
top H
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
134105 hadoop 20 0 0.220t 0.156t 7212 R 64.1 63.3 270:32.10 java
134092 hadoop 20 0 0.220t 0.156t 7212 R 58.4 63.3 273:13.89 java
134095 hadoop 20 0 0.220t 0.156t 7212 R 57.2 63.3 275:16.29 java
134141 hadoop 20 0 0.220t 0.156t 7212 R 56.6 63.3 270:43.97 java
134123 hadoop 20 0 0.220t 0.156t 7212 R 54.7 63.3 269:13.20 java
134106 hadoop 20 0 0.220t 0.156t 7212 R 53.1 63.3 271:42.05 java
134148 hadoop 20 0 0.220t 0.156t 7212 R 52.8 63.3 272:43.75 java
134117 hadoop 20 0 0.220t 0.156t 7212 R 52.5 63.3 269:06.99 java
134135 hadoop 20 0 0.220t 0.156t 7212 R 52.2 63.3 269:10.03 java
84753 hadoop 20 0 0.220t 0.156t 7212 R 52.2 63.3 1:17.63 java
134111 hadoop 20 0 0.220t 0.156t 7212 R 51.2 63.3 269:20.19 java
134147 hadoop 20 0 0.220t 0.156t 7212 R 51.2 63.3 270:16.40 java
134118 hadoop 20 0 0.220t 0.156t 7212 R 50.9 63.3 274:13.64 java
可以看到PID=134105的進程(或線程佔用CPU較高,測試使用,實際已降)
通過ps --forest 可以看父進程
分析jstack -l pid
找到線程棧:
2019-12-12 17:47:33
Full thread dump Java HotSpot(TM) 64-Bit Server VM (25.77-b03 mixed mode):
"Attach Listener" #174318 daemon prio=9 os_prio=0 tid=0x00007f614a1ac000 nid=0x21985 waiting on condition [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
"qtp212683148-174317" #174317 daemon prio=5 os_prio=0 tid=0x00007f2910910800 nid=0x21783 waiting on condition [0x00007f29216cc000]
java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00007f30c805df18> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
at org.eclipse.jetty.util.BlockingArrayQueue.poll(BlockingArrayQueue.java:392)
at org.eclipse.jetty.util.thread.QueuedThreadPool.idleJobPoll(QueuedThreadPool.java:563)
at org.eclipse.jetty.util.thread.QueuedThreadPool.access$800(QueuedThreadPool.java:48)
at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:626)
at java.lang.Thread.run(Thread.java:745)
"qtp212683148-174316" #174316 daemon prio=5 os_prio=0 tid=0x00007f6165cfd800 nid=0x216c2 runnable [0x00007f29219cf000]
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
- locked <0x00007f30c806bbc0> (a sun.nio.ch.Util$2)
- locked <0x00007f30c806bbd8> (a java.util.Collections$UnmodifiableSet)
- locked <0x00007f30c80821a8> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:101)
at org.eclipse.jetty.io.ManagedSelector$SelectorProducer.select(ManagedSelector.java:243)
at org.eclipse.jetty.io.ManagedSelector$SelectorProducer.produce(ManagedSelector.java:191)
at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:249)
at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
at java.lang.Thread.run(Thread.java:745)
"IPC Client (895511170) connection to bigdata-nmgv3-hdpnn03.nmg02/10.76.50.29:8485 from hadoop" #174312 daemon prio=5 os_prio=0 tid=0x00007f616690e000 nid=0x21094 in Object.wait() [0x00007f2920ac4000]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:1024)
- locked <0x00007f51d2aa11a8> (a org.apache.hadoop.ipc.Client$Connection)
at org.apache.hadoop.ipc.Client$Connection.run(Client.java:1068)
"IPC Client (895511170) connection to bigdata-nmgv3-hdpnn02.nmg02/10.76.60.58:8485 from hadoop" #174311 daemon prio=5 os_prio=0 tid=0x00007f29104aa800 nid=0x21093 in Object.wait() [0x00007f29229d9000]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:1024)
- locked <0x00007f51d2aa3728> (a org.apache.hadoop.ipc.Client$Connection)
at org.apache.hadoop.ipc.Client$Connection.run(Client.java:1068)
"IPC Client (895511170) connection to bigdata-nmgv3-hdpnn01.nmg02/10.76.32.39:8485 from hadoop" #174310 daemon prio=5 os_prio=0 tid=0x00007f6165487000 nid=0x21092 in Object.wait() [0x00007f29208c2000]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:1024)
- locked <0x00007f51d2aa5ca8> (a org.apache.hadoop.ipc.Client$Connection)
at org.apache.hadoop.ipc.Client$Connection.run(Client.java:1068)
"IPC Parameter Sending Thread #58519" #174272 daemon prio=5 os_prio=0 tid=0x00007f61674a9800 nid=0x1f350 waiting on condition [0x00007f29230e0000]
java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00007f30c8048980> (a java.util.concurrent.SynchronousQueue$TransferStack)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:460)
at java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:362)
at java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:941)
at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1066)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
"IPC Parameter Sending Thread #58518" #174271 daemon prio=5 os_prio=0 tid=0x00007f2919de0800 nid=0x1f34e waiting on condition [0x00007f2922fdf000]
java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00007f30c8048980> (a java.util.concurrent.SynchronousQueue$TransferStack)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:460)
at java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:362)
at java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:941)
at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1066)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
"IPC Parameter Sending Thread #58517" #174270 daemon prio=5 os_prio=0 tid=0x00007f2910efc000 nid=0x1f34c waiting on condition [0x00007f2920bc5000]
java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00007f30c8048980> (a java.util.concurrent.SynchronousQueue$TransferStack)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:460)
at java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:362)
at java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:941)
at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1066)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
"qtp212683148-174164" #174164 daemon prio=5 os_prio=0 tid=0x00007f616719f800 nid=0x1c257 runnable [0x00007f2922ada000]
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
- locked <0x00007f30c806f9a8> (a sun.nio.ch.Util$2)
- locked <0x00007f30c806f9c0> (a java.util.Collections$UnmodifiableSet)
- locked <0x00007f30c8056a08> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:101)
at org.eclipse.jetty.io.ManagedSelector$SelectorProducer.select(ManagedSelector.java:243)
at org.eclipse.jetty.io.ManagedSelector$SelectorProducer.produce(ManagedSelector.java:191)
at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:249)
at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
2019-12-12 17:47:33
Full thread dump Java HotSpot(TM) 64-Bit Server VM (25.77-b03 mixed mode):
"Attach Listener" #174318 daemon prio=9 os_prio=0 tid=0x00007f614a1ac000 nid=0x21985 waiting on condition [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
"qtp212683148-174317" #174317 daemon prio=5 os_prio=0 tid=0x00007f2910910800 nid=0x21783 waiting on condition [0x00007f29216cc000]
java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00007f30c805df18> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
at org.eclipse.jetty.util.BlockingArrayQueue.poll(BlockingArrayQueue.java:392)
at org.eclipse.jetty.util.thread.QueuedThreadPool.idleJobPoll(QueuedThreadPool.java:563)
at org.eclipse.jetty.util.thread.QueuedThreadPool.access$800(QueuedThreadPool.java:48)
at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:626)
at java.lang.Thread.run(Thread.java:745)
"qtp212683148-174316" #174316 daemon prio=5 os_prio=0 tid=0x00007f6165cfd800 nid=0x216c2 runnable [0x00007f29219cf000]
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
- locked <0x00007f30c806bbc0> (a sun.nio.ch.Util$2)
- locked <0x00007f30c806bbd8> (a java.util.Collections$UnmodifiableSet)
- locked <0x00007f30c80821a8> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:101)
at org.eclipse.jetty.io.ManagedSelector$SelectorProducer.select(ManagedSelector.java:243)
at org.eclipse.jetty.io.ManagedSelector$SelectorProducer.produce(ManagedSelector.java:191)
at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:249)
at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
at java.lang.Thread.run(Thread.java:745)
將上述pid轉成16進制:
printf %x 134105
20bd9
在jstack中,我們搜索20bd9
找到了這一條記錄:
"IPC Server handler 19 on 8020" #138 daemon prio=5 os_prio=0 tid=0x00007f6165d9d000 nid=0x20bd9 runnable [0x00007f2929b48000]
java.lang.Thread.State: RUNNABLE
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1282)
at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
at org.apache.hadoop.net.NetworkTopology.getNode(NetworkTopology.java:263)
at org.apache.hadoop.net.NetworkTopology.countNumOfAvailableNodes(NetworkTopology.java:678)
at org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:533)
at org.apache.hadoop.hdfs.net.DFSNetworkTopology.chooseRandomWithStorageTypeTwoTrial(DFSNetworkTopology.java:122)
at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseDataNode(BlockPlacementPolicyDefault.java:901)
at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:798)
at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyRackFaultTolerant.chooseOnce(BlockPlacementPolicyRackFaultTolerant.java:224)
at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyRackFaultTolerant.chooseTargetInOrder(BlockPlacementPolicyRackFaultTolerant.java:94)
at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:438)
at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:310)
at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:149)
at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:174)
at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:2239)
at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:294)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2802)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:876)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:567)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:882)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2709)
可以定位到在dn選擇塊的時候耗時。
同時我們看一下namenode
日誌,找一下WRAN
的,或者exception
的日誌:
agePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) All required storage types are unavailable: unavailableStorages=[DISK], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}
2019-12-12 17:38:04,818 INFO org.apache.hadoop.ipc.Server: IPC Server handler 33 on 8020, call Call#221 Retry#0 org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 10.76.57.59:32800
java.io.IOException: File /erasurecoding_backup/hdp/g_swan/prod/swan/china/dw_online/publiclog/gsdriverloc/2019/11/21/16/.distcp.tmp.attempt_1561732060733_7707095_m_000070_0 could only be written to 4 of the 6 required nodes for RS-6-3-1024k. There are 1701 datanode(s) running and 1701 node(s) are excluded in this operation.
at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:2254)
at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:294)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2802)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:876)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:567)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:882)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2709)
可以定位到錯誤。是distcp的時候出現了錯誤。