記一次排查NN CPU過高線程卡住問題

背景

在使用 hadoop fs -ls /xxxhadoop fs -du -h /xxx時,出現特別卡頓的情況。懷疑namenode機器大CPU進程佔用。由此開始排查之旅。

過程

登錄nn機器(active和standby都要看),查看top數。
關於top的使用
通常使用top H,H表示顯示進程和線程,默認只看進程。

top H
   PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
134105 hadoop    20   0  0.220t 0.156t   7212 R 64.1 63.3 270:32.10 java
134092 hadoop    20   0  0.220t 0.156t   7212 R 58.4 63.3 273:13.89 java
134095 hadoop    20   0  0.220t 0.156t   7212 R 57.2 63.3 275:16.29 java
134141 hadoop    20   0  0.220t 0.156t   7212 R 56.6 63.3 270:43.97 java
134123 hadoop    20   0  0.220t 0.156t   7212 R 54.7 63.3 269:13.20 java
134106 hadoop    20   0  0.220t 0.156t   7212 R 53.1 63.3 271:42.05 java
134148 hadoop    20   0  0.220t 0.156t   7212 R 52.8 63.3 272:43.75 java
134117 hadoop    20   0  0.220t 0.156t   7212 R 52.5 63.3 269:06.99 java
134135 hadoop    20   0  0.220t 0.156t   7212 R 52.2 63.3 269:10.03 java
 84753 hadoop    20   0  0.220t 0.156t   7212 R 52.2 63.3   1:17.63 java
134111 hadoop    20   0  0.220t 0.156t   7212 R 51.2 63.3 269:20.19 java
134147 hadoop    20   0  0.220t 0.156t   7212 R 51.2 63.3 270:16.40 java
134118 hadoop    20   0  0.220t 0.156t   7212 R 50.9 63.3 274:13.64 java

可以看到PID=134105的進程(或線程佔用CPU較高,測試使用,實際已降)

通過ps --forest 可以看父進程

分析jstack -l pid找到線程棧:

2019-12-12 17:47:33
Full thread dump Java HotSpot(TM) 64-Bit Server VM (25.77-b03 mixed mode):

"Attach Listener" #174318 daemon prio=9 os_prio=0 tid=0x00007f614a1ac000 nid=0x21985 waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"qtp212683148-174317" #174317 daemon prio=5 os_prio=0 tid=0x00007f2910910800 nid=0x21783 waiting on condition [0x00007f29216cc000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00007f30c805df18> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
        at org.eclipse.jetty.util.BlockingArrayQueue.poll(BlockingArrayQueue.java:392)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.idleJobPoll(QueuedThreadPool.java:563)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.access$800(QueuedThreadPool.java:48)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:626)
        at java.lang.Thread.run(Thread.java:745)

"qtp212683148-174316" #174316 daemon prio=5 os_prio=0 tid=0x00007f6165cfd800 nid=0x216c2 runnable [0x00007f29219cf000]
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
        at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
        at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
        - locked <0x00007f30c806bbc0> (a sun.nio.ch.Util$2)
        - locked <0x00007f30c806bbd8> (a java.util.Collections$UnmodifiableSet)
        - locked <0x00007f30c80821a8> (a sun.nio.ch.EPollSelectorImpl)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:101)
        at org.eclipse.jetty.io.ManagedSelector$SelectorProducer.select(ManagedSelector.java:243)
        at org.eclipse.jetty.io.ManagedSelector$SelectorProducer.produce(ManagedSelector.java:191)
        at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:249)
        at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
        at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
        at java.lang.Thread.run(Thread.java:745)

"IPC Client (895511170) connection to bigdata-nmgv3-hdpnn03.nmg02/10.76.50.29:8485 from hadoop" #174312 daemon prio=5 os_prio=0 tid=0x00007f616690e000 nid=0x21094 in Object.wait() [0x00007f2920ac4000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:1024)
        - locked <0x00007f51d2aa11a8> (a org.apache.hadoop.ipc.Client$Connection)
        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:1068)

"IPC Client (895511170) connection to bigdata-nmgv3-hdpnn02.nmg02/10.76.60.58:8485 from hadoop" #174311 daemon prio=5 os_prio=0 tid=0x00007f29104aa800 nid=0x21093 in Object.wait() [0x00007f29229d9000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:1024)
        - locked <0x00007f51d2aa3728> (a org.apache.hadoop.ipc.Client$Connection)
        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:1068)

"IPC Client (895511170) connection to bigdata-nmgv3-hdpnn01.nmg02/10.76.32.39:8485 from hadoop" #174310 daemon prio=5 os_prio=0 tid=0x00007f6165487000 nid=0x21092 in Object.wait() [0x00007f29208c2000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:1024)
        - locked <0x00007f51d2aa5ca8> (a org.apache.hadoop.ipc.Client$Connection)
        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:1068)

"IPC Parameter Sending Thread #58519" #174272 daemon prio=5 os_prio=0 tid=0x00007f61674a9800 nid=0x1f350 waiting on condition [0x00007f29230e0000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00007f30c8048980> (a java.util.concurrent.SynchronousQueue$TransferStack)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
        at java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:460)
        at java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:362)
        at java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:941)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1066)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

"IPC Parameter Sending Thread #58518" #174271 daemon prio=5 os_prio=0 tid=0x00007f2919de0800 nid=0x1f34e waiting on condition [0x00007f2922fdf000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00007f30c8048980> (a java.util.concurrent.SynchronousQueue$TransferStack)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
        at java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:460)
        at java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:362)
        at java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:941)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1066)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

"IPC Parameter Sending Thread #58517" #174270 daemon prio=5 os_prio=0 tid=0x00007f2910efc000 nid=0x1f34c waiting on condition [0x00007f2920bc5000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00007f30c8048980> (a java.util.concurrent.SynchronousQueue$TransferStack)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
        at java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:460)
        at java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:362)
        at java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:941)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1066)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

"qtp212683148-174164" #174164 daemon prio=5 os_prio=0 tid=0x00007f616719f800 nid=0x1c257 runnable [0x00007f2922ada000]
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
        at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
        at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
        - locked <0x00007f30c806f9a8> (a sun.nio.ch.Util$2)
        - locked <0x00007f30c806f9c0> (a java.util.Collections$UnmodifiableSet)
        - locked <0x00007f30c8056a08> (a sun.nio.ch.EPollSelectorImpl)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:101)
        at org.eclipse.jetty.io.ManagedSelector$SelectorProducer.select(ManagedSelector.java:243)
        at org.eclipse.jetty.io.ManagedSelector$SelectorProducer.produce(ManagedSelector.java:191)
        at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:249)
        at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
2019-12-12 17:47:33
Full thread dump Java HotSpot(TM) 64-Bit Server VM (25.77-b03 mixed mode):

"Attach Listener" #174318 daemon prio=9 os_prio=0 tid=0x00007f614a1ac000 nid=0x21985 waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"qtp212683148-174317" #174317 daemon prio=5 os_prio=0 tid=0x00007f2910910800 nid=0x21783 waiting on condition [0x00007f29216cc000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00007f30c805df18> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
        at org.eclipse.jetty.util.BlockingArrayQueue.poll(BlockingArrayQueue.java:392)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.idleJobPoll(QueuedThreadPool.java:563)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.access$800(QueuedThreadPool.java:48)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:626)
        at java.lang.Thread.run(Thread.java:745)

"qtp212683148-174316" #174316 daemon prio=5 os_prio=0 tid=0x00007f6165cfd800 nid=0x216c2 runnable [0x00007f29219cf000]
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
        at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
        at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
        - locked <0x00007f30c806bbc0> (a sun.nio.ch.Util$2)
        - locked <0x00007f30c806bbd8> (a java.util.Collections$UnmodifiableSet)
        - locked <0x00007f30c80821a8> (a sun.nio.ch.EPollSelectorImpl)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:101)
        at org.eclipse.jetty.io.ManagedSelector$SelectorProducer.select(ManagedSelector.java:243)
        at org.eclipse.jetty.io.ManagedSelector$SelectorProducer.produce(ManagedSelector.java:191)
        at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:249)
        at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
        at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
        at java.lang.Thread.run(Thread.java:745)

將上述pid轉成16進制:

printf %x 134105
20bd9

在jstack中,我們搜索20bd9找到了這一條記錄:

"IPC Server handler 19 on 8020" #138 daemon prio=5 os_prio=0 tid=0x00007f6165d9d000 nid=0x20bd9 runnable [0x00007f2929b48000]
   java.lang.Thread.State: RUNNABLE
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1282)
        at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
        at org.apache.hadoop.net.NetworkTopology.getNode(NetworkTopology.java:263)
        at org.apache.hadoop.net.NetworkTopology.countNumOfAvailableNodes(NetworkTopology.java:678)
        at org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:533)
        at org.apache.hadoop.hdfs.net.DFSNetworkTopology.chooseRandomWithStorageTypeTwoTrial(DFSNetworkTopology.java:122)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseDataNode(BlockPlacementPolicyDefault.java:901)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:798)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyRackFaultTolerant.chooseOnce(BlockPlacementPolicyRackFaultTolerant.java:224)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyRackFaultTolerant.chooseTargetInOrder(BlockPlacementPolicyRackFaultTolerant.java:94)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:438)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:310)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:149)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:174)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:2239)
        at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:294)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2802)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:876)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:567)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:882)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2709)

可以定位到在dn選擇塊的時候耗時。

同時我們看一下namenode日誌,找一下WRAN的,或者exception的日誌:

agePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) All required storage types are unavailable:  unavailableStorages=[DISK], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}
2019-12-12 17:38:04,818 INFO org.apache.hadoop.ipc.Server: IPC Server handler 33 on 8020, call Call#221 Retry#0 org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 10.76.57.59:32800
java.io.IOException: File /erasurecoding_backup/hdp/g_swan/prod/swan/china/dw_online/publiclog/gsdriverloc/2019/11/21/16/.distcp.tmp.attempt_1561732060733_7707095_m_000070_0 could only be written to 4 of the 6 required nodes for RS-6-3-1024k. There are 1701 datanode(s) running and 1701 node(s) are excluded in this operation.
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:2254)
        at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:294)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2802)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:876)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:567)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:882)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2709)

可以定位到錯誤。是distcp的時候出現了錯誤。

發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章