JPPF Issue Tracker
CLOSED  Bug report JPPF-344  -  Server deadlock with many slave nodes
Posted Oct 26, 2014 - updated Aug 15, 2018
This issue has been closed with status "Closed" and resolution "RESOLVED".
Issue details
  • Type of issue
    Bug report
  • Status
    Closed
  • Assigned to
     lolo4j
  • Type of bug
    Not triaged
  • Likelihood
    Not triaged
  • Effect
    Not triaged
  • Posted by
     lolo4j
  • Owned by
    Not owned by anyone
  • Category
    Server
  • Resolution
    RESOLVED
  • Priority
    Normal
  • Reproducibility
    Always
  • Severity
    Normal
  • Targeted for
    JPPF 4.2.x
Issue description
While conducting a test with 1 master node and 40 slave nodes, I get the following deadlock:
Deadlock detected
 
- thread id 116 "JobManager-0001" is waiting to lock java.util.concurrent.locks.ReentrantLock$NonfairSync@2fdd2a9e which is held by thread id 18 "TaskQueueChecker"
- thread id 18 "TaskQueueChecker" is waiting to lock java.util.concurrent.locks.ReentrantLock$NonfairSync@7870d4a7 which is held by thread id 27 "NodeJobServer-0001"
- thread id 27 "NodeJobServer-0001" is waiting to lock java.util.LinkedHashSet@37995db7 which is held by thread id 18 "TaskQueueChecker"
 
Stack trace information for the threads listed above
 
"JobManager-0001" - 116 - state: WAITING - blocked count: 19 - blocked time: 68 - wait count: 15 - wait time: 569908
  at sun.misc.Unsafe.park(Native Method)
  - waiting on java.util.concurrent.locks.ReentrantLock$NonfairSync@2fdd2a9e
  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
  at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
  at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:867)
  at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1197)
  at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:214)
  at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:290)
  at org.jppf.server.protocol.AbstractServerJobBase.getTaskCount(AbstractServerJobBase.java:106)
  at org.jppf.server.job.JobEventTask.run(JobEventTask.java:95)
  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
  at java.util.concurrent.FutureTask.run(FutureTask.java:262)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  at java.lang.Thread.run(Thread.java:744)
 
  Locked ownable synchronizers:
  - java.util.concurrent.ThreadPoolExecutor$Worker@7cac281f
 
"TaskQueueChecker" - 18 - state: WAITING - blocked count: 63 - blocked time: 62 - wait count: 47592 - wait time: 622017
  at sun.misc.Unsafe.park(Native Method)
  - waiting on java.util.concurrent.locks.ReentrantLock$NonfairSync@7870d4a7
  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
  at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
  at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:867)
  at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1197)
  at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:214)
  at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:290)
  at org.jppf.nio.StateTransitionManager.transitionChannel(StateTransitionManager.java:135)
  at org.jppf.nio.StateTransitionManager.transitionChannel(StateTransitionManager.java:122)
  at org.jppf.server.nio.nodeserver.AbstractNodeContext.submit(AbstractNodeContext.java:460)
  at org.jppf.server.nio.nodeserver.TaskQueueChecker.dispatchJobToChannel(TaskQueueChecker.java:309)
  - locked org.jppf.nio.SelectionKeyWrapper@6ccbdf89
  at org.jppf.server.nio.nodeserver.TaskQueueChecker.dispatch(TaskQueueChecker.java:248)
  - locked java.util.LinkedHashSet@37995db7
  at org.jppf.server.nio.nodeserver.TaskQueueChecker.run(TaskQueueChecker.java:218)
  at java.lang.Thread.run(Thread.java:744)
 
  Locked ownable synchronizers:
  - java.util.concurrent.locks.ReentrantLock$NonfairSync@2fdd2a9e
 
"NodeJobServer-0001" - 27 - state: BLOCKED - blocked count: 12 - blocked time: 569705 - wait count: 99 - wait time: 48230
  at org.jppf.server.nio.nodeserver.TaskQueueChecker.addIdleChannel(TaskQueueChecker.java:150)
  - waiting on java.util.LinkedHashSet@37995db7
  at org.jppf.server.nio.nodeserver.NodeNioServer.updateConnectionStatus(NodeNioServer.java:258)
  at org.jppf.server.nio.nodeserver.NodeNioServer.access$1(NodeNioServer.java:253)
  at org.jppf.server.nio.nodeserver.NodeNioServer$1.executionStatusChanged(NodeNioServer.java:109)
  at org.jppf.server.nio.nodeserver.AbstractNodeContext.fireExecutionStatusChanged(AbstractNodeContext.java:503)
  at org.jppf.server.nio.nodeserver.AbstractNodeContext.setState(AbstractNodeContext.java:378)
  at org.jppf.server.nio.nodeserver.AbstractNodeContext.setState(AbstractNodeContext.java:1)
  at org.jppf.nio.StateTransitionManager.transitionChannel(StateTransitionManager.java:150)
  - locked org.jppf.nio.SelectionKeyWrapper@75121e04
  at org.jppf.nio.StateTransitionTask.run(StateTransitionTask.java:83)
  - locked org.jppf.nio.SelectionKeyWrapper@75121e04
  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
  at java.util.concurrent.FutureTask.run(FutureTask.java:262)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  at java.lang.Thread.run(Thread.java:744)
 
  Locked ownable synchronizers:
  - java.util.concurrent.locks.ReentrantLock$NonfairSync@7870d4a7
  - java.util.concurrent.ThreadPoolExecutor$Worker@486f03ec
See full thread dump attached.
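The cycle in the dump comes from a lock-order inversion between a ReentrantLock and an intrinsic monitor: "TaskQueueChecker" takes the LinkedHashSet monitor first and then a channel's ReentrantLock, while "NodeJobServer-0001" takes the ReentrantLock first and then tries to enter the same monitor. A minimal standalone sketch of that pattern (class and field names here are illustrative, not actual JPPF code) reproduces the inversion with two daemon threads and detects it via ThreadMXBean, the same mechanism the "Deadlock detected" output above relies on:

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantLock;

public class DeadlockDemo {
  static final ReentrantLock channelLock = new ReentrantLock(); // stands in for the channel's ReentrantLock
  static final Object idleSet = new Object();                   // stands in for the LinkedHashSet of idle channels

  static int detectDeadlockCount() throws InterruptedException {
    CountDownLatch bothHeld = new CountDownLatch(2);

    // mimics "TaskQueueChecker": monitor first, then ReentrantLock
    Thread checker = new Thread(() -> {
      synchronized (idleSet) {
        bothHeld.countDown();
        await(bothHeld);
        channelLock.lock(); // blocks forever: jobServer holds it
      }
    }, "checker");

    // mimics "NodeJobServer": ReentrantLock first, then monitor
    Thread jobServer = new Thread(() -> {
      channelLock.lock();
      bothHeld.countDown();
      await(bothHeld);
      synchronized (idleSet) { } // blocks forever: checker holds it
    }, "jobServer");

    checker.setDaemon(true);   // daemon threads so the JVM can still exit
    jobServer.setDaemon(true);
    checker.start();
    jobServer.start();
    bothHeld.await();          // both threads now hold their first lock
    Thread.sleep(500);         // give both time to block on the second lock

    // finds cycles over both object monitors and ownable synchronizers (ReentrantLock)
    long[] ids = ManagementFactory.getThreadMXBean().findDeadlockedThreads();
    return ids == null ? 0 : ids.length;
  }

  static void await(CountDownLatch latch) {
    try { latch.await(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
  }

  public static void main(String[] args) throws Exception {
    System.out.println("deadlocked threads: " + detectDeadlockCount());
  }
}
```

The usual fix for this class of bug is to impose a single acquisition order on the two locks, or to release the first lock before acquiring the second.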
Steps to reproduce this issue
  • start 1 driver configured with the nodethreads load balancer, 1 node (the test machine has 8 cores) and 1 admin console
  • from the admin console, provision 40 slave nodes
  • run a client which submits a job with 1000 tasks
  • ==> before the job can complete, the deadlock shown in the description occurs

#2
Comment posted by
 lolo4j
Oct 26, 10:46
A file was uploaded: full thread dump
#4
Comment posted by
 lolo4j
Oct 26, 17:49
fixed in