JPPF Issue Tracker
CLOSED  Bug report JPPF-90  -  Job never ends after reconnection
Posted Oct 26, 2012 - updated Nov 15, 2012
This issue has been closed with status "Closed" and resolution "RESOLVED".
Issue details
  • Type of issue
    Bug report
  • Status
    Closed
  • Assigned to
     lolo4j
  • Progress
    100%
  • Type of bug
    Not triaged
  • Likelihood
    Not triaged
  • Effect
    Not triaged
  • Posted by
     jandam
  • Owned by
    Not owned by anyone
  • Category
    Core
  • Resolution
    RESOLVED
  • Priority
    Critical
  • Reproducibility
    Often
  • Severity
    Critical
  • Targeted for
    JPPF 3.2
Issue description
After reconnection, some tasks stay on the node and waitForResults never returns.
Steps to reproduce this issue
Simulate a ConnectException while a job is being computed.

#5
Comment posted by
 lolo4j
Nov 15, 04:46
In JPPFNode.perform(), we're missing exception handling for the case where a socket error occurs while writing the results of a job. When this happens, the class loader connection is not reinitialized and the node just hangs, because all subsequent class loading requests fail with an NPE. I've added the following:
catch(SocketException e)
{
  log.error(e.getMessage(), e);
  reset(true);
  throw new JPPFNodeReconnectionNotification(e);
}
After this, the node reconnects properly after the driver is restarted, and the job is resubmitted and ends normally.
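
To illustrate the pattern outside of the JPPF code base, here is a minimal, self-contained sketch. All names in it (ReconnectSketch, Node, ReconnectionNotification, BrokenStream) are hypothetical stand-ins and not the actual JPPF classes; it only shows a write failure being turned into a reset plus a reconnection signal.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.SocketException;

public class ReconnectSketch {
  /** Hypothetical stand-in for JPPFNodeReconnectionNotification. */
  static class ReconnectionNotification extends RuntimeException {
    ReconnectionNotification(Throwable cause) { super(cause); }
  }

  /** Hypothetical, heavily simplified stand-in for the node. */
  static class Node {
    boolean classLoaderConnectionReset = false;

    /** Stand-in for reset(true): mark the class loader connection as reinitialized. */
    void reset() { classLoaderConnectionReset = true; }

    /** Writes the results; a socket error triggers a reset and a reconnection signal. */
    void writeResults(OutputStream out, byte[] results) throws IOException {
      try {
        out.write(results);
        out.flush();
      } catch (SocketException e) {
        reset();
        throw new ReconnectionNotification(e);
      }
    }
  }

  /** Simulates a broken connection: every write fails with a SocketException. */
  static class BrokenStream extends OutputStream {
    @Override public void write(int b) throws IOException {
      throw new SocketException("Connection reset");
    }
  }

  public static void main(String[] args) throws IOException {
    Node node = new Node();
    try {
      node.writeResults(new BrokenStream(), new byte[] {1, 2, 3});
    } catch (ReconnectionNotification e) {
      // The caller can now start its reconnection logic.
      System.out.println("reconnect requested, reset=" + node.classLoaderConnectionReset);
    }
  }
}
```

Without the catch clause, the SocketException would propagate while the class loader connection stays in its broken state, which matches the hang described above.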

There is however one remaining problem: the socket error is only detected when the results are written. This means the node may wait a very long time, until the execution of the current job is complete, before it notices the failure. A possible mitigation would be to use the recovery mechanism; however, I feel this isn't very satisfying.

The problem is that, while the job is being executed in the node, we are neither reading from, nor writing to the socket connection, so we can't detect that the connection is closed.
#6
Comment posted by
 lolo4j
Nov 15, 21:45
This is now fixed. I added a connection checker mechanism in the node, which checks whether the connection to the driver is still working while the tasks are executing. It is then suspended until the next job arrives and starts executing. As this mechanism makes execution a little slower (although this is negligible for long-lived jobs), I made it optional via the configuration property "jppf.node.check.connection = false".
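
For readers unfamiliar with the idea, here is a rough sketch of one way such a check can be done in plain Java. It is not the code committed for this issue: the class and method names are hypothetical, and it assumes the driver sends nothing to the node while tasks are running. A timed read that times out means the connection is idle but open, while end-of-stream or an I/O error means it is gone.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class ConnectionCheckSketch {
  /**
   * Probes an otherwise idle connection with a timed read.
   * Assumption: the peer sends no data during the probe, otherwise a byte would be consumed.
   */
  static boolean isAlive(Socket socket, int timeoutMillis) {
    try {
      socket.setSoTimeout(timeoutMillis);
      // read() returns -1 once the peer has closed its side of the connection.
      return socket.getInputStream().read() != -1;
    } catch (SocketTimeoutException e) {
      return true;  // nothing to read within the timeout: idle, but still open
    } catch (IOException e) {
      return false; // reset or other socket error
    }
  }

  /** Probes a loopback connection before and after the "driver" side closes it. */
  static boolean[] probeBeforeAndAfterClose() throws IOException {
    try (ServerSocket server = new ServerSocket(0);
         Socket nodeSide = new Socket(InetAddress.getLoopbackAddress(), server.getLocalPort());
         Socket driverSide = server.accept()) {
      boolean before = isAlive(nodeSide, 200);
      driverSide.close(); // simulate the driver going away
      boolean after = isAlive(nodeSide, 200);
      return new boolean[] {before, after};
    }
  }

  public static void main(String[] args) throws IOException {
    boolean[] result = probeBeforeAndAfterClose();
    System.out.println("alive before close: " + result[0] + ", after close: " + result[1]);
  }
}
```

A checker along these lines could call the probe periodically from a background thread while tasks execute and stay idle between jobs. Note that this kind of probe only detects a connection the peer or the operating system has actually closed or reset; a silently dropped network link would need a different mechanism.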

Changes committed to SVN trunk revision 2547

The issue was updated with the following change(s):
  • This issue has been closed
  • The status has been updated, from New to Closed.
  • This issue's progression has been updated to 100 percent completed.
  • The resolution has been updated, from Not determined to RESOLVED.
  • Information about the user working on this issue has been changed, from lolo4j to Not being worked on.