JPPF Issue Tracker
CLOSED  Bug report JPPF-392  -  Unresponsive node JMX hangs all driver JMX node info
Posted May 08, 2015 - updated Aug 15, 2018
This issue has been closed with status "Closed" and resolution "RESOLVED".
Issue details
  • Type of issue
    Bug report
  • Status
    Closed
  • Assigned to
     lolo4j
  • Progress
  • Type of bug
    Not triaged
  • Likelihood
    Not triaged
  • Effect
    Not triaged
  • Posted by
     Daniel Widdis
  • Owned by
    Not owned by anyone
  • Category
    JMX connector
  • Resolution
    RESOLVED
  • Priority
    Normal
  • Reproducibility
    Always
  • Severity
    Not determined
  • Targeted for
    JPPF 5.0.x
Issue description
When some nodes become unresponsive to JMX but remain connected, JMX requests through the driver that collect node information also appear to hang, until the problematic node's connection is severed by other means such as a reboot. The driver continues to dispatch jobs/tasks to the other nodes and still provides some summary statistics (queue length, number of connected nodes, etc.), but does not provide detailed information (e.g., number of slaves) for any node if the problematic node is included in the UuidSelector.

Steps to reproduce this issue
  1. Set up a JPPF network with several nodes.
  2. Monitor the network with the Admin GUI.
  3. On some of the nodes, trigger a memory leak that drives them into near-continuous garbage collection, so that they stop responding to JMX while remaining connected.
  4. Observe that the Admin GUI ceases to update node information.
  5. From a client, attempt to query node information for several nodes via a UuidSelector; the call blocks forever (or at least for a very long time), e.g.,
jmx.getNodeForwarder()
  .forwardGetAttribute(new UuidSelector(masterUuids),
    JPPFNodeProvisioningMBean.MBEAN_NAME, "NbSlaves");
  6. Reboot the machine(s) hosting the "hung" nodes. Observe that the Admin GUI and other client JMX requests return to normal operation.
(LC) Edit:

To automatically generate the condition where the node has almost no memory left, I deploy the following node startup class on a single node:
public class MyNodeStartup implements JPPFNodeStartupSPI {
  static byte[][] array;
 
  @Override
  public void run() {
    System.gc();
    Runtime runtime = Runtime.getRuntime();
    long free = runtime.maxMemory() - (runtime.totalMemory() - runtime.freeMemory());
    double percentageToAllocate = 0.95d;
    long sizeToAllocate = (long) (percentageToAllocate * free);
    int size = 8*1024;
    int arraySize = (int) (sizeToAllocate / size); // divide before casting, to avoid int overflow on large heaps
    System.out.printf("available memory = %,d, allocating array[%,d][%,d]%n", free, arraySize, size);
    try {
      array = new byte[arraySize][];
      for (int i=0; i<arraySize; i++) array[i] = new byte[size];
    } catch(Error e) {
      e.printStackTrace();
      throw e;
    }
  }
}
I chose a startup class because they are loaded and run after the JMX server is initialized. Also note that the large byte[][] is allocated in small chunks of 8 KB, so as to avoid heap fragmentation causing a premature OOME. Once the static array is initialized, any attempt to execute a task will fill up what little heap remains available and cause all further JMX requests to fail with an OOME. I also observed that simply connecting to the node with VisualVM has the same effect, so no JPPF client is actually needed.

#3
Comment posted by
 lolo4j
May 08, 09:49
It looks like the problem is that the driver is awaiting a response for the JMX request on each selected node, and when a node is not responding (either with an exception or a result) then the driver waits forever.

The solution I see now is to have the server set a timeout on each request and, when the timeout expires, return an error for the related node as part of the result of the forwarding request. I suppose I can add a server configuration property to set the timeout, like "jppf.forwarding.timeout = time_in_millis", or maybe a more general configuration property for any JMX request made via a JMXConnectionWrapper instance, like "jppf.jmx.request.timeout". In fact, that sounds better.
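As a rough illustration of the idea (this is a hypothetical sketch, not JPPF's actual implementation; `TimeoutSketch` and `invokeWithTimeout` are invented names), a blocking request can be bounded with a `Future` so that a hung node surfaces as an error instead of blocking forever:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimeoutSketch {
  static <T> T invokeWithTimeout(Callable<T> request, long timeoutMillis) throws Exception {
    ExecutorService executor = Executors.newSingleThreadExecutor();
    try {
      Future<T> future = executor.submit(request);
      // get() with a timeout turns an indefinite hang into a TimeoutException
      return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
    } finally {
      executor.shutdownNow(); // interrupt the worker thread if it is still blocked
    }
  }

  public static void main(String[] args) throws Exception {
    // a responsive "node" answers promptly
    System.out.println(invokeWithTimeout(() -> "NbSlaves=4", 1000L));
    // an unresponsive "node" blocks; the timeout surfaces as an error instead of a hang
    try {
      invokeWithTimeout(() -> { Thread.sleep(60_000L); return "never"; }, 200L);
    } catch (TimeoutException e) {
      System.out.println("node request timed out");
    }
  }
}
```

In the forwarding scenario, the error path would be recorded per node in the forwarding result rather than printed.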
#4
Comment posted by
 Daniel Widdis
May 08, 17:24, in reply to comment #3
lolo4j wrote:
The solution I see now is to have the server set a timeout on each request
and return an error as part of the result of the forwarding request, for
the related node, when the timeout expires. I suppose I can add a server
configuration property to set the timeout, like "jppf.forwarding.timeout =
time_in_millis", or maybe a more general configuration property for any JMX
request made via a JMXConnectionWrapper instance, like
"jppf.jmx.request.timeout". Well in fact that sounds better


Unfortunately, changing the behavior of the existing method to return an error (exception or null) will probably break current code, although I suppose having the default timeout very long would mitigate that. I might suggest adding an optional third argument to the methods, specifying a timeout, but there are a lot of methods so that might not be practical.

#5
Comment posted by
 lolo4j
May 09, 05:56, in reply to comment #4
The timeout would be passed on to the JMXMP remote connector client (see the JMXConnectionWrapper constructor), which already has a timeout with a default value of Long.MAX_VALUE. We'll just keep the same default value when the new configuration property is unspecified, thus no existing installation will have its behavior modified. Basically we're not changing anything here, we merely expose a configuration property that was already present in the jmx_remote code.
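Assuming the property ends up named as discussed above ("jppf.jmx.request.timeout" is still a proposal at this point), the driver configuration would be a one-liner:

```properties
# hypothetical: cap every JMX request made through a JMXConnectionWrapper at 30 seconds;
# when unset, the default stays Long.MAX_VALUE, i.e. the current wait-forever behavior
jppf.jmx.request.timeout = 30000
```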
#8
Comment posted by
 lolo4j
May 09, 07:46
I've also been exploring a way to automatically kill the node when an OOME occurs. This cannot be done reliably in Java code; however, the -XX:OnOutOfMemoryError JVM option allows doing it without much code. Thus I tried the following in the node's configuration:
kill.cmd = $script{ \
   java.lang.System.getProperty("os.name").toLowerCase().contains("windows") \
     ? "taskkill /PID %p /F" \
     : "kill -9 %p"; \
 }$
jppf.jvm.options = -XX:OnOutOfMemoryError="${kill.cmd}" -server -Xmx1024m
This works very well, but I'd like to find a way to actually restart the node, e.g. by forcing an exit code of 2.
#9
Comment posted by
 lolo4j
May 13, 09:10, in reply to comment #8
Or, with a slightly more concise syntax, using a JavaScript string instead of a Java string for the OS name:
kill.cmd = $script{ "${sys.os.name}".toLowerCase().indexOf("windows") >= 0 ? "taskkill /PID %p /F" : "kill -9 %p"; }$
jppf.jvm.options = -XX:OnOutOfMemoryError="${kill.cmd}" -server -Xmx1024m
#10
Comment posted by
 lolo4j
May 14, 07:06
Fixed in: