Nothing entered.
In order to reproduce behavior I'm just executing simple job with client timeout set:
val job = new JPPFJob(s"Job $i")
val timeout = new JPPFSchedule(Random.nextInt(1000))
job.getClientSLA.setJobExpirationSchedule(timeout)
job.add(new Callable[String] with Serializable { def call(): String = {println(s"Task $i"); s"Task $i"} })
val result = client.submitJob(job).get(0)
println(s"Job [$i], cancelled: [${job.isCancelled}], result: [${result.getResult}], exception: [${result.getThrowable}]")
I'm randomizing timeout value to have some jobs canceled, and some actually executed; also I want timeout to trigger at various stages of job processing. Now I'm just running this code 1000 times in a loop against single jppf client. Cluster consists of JPPF 5.2.2 driver and single node (all with default configuration).
And now - depending on luck:
1. Job never finishes (awaits results forever).
When this happened - job was actually executed on node (node console had "Task 428" printout).
[org.jppf.server.job.JPPFJobManager.jobQueued(146)]: jobId 'Job 428' queued
[org.jppf.server.job.JPPFJobManager.jobUpdated(178)]: jobId 'Job 428' updated
[org.jppf.server.job.management.DriverJobManagement.sendNotification(233)]: sending event JOB_QUEUED for job JobInformation[jobUuid=Job 428, jobName=Job 428, taskCount=0, initialTaskCount=1, priority=0, suspended=false, pending=false, maxNodes=2147483647], node=null
[org.jppf.server.job.management.DriverJobManagement.sendNotification(233)]: sending event JOB_UPDATED for job JobInformation[jobUuid=Job 428, jobName=Job 428, taskCount=1, initialTaskCount=1, priority=0, suspended=false, pending=false, maxNodes=2147483647], node=null
[org.jppf.server.job.JPPFJobManager.jobUpdated(178)]: jobId 'Job 428' updated
[org.jppf.server.job.management.DriverJobManagement.sendNotification(233)]: sending event JOB_UPDATED for job JobInformation[jobUuid=Job 428, jobName=Job 428, taskCount=0, initialTaskCount=1, priority=0, suspended=false, pending=false, maxNodes=2147483647], node=null
[org.jppf.server.job.JPPFJobManager.jobDispatched(120)]: jobId 'Job 428' : added node RemoteNodeContext[channel=SelectionKeyWrapper[id=4], state=SENDING_BUNDLE, uuid=6D331B8F-D1BB-D0B2-6B4A-62D7F33E0810, connectionUuid=null, peer=false, ssl=false]
[org.jppf.server.job.management.DriverJobManagement.sendNotification(233)]: sending event JOB_DISPATCHED for job JobInformation[jobUuid=Job 428, jobName=Job 428, taskCount=1, initialTaskCount=1, priority=0, suspended=false, pending=false, maxNodes=2147483647], node=JPPFManagementInfo[127.0.0.1:12001, type=node|MASTER, local=false, secure=false, active=true, uuid=6D331B8F-D1BB-D0B2-6B4A-62D7F33E0810]
[org.jppf.server.job.management.DriverJobManagement.cancelJob(69)]: Request to cancel job 'Job 428'
[org.jppf.server.job.JPPFJobManager.jobReturned(132)]: jobId 'Job 428' : removed node RemoteNodeContext[channel=SelectionKeyWrapper[id=4], state=WAITING_RESULTS, uuid=6D331B8F-D1BB-D0B2-6B4A-62D7F33E0810, connectionUuid=null, peer=false, ssl=false]
[org.jppf.server.job.management.DriverJobManagement.sendNotification(233)]: sending event JOB_RETURNED for job JobInformation[jobUuid=Job 428, jobName=Job 428, taskCount=1, initialTaskCount=1, priority=0, suspended=false, pending=false, maxNodes=2147483647], node=JPPFManagementInfo[127.0.0.1:12001, type=node|MASTER, local=false, secure=false, active=true, uuid=6D331B8F-D1BB-D0B2-6B4A-62D7F33E0810]
Note - there is no JOB_ENDED notification after JOB_RETURNED.
2. Job doesn't have cancelled flag set, but result=null and exception=null
Here also job was actually executed on node.
3. Job fails with java.lang.NoClassDefFoundError
This usually happens after first job was canceled:
Submitting job [1] with timeout [schedule[delay=20]]
Task [1], cancelled: [true], time: [24 ms], result: [null], exception: [null]
Submitting job [2] with timeout [schedule[delay=819]]
Task [2], cancelled: [false], time: [17 ms], result: [null], exception: [java.lang.NoClassDefFoundError: Could not initialize class scala.Predef$]
4. Node gets locked - it stops executing tasks, I think it also disappears from management console.
This is actually original problem I'm trying to solve.
It also seems to be related to task cancelation and classloading, as in node log I found:
2016-10-13 16:54:51,044 [DEBUG][org.jppf.classloader.AbstractJPPFClassLoaderLifeCycle.loadResource(165)]: java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at java.lang.Object.wait(Unknown Source)
at org.jppf.classloader.ResourceFuture.get(ResourceFuture.java:95)
at org.jppf.classloader.ResourceFuture.get(ResourceFuture.java:32)
at org.jppf.classloader.AbstractClassLoaderConnection.loadResource(AbstractClassLoaderConnection.java:121)
...
Really delete this comment?
Really delete this comment?
Really delete this comment?
Really delete this comment?
Really delete this comment?
Really delete this comment?
Really delete this comment?
Really delete this comment?
Really delete this comment?
Additionally, could you attach the node and driver logs to this defect?
Also, I see that you're running the client and jobs from Scala. Are the Scala libraries deployed on the node side, or are they downloaded from the client side? This could cause class loader problems as we have seen in the past for other languages like Clojure.
Lastly, my apologies for the missing 5.2.2 release and 5.2.3 target milestone, this is now fixed.
Really delete this comment?
Really delete this comment?
Really delete this comment?
I see 2 possible fixes:
Really delete this comment?
Regarding `issue 3`
lolo4j wrote:
But issue is not that first job has timed out due to lots of classes loaded over network. Problem is that second job, executed just after it - runs in context of somehow broken class loader resulting in class not found error. Rather than that, it should also timeout (if it also did not manage load necessary classes and perform work within specified timeout), or it should execute successfully. Now it seems once this happens, session enters some invalid state - and one has to use another jppf connection to be able to execute some other task.
I thought `job.getSLA.getClassPath.setForceClassLoaderReset(true)` might be used as kind of workaround, but doesn't seem to help much.
Deploying libraries to nodes will decrease probability of problems by reducing overall classloading time, but issue still remains and probably could happen even with single-class, no-dependencies, java-only task being executed on infrastructure with really low network throughput between hosts.
regards, marcin
Really delete this comment?
Thank you very much for your input. I totally agree with you. From what I've seen the issue boils down to the initial NoClassDefoundError, which is not properly handled by the JPPF class loader. I believe it is triggered by th fact that the task is cancelled, which is done by interupting the thread that executes the task. At the time this happens, the class loader is waiting for the driver to send the response to a class loading request, while in the scope of a call to ClassLoader.define(...).
I'm investigating how this should be handled.
Really delete this comment?
Really delete this comment?
Here's the code of the AbstractJPPFClassLoaderLifeCycle.loadResource() method:
We can see the InterruptedException is caught here and the returned ResourceWrapper is null. Because of this null value, the caller method higher in the stack AbstractJPPFClassLoader.findClass() will raise a ClassNotFoundException, This method is called by the native method ClassLoader.defineClass1() which I suspect is throwing the NoClassDefFoundError upon catching the ClassNotFoundException. I'm guessing here, since I didn't look in the native code to check it, however I couldn't find another non-native method in the call stack that throws a NoClassDefoundError (and I did look at the java code for these).
In our code, there's only one place that calls ClassLoader.defineClass(). I'll try and catch the NoClassDefoundError there and wrap it in in a ClassNotFoundException, which should be the normal behavior. I will try and see if it's possible to propagate the information that the thread was interrupted, so it's possible to make a decision as whether to add the class to the "not found cache" (we shouldn't in this scenario, but there are other real reasons for throwing a NoClassDefoundError).
Really delete this comment?
Really delete this comment?
Really delete this comment?
For example, to avoid interuption of the tasks's thread, a callable task should be implemented like this:
A task that extends AbstractTask would only need to override isInterruptible().
With this, when a task is cancelled, its thread will not be interrupted, so any class loading taking place will complete.
Really delete this comment?
Really delete this comment?
Really delete this comment?
Really delete this comment?
Really delete this comment?
Really delete this comment?
For other types of tasks, like those implementing Callable, you need to set a JPPFTaskCallback onto the JPPFAnnotatedTask wrapping the Callable:
zorba128 wrote:
Really delete this comment?
Really delete this comment?
And node logs:
The issue was updated with the following change(s):
Really delete this comment?
Really delete this comment?
There is a non-public API for a callback invoked immediately upon cancellation of a task, used only in the .Net port. I have adapted JPPFAnnotatedTask to use it, so it becomes available by implementing the interface org.jppf.node.protocol.CancellationHandler, for instance like this:
I have uploaded a jar file to this bug report that includes this fix. Can you give it a try? The jar also includes a fix for Bug report JPPF-483 - ConcurrentModificationException in AbstractExecutionManager, which may be causing the job hung problem. If this doesn't resolve the job hung issue, could please provide a sample code to reproduce it?
Really delete this comment?
Really delete this comment?
I don't think anything changed after applying patch; test is exactly as described at the very beginning if this bug report. See complete java code to reproduce issue:
You might want to fine-tune delays/randomization range, but on my setup (ms windows machine with local driver and local node launched with default config) I'm able to hang jppf with 100% accuracy.
Really delete this comment?
Really delete this comment?
Really delete this comment?
Can you give it a try? I attached a new patched jar for the server (file jppf-server-5.2.4-SNAPSHOT.jar).
Really delete this comment?
Glad to hear that; with both patches applied everything seems fine at first sight. We'll do some real-life testing after 5.2.4 is released and available as maven dependency.
I'll let you know if we find out something.
marcin
Really delete this comment?
Really delete this comment?
Really delete this comment?
The issue was updated with the following change(s):
Really delete this comment?
Really delete this comment?
Really delete this comment?