I recently changed my node's PermGen settings, so it's possible that setting this on the slave nodes is limiting the number that can launch:
But I am only running 1 master + 2 slaves on an 8GB machine so that shouldn't be a problem... or maybe it is.
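For context, the kind of node setting I'm talking about looks like the snippet below. The values are illustrative only, not my exact numbers, and I'm assuming the standard jppf.jvm.options property is how the memory flags reach the node JVMs:

    # jppf-node.properties (illustrative values)
    # JVM options passed to the node's JVM, including the PermGen cap
    jppf.jvm.options = -server -Xmx256m -XX:MaxPermSize=128m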
The symptom remains, so it's not the automatic provisioning that's at fault; something is amiss with whatever is causing these nodes to fail.
When I said "suppressed" the node provisioning, I did so by specifying "0" for the value of slaveNodes in the above code.
However, to rule out a problem with the node configuration, I ran just the nodes and the standard JPPF task code with no problems. (Other than a display of 0% CPU on active slave nodes while the master was idle... but they were doing work.)
So the problem is in my monitoring client, and is likely in the provisioning code above, or the associated JMX. The only other change to the monitoring code since it worked well is some server listing code, quoted below.
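For illustration only (my actual code is the one referred to above), the provisioning call is conceptually along these lines; the MBean name string and the invoke() signature here are assumptions from memory rather than exact code:

    import org.jppf.management.JMXNodeConnectionWrapper;

    // Simplified sketch of the provisioning call in the monitoring client.
    // Passing 0 for slaveNodes is what I meant by "suppressing" the provisioning.
    public class SlaveProvisioner {
      // assumed name of the node provisioning MBean
      private static final String PROVISIONING_MBEAN = "org.jppf:name=provisioning,type=node";

      public void provision(JMXNodeConnectionWrapper masterJmx, int slaveNodes) throws Exception {
        // invoke provisionSlaveNodes(int) on the master node's provisioning MBean
        masterJmx.invoke(PROVISIONING_MBEAN, "provisionSlaveNodes",
            new Object[] { slaveNodes }, new String[] { int.class.getName() });
      }
    }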
This happens once the slave node has been selected for task dispatching, but before its server-side channel is transitioned to its new state. What's missing is some exception handling code in or around the call to TaskQueueChecker.dispatchJobToChannel().
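In other words, the shape of the fix would be something like the sketch below, where the placeholder interfaces merely stand in for the actual driver classes around TaskQueueChecker:

    // Sketch of the missing exception handling; Channel and Dispatcher are placeholders
    // standing in for the actual server-side channel and TaskQueueChecker types.
    public class DispatchSketch {
      interface Channel { void handleException(Exception e); }                          // placeholder
      interface Dispatcher { void dispatchJobToChannel(Channel c) throws Exception; }   // placeholder

      // Wrap the dispatch so that a failure cleans up the channel instead of
      // leaving it stuck between its old and new state.
      static void safeDispatch(Dispatcher dispatcher, Channel channel) {
        try {
          dispatcher.dispatchJobToChannel(channel);
        } catch (Exception e) {
          // without this, the channel is never closed and remains registered with the server
          channel.handleException(e);
        }
      }
    }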
I'll upload and do another test.
Unfortunately, it looks like this just recurred, so it's not fixed....
I currently have 3200 idle nodes... and counting....
Some of the nodes show classloading errors. Unfortunately I didn't copy the stack trace before shutting it down.
It may also not be the change from 3.4.1 to 3.4.3, but rather the change in how quickly I provision slave nodes. Before, I tended to do it manually, server by server, but lately I just do a Ctrl-A to select the whole list and provision everything at once. (And I've repeated this process multiple times as more of my cloud nodes connect.)
Any guidance on what you might need from me for troubleshooting? My plan was to try the following:
To reproduce with a single master, I have a separate thread in my client which provisions 20 slaves, waits 10s, un-provisions the slaves, waits 1s. Wash, rinse, repeat. The greater the number of slaves, the sooner the issue is triggered. I'm using a job streaming pattern in the client with tasks that simply wait for 1ms and then log a string via log4j2, so they don't have to do much. The driver uses "nodethread" load-balancing with multiplicator = 1.
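In code, the provisioning thread looks roughly like this (a simplified sketch: provisionSlaves() is a placeholder for the actual JMX call to the master's provisioning MBean, and the job streaming part is left out):

    // Simplified sketch of the reproduction scenario: a background thread that keeps
    // provisioning and un-provisioning slaves while jobs are streaming through the client.
    public class ProvisioningStress implements Runnable {
      @Override
      public void run() {
        try {
          while (!Thread.currentThread().isInterrupted()) {
            provisionSlaves(20);      // provision 20 slaves on the master
            Thread.sleep(10_000L);    // let them run for 10 seconds
            provisionSlaves(0);       // un-provision all the slaves
            Thread.sleep(1_000L);     // short pause, then start over
          }
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      }

      // placeholder for the JMX call to the master node's provisioning MBean
      private void provisionSlaves(int nbSlaves) {
        // ...
      }
    }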
Additional clues I found by looking at the state of the driver with VisualVM, after running the test and killing all the nodes from the console:
This exception prevents the handleException() method of the corresponding channel from being invoked, so the channel is never properly closed and discarded by the server.
However, I'm still seeing a job hang at some point, and also a number of jmx@xxx threads trying to connect to the nodes via JMX, even though those nodes are dead...
The good news is that I've now completely automated the test and the failure detection (a job hang is detected when the server stops incrementing the executed tasks count), so I'm wasting less time on this. My test sends 1000 jobs with 1000 tasks each, one at a time. So far I've never made it through all 1000 jobs, so the test is a valid one. Once it passes, I'll run a much larger number of jobs (by at least 1 or 2 orders of magnitude) to make sure the server will hold up over time. But we're not there yet.
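The detection itself is nothing fancy, something along these lines (a sketch; the executed-task count supplier is a placeholder for however the count is actually read from the driver, e.g. its JMX statistics):

    import java.util.function.LongSupplier;

    // Sketch of the hang detection: while a job is in flight, the driver's executed-tasks
    // count must keep increasing; if it stays flat for too long, the job is considered hung.
    public class HangDetector {
      private final LongSupplier executedTaskCount;  // placeholder: reads the count from the driver
      private final long timeoutMillis;

      public HangDetector(LongSupplier executedTaskCount, long timeoutMillis) {
        this.executedTaskCount = executedTaskCount;
        this.timeoutMillis = timeoutMillis;
      }

      // Returns true as soon as the count increases, false if it never moves within the timeout.
      public boolean madeProgressWithin() throws InterruptedException {
        final long start = System.currentTimeMillis();
        final long initial = executedTaskCount.getAsLong();
        while (System.currentTimeMillis() - start < timeoutMillis) {
          Thread.sleep(1_000L);
          if (executedTaskCount.getAsLong() > initial) return true;
        }
        return false;  // no progress: treat the current job as hung
      }
    }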
I guess I should find a way to identify the node in ServerTaskBundleNode.toString(), if possible.
The node channels show only the master node:
The admin console's job data view still shows the job is dispatched to 2 dead nodes:
And I see there are 2 live "jmx@xxx" threads that should have died:
This is turning out to be a hard-to-find issue, but it is also an excellent test of how the grid behaves under that kind of stress, with frequent, massive dynamic changes in the topology. I apologize for the time it takes; this is one of those bugs that require time, patience, hard work and imagination. I've already had bugs where I literally spent weeks finding the cause and a solution, and this looks like one of them. I just ain't giving up, ever.
No problem at all! I'm just happy when you can reproduce the same symptoms I have and trust you're taking the time to fix it the right way. I have workarounds for the issues and can be patient.
The only issue now remaining, or at least the only one I can detect, is a job hang due to a single batch of dispatched tasks that is never returned to the client. An additional clue is the warnings I sometimes see in the client log:
This indicates that the server is sending back results that were already sent to the client, or that there may be confusion in the task positions. I'm still trying to figure out when and where this happens, which is proving difficult.
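To help narrow this down, I'm thinking of a client-side check along these lines (a sketch; onResultsReceived() is just a stand-in for wherever the client sees a batch of results with their task positions):

    import java.util.HashSet;
    import java.util.Set;

    // Sketch: remember which task positions have already come back for a job, so a batch
    // containing an already-seen position (i.e. a duplicate result) is flagged immediately.
    public class DuplicateResultDetector {
      private final Set<Integer> seenPositions = new HashSet<>();

      // called for each batch of results; positions are the tasks' positions within the job
      public void onResultsReceived(final int[] positions) {
        for (final int pos : positions) {
          if (!seenPositions.add(pos)) {
            System.err.println("duplicate result received for task position " + pos);
          }
        }
      }
    }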
Fixes in