In some situations, a node becomes disconnected from the driver, but the driver still shows it as available/idle. When new jobs are submitted, the driver attempts to dispatch their tasks to the failed nodes, decrementing the queue in the process. Since those nodes never actually receive the tasks, nothing happens. When a working node later becomes available, the tasks are dispatched to it, decrementing the queue a second time. This results in negative queue lengths.
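To illustrate the accounting (this is just a sketch of the apparent double-count, not JPPF's actual code): the counter is decremented once for the dispatch to the dead node, and once more when the same tasks are re-dispatched to a live node, so it ends up below zero.

// Minimal illustration of the suspected double-count; not JPPF code.
public class QueueCountSketch {
    public static void main(String[] args) {
        int reported = 2;            // two tasks queued, counter starts at 2
        boolean delivered = false;   // the "idle" node is actually gone

        reported -= 2;               // dispatch to the dead node: counter decremented anyway

        // The tasks are still pending, so when a real node connects
        // they are dispatched again and the counter is decremented again.
        if (!delivered) reported -= 2;

        // Actual pending tasks: 0. Reported queue size: -2.
        System.out.println("reported queue size: " + reported);
    }
}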
For example, I was processing tasks on six active nodes on Rackspace cloud servers. To change some code, I stopped the client and "deleted" the cloud servers. Since they were not shut down from within JPPF, the driver did not recognize that they had gone away, and the Monitoring and Administration tool still showed them as Connected / Executing.
I restarted my client, submitting two jobs with eight total tasks to the driver.
13:28:55,153 INFO [AssessorsClient] - Dispatching 6 tasks for job long-task-assessors.
13:28:55,153 INFO [AssessorsClient] - Dispatching 2 tasks for job short-task-assessors.
Of note, the long-task-assessors job has a higher priority than short-task-assessors, but is restricted by its SLA to process only on nodes with 1 thread. I'm using the nodethreads load-balancing profile with multiplicator 1.
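For context, the client-side job setup is roughly the following. This is only a sketch: the job names match the logs, but the priority values, the property name used in the execution policy, and the exact API calls are assumptions that vary by JPPF version.

import org.jppf.client.JPPFJob;
import org.jppf.node.policy.Equal;

public class JobSetupSketch {
    // Higher-priority job, restricted to single-thread nodes.
    static JPPFJob buildLongTaskJob() {
        JPPFJob job = new JPPFJob();
        job.setName("long-task-assessors");
        job.getSLA().setPriority(10);  // illustrative value
        // Property name is an assumption; it may differ by JPPF version.
        job.getSLA().setExecutionPolicy(new Equal("jppf.processing.threads", 1));
        return job;
    }

    // Lower-priority job with no node restriction.
    static JPPFJob buildShortTaskJob() {
        JPPFJob job = new JPPFJob();
        job.setName("short-task-assessors");
        job.getSLA().setPriority(1);   // illustrative value
        return job;
    }
    // The driver side uses the "nodethreads" load-balancing algorithm
    // with multiplicator = 1 in its configuration.
}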
The JobNotificationListener on the server ([ServerMonitor] in the logs below) reported the following.
13:28:55,542 INFO [ServerMonitor] - Job short-task-assessors queued. Queue now has 1 tasks.
13:28:55,637 INFO [ServerMonitor] - Job short-task-assessors started
13:28:55,728 INFO [ServerMonitor] - Job short-task-assessors dispatched 1 tasks to 10.176.67.98. Queue now has 1 tasks.
13:28:55,827 INFO [ServerMonitor] - Job short-task-assessors dispatched 1 tasks to 10.176.68.115. Queue now has 1 tasks.
The first entry is the JOB_QUEUED event; the queue should have incremented to 2.
The second and third entries correspond to the first JOB_DISPATCHED event, and the fourth entry to the second JOB_DISPATCHED event. Note that neither of these node servers actually existed, so the tasks were never dispatched anywhere; the queue really still holds 2 tasks at this point.
It's odd that the reported queue ended up at 1 here: it should be either 2 (if no dispatches were counted) or 0 (if both were counted).
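For reference, the ServerMonitor above is essentially a JMX notification listener attached to the driver's job management MBean. A minimal sketch of that wiring with the plain JMX API follows; the connector URL, port, and MBean object name are assumptions that depend on the driver configuration and JPPF version.

import javax.management.*;
import javax.management.remote.*;

public class ServerMonitorSketch {
    public static void main(String[] args) throws Exception {
        // Assumed management URL/port for the driver's JMX server.
        JMXServiceURL url = new JMXServiceURL("service:jmx:jmxmp://localhost:11198");
        MBeanServerConnection mbsc = JMXConnectorFactory.connect(url).getMBeanServerConnection();

        // Assumed object name of the driver's job management MBean.
        ObjectName jobManagement = new ObjectName("org.jppf:name=jobManagement,type=driver");

        NotificationListener listener = (notification, handback) -> {
            // The real ServerMonitor inspects JPPF's job notification payload
            // (event type such as JOB_QUEUED/JOB_DISPATCHED, job name, target node)
            // and logs the driver's current queue size alongside it.
            System.out.printf("%s - %s%n", notification.getType(), notification.getMessage());
        };
        mbsc.addNotificationListener(jobManagement, listener, null, null);

        Thread.currentThread().join(); // keep listening for job events
    }
}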
Now continuing:
13:28:56,323 INFO [ServerMonitor] - Job long-task-assessors queued. Queue now has 3 tasks.
13:28:56,418 INFO [ServerMonitor] - Job long-task-assessors started
13:28:56,510 INFO [ServerMonitor] - Job long-task-assessors dispatched 1 tasks to 10.176.68.219. Queue now has 3 tasks.
13:28:56,606 INFO [ServerMonitor] - Job long-task-assessors dispatched 1 tasks to 10.176.69.44. Queue now has 3 tasks.
13:28:56,702 INFO [ServerMonitor] - Job long-task-assessors dispatched 1 tasks to 10.176.68.214. Queue now has 3 tasks.
13:28:56,796 INFO [ServerMonitor] - Job long-task-assessors dispatched 1 tasks to 10.176.68.145. Queue now has 3 tasks.
The first line is JOB_QUEUED, and the last four lines are JOB_DISPATCHED. Again, these servers don't exist and the tasks were never successfully dispatched, so the queue really holds 8 tasks at this point. Since the reported count started at 1, the final value of 3 (a net increase of 2) makes sense for queueing 6 and dispatching 4; but the reported total appears immediately on queueing, rather than after the dispatches.
Now, two of my new cloud server nodes come online:
13:31:43,811 INFO [ServerCreator] - [10.176.66.237] (512 MB, 1 vCPU) configured for long-task-assessors.
13:31:45,660 INFO [ServerCreator] - [10.176.68.11] (512 MB, 1 vCPU) configured for long-task-assessors.
13:31:47,975 INFO [ServerMonitor] - Job long-task-assessors dispatched 1 tasks to 10.176.66.237. Queue now has 2 tasks.
13:31:48,749 INFO [ServerMonitor] - Job long-task-assessors dispatched 1 tasks to 10.176.68.11. Queue now has 1 tasks.
At this point the queue is actually 6, but is showing as 1.
Two more nodes come online:
13:32:04,611 INFO [ServerCreator] - [10.176.66.168] (512 MB, 1 vCPU) configured for short-task-assessors.
13:32:05,324 INFO [ServerCreator] - [10.176.66.9] (512 MB, 1 vCPU) configured for long-task-assessors.
13:32:07,602 INFO [ServerMonitor] - Job short-task-assessors dispatched 2 tasks to 10.176.66.168. Queue now has -1 tasks.
13:32:07,787 INFO [ServerMonitor] - Job long-task-assessors dispatched 1 tasks to 10.176.66.9. Queue now has -2 tasks.
The first server has two threads, and since the long tasks can't process on it, the driver sends it two tasks from the short-task job. Note that these two tasks were the first ones allegedly dispatched to the nonexistent nodes earlier; the fact that they were still sitting around shows they were still in the queue, even though the queue count had already been decremented for them.
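With the nodethreads algorithm, as I understand it, each dispatch sends up to (multiplicator * node processing threads) tasks, which is why the 2-thread node received 2 tasks at once while the 1-thread nodes received 1 each:

// Illustrative only: how the "nodethreads" bundle sizes work out in this run.
public class BundleSizeSketch {
    static int bundleSize(int multiplicator, int nodeThreads) {
        return multiplicator * nodeThreads;
    }

    public static void main(String[] args) {
        System.out.println(bundleSize(1, 1)); // 1 task per dispatch to a 1-thread node
        System.out.println(bundleSize(1, 2)); // 2 tasks at once to the 2-thread node
    }
}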
More servers come online:
13:33:32,147 INFO [ServerCreator] - [10.176.67.31] (512 MB, 1 vCPU) configured for long-task-assessors.
13:33:36,039 INFO [ServerMonitor] - Job long-task-assessors dispatched 1 tasks to 10.176.67.31. Queue now has -3 tasks.
13:33:39,275 INFO [ServerCreator] - [10.176.66.230] (512 MB, 1 vCPU) configured for long-task-assessors.
13:33:39,299 INFO [ServerCreator] - [10.176.66.52] (512 MB, 1 vCPU) configured for long-task-assessors.
13:33:43,169 INFO [ServerMonitor] - Job long-task-assessors dispatched 1 tasks to 10.176.66.52. Queue now has -4 tasks.
13:33:44,154 INFO [ServerMonitor] - Job long-task-assessors dispatched 1 tasks to 10.176.66.230. Queue now has -5 tasks.
Here the queue should actually be 0, but it remains at -5 for the duration of the process.
EDIT: I've changed the Reproducibility to "Always" after finding out that the key factor is whether any of the remaining servers are "idle" when they are disconnected. Since initially reporting the error I had updated my shutdown/server-deletion code, which made it harder to reproduce in normal operation; but now that I know what's going on, I can reproduce it at will.
Additional symptoms:
- I started a set of tasks and "killed" the node servers while they were still executing, with 2 tasks still in the driver queue.
- I stopped the client and started it again, and the two extra tasks showed up in the reported queue size (I submitted 8 tasks):
16:04:09,450 INFO [AssessorsClient] - Dispatching 6 tasks for job long-task-assessors.
16:04:09,451 INFO [AssessorsClient] - Dispatching 2 tasks for job short-task-assessors.
16:04:09,824 INFO [ServerMonitor] - Job short-task-assessors queued. Queue now has 4 tasks.
16:04:10,582 INFO [ServerMonitor] - Job long-task-assessors queued. Queue now has 10 tasks.
- The new tasks executed, but the 2 "phantom" tasks from the previous execution remained on the queue, even while servers were idle, and the queue remained at 2.
- I "killed" the node servers again, and interrupted/restarted the client. At the time there were 3 "phantom" idle servers, one with two threads and two with one thread each (capacity for 4 tasks).
- In the new process, the initial submission (before any real servers connected) looked like this:
16:24:50,833 INFO [AssessorsClient] - Dispatching 6 tasks for job long-task-assessors.
16:24:50,834 INFO [AssessorsClient] - Dispatching 2 tasks for job short-task-assessors.
16:24:51,216 INFO [ServerMonitor] - Job short-task-assessors queued. Queue now has 2 tasks.
16:24:51,324 INFO [ServerMonitor] - Job short-task-assessors started
16:24:51,438 INFO [ServerMonitor] - Job short-task-assessors dispatched 1 task to 10.176.69.11. Queue now has 2 tasks.
16:24:51,556 INFO [ServerMonitor] - Job short-task-assessors dispatched 1 task to 10.176.68.107. Queue now has 2 tasks.
16:24:51,940 INFO [ServerMonitor] - Job long-task-assessors queued. Queue now has 8 tasks.
16:25:10,118 INFO [ServerMonitor] - Job short-task-assessors dispatched 2 tasks to 10.176.69.13. Queue now has 6 tasks.
Note there were still 2 "phantom" tasks in the queue from two runs ago. So the initial short-task job added 2 more (actual queue 4), and then 2 were deducted for the dispatches to the two "phantom" 1-thread nodes. Then the additional 8 tasks in the long-task job arrived, and then 2 short-task tasks (perhaps the same ones that had just "failed" on the other two nodes) were dispatched to the "phantom" 2-thread node. The queue ends at 6 when it actually holds 8: the 8 I submitted PLUS the initial 2 leftover tasks MINUS the 4 threads available on the "non-existent" nodes. Predictably, this run ended with a queue length of -2.
EDIT AGAIN:
Decided to try it yet another time with no "phantom" servers but starting with the -2 queue length. Throughout the run, queue lengths were offset by 2 from where they were expected, e.g.,
17:00:23,512 INFO [AssessorsClient] - Dispatching 6 tasks for job long-task-assessors.
17:00:23,513 INFO [AssessorsClient] - Dispatching 2 tasks for job short-task-assessors.
17:00:23,846 INFO [ServerMonitor] - Job short-task-assessors queued. Queue now has 0 tasks.
17:00:24,778 INFO [ServerMonitor] - Job long-task-assessors queued. Queue now has 6 tasks.
So there appear to be multiple issues involved here: leftover tasks from a previous run stay in the queue's total count (even though they will never be processed), a negative queue length carries over to the next run, and nonexistent idle servers deduct from the queue even though tasks never process on them. All of these situations survive a stats reset(). Obviously, restarting the driver fixes it all...
Attachment: server stats view