JPPF Issue Tracker
CLOSED  Enhancement JPPF-375  -  Improve the efficiency of the adaptive load-balancers for very short tasks
Posted Mar 10, 2015 - updated Sep 12, 2015
This issue has been closed with status "Closed" and resolution "RESOLVED".
Issue details
  • Type of issue
    Enhancement
  • Status
    Closed
  • Assigned to
     lolo4j
  • Type of bug
    Not triaged
  • Likelihood
    Not triaged
  • Effect
    Not triaged
  • Posted by
     lolo4j
  • Owned by
    Not owned by anyone
  • Category
    Performance
  • Resolution
    RESOLVED
  • Priority
    Normal
  • Targeted for
    JPPF 5.1
Issue description
I noticed that, when submitting jobs with very short-lived tasks (sub-millisecond on my current hardware), and using one of the built-in adaptive load-balancing algorithms ("proportional", "autotuned" or "rl"), there is a large variance in the number of tasks executed by each node, even when all the nodes are identical.

From my experiments, it appears that the main limiting factor is the system timer resolution. On my machine it is 0.5 ms, which implies that any duration measured in Java (or any other language, for that matter) will be either 0 or >= 500,000 nanoseconds. For instance, it looks like Thread.sleep(0L, 500000) has exactly the same effect as Thread.sleep(1L).
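A minimal standalone probe (a sketch for illustration only, not part of the JPPF code) makes this visible by timing a requested 0.5 ms sleep:

// Standalone sketch: measure how long a requested 0.5 ms sleep actually takes.
// On a system whose timer resolution is 0.5 ms (or whose JVM rounds sub-millisecond
// sleeps up), the average comes out around 1 ms, i.e. the same as Thread.sleep(1L).
public class SleepResolutionProbe {
  public static void main(String[] args) throws InterruptedException {
    final int iterations = 100;
    long total = 0L;
    for (int i = 0; i < iterations; i++) {
      long start = System.nanoTime();
      Thread.sleep(0L, 500_000); // request 0.5 ms
      total += System.nanoTime() - start;
    }
    System.out.printf("average actual sleep: %.3f ms%n", total / (iterations * 1_000_000d));
  }
}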

Since the adaptive algorithms use the nodes' past performance for their computations, and since, due to the timer resolution limitation, the performance numbers are inherently inaccurate once the mean task duration falls below a given threshold ... Houston, we've got a problem.

#3
Comment posted by
 lolo4j
May 14, 08:38
From my experiments, I observed that System.nanoTime() has very good accuracy. Its accuracy is actually that of a system clock tick, which resolves to 285 nanoseconds on my fastest machine. Since the load balancers use System.nanoTime() to measure time intervals, this can't be the problem. Thus the remaining suspect has to be the load-balancer bootstrapping.
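For reference, that clock tick can be estimated with a simple spin loop (again a standalone sketch, not JPPF code):

// Standalone sketch: estimate the smallest observable System.nanoTime() increment
// by spinning until the returned value changes.
public class NanoTimeTickProbe {
  public static void main(String[] args) {
    long smallest = Long.MAX_VALUE;
    for (int i = 0; i < 100_000; i++) {
      long t0 = System.nanoTime();
      long t1;
      do {
        t1 = System.nanoTime();
      } while (t1 == t0);
      smallest = Math.min(smallest, t1 - t0);
    }
    System.out.println("smallest observed nanoTime increment: " + smallest + " ns");
  }
}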

Example scenario: the driver is configured with the "proportional" algorithm and an initial size of 10. We have 2 identical nodes, and we submit a job with 100 tasks. In the driver, the following sequence occurs:
  • the first dispatch to node 1 is 10 tasks (initial size)
  • then the first dispatch to node 2 is also 10 tasks
  • execution of tasks in node 1 ends before node 2 has finished executing its 10 tasks
  • the load-balancer instance for node 1 updates itself, with a better performance than node 2
  • the next dispatch to node 1 has a much larger number of tasks, possibly all 80 remaining tasks
  • tasks from node 2 are returned
  • the tasks from the second dispatch to node 1 are returned; since there were more tasks, the measured performance is even better, further increasing the imbalance between the 2 nodes
Since the "proportional" algorithm is deterministic (there is no random value or calculation involved) it's logical (in hindsight of course) that it would be sensitive to this bootstraping issue. I still need to experiment with the other adaptive algorithms, "rl" and "autotuned", to see how their behovior compares. A solution might be to add some random "noise" to the calculations, for insance adding or subtracting a small random amount to the measured performance, optionally until some condition is reached, such as a number of executed tasks by the corresponding node.

#4
Comment posted by
 lolo4j
May 14, 09:14
I've run a test with the following scenario:
  • 10,000 jobs with 1,000 tasks each
  • each task has a basically empty run() method
  • jobs are submitted concurrently with a maximum of 10 concurrent jobs, using the job streaming pattern (a generic sketch of this throttling follows the list)
  • 10 nodes all on the same physical machine
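The job streaming part of the harness is essentially a throttle on the number of in-flight jobs; a generic sketch of that pattern (plain Java, not the actual JPPF client API or my test code) looks like this:

import java.util.concurrent.Semaphore;

// Generic sketch of "at most N concurrent jobs": acquire a permit before submitting
// a job, release it when the job completes. submitJobAsync() is a placeholder for
// whatever asynchronous submission mechanism the client provides.
public class JobStreamingThrottle {
  private final Semaphore permits;

  public JobStreamingThrottle(int maxConcurrentJobs) {
    this.permits = new Semaphore(maxConcurrentJobs);
  }

  public void run(int totalJobs) throws InterruptedException {
    for (int i = 0; i < totalJobs; i++) {
      permits.acquire();                    // blocks while 10 jobs are in flight
      submitJobAsync(i, permits::release);  // release the permit on completion
    }
  }

  // Placeholder for an asynchronous job submission; the callback runs when the job completes.
  private void submitJobAsync(int jobId, Runnable onComplete) {
    new Thread(() -> {
      // ... build and execute the job ...
      onComplete.run();
    }).start();
  }
}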
For each adaptive load-balancing algorithm, I ran the test and measured the minimum and maximum number of tasks executed on a node, the maximum difference between 2 nodes (max - min), and the total execution time on the client side:
"autotuned" algorithm
 
min tasks:   862,753
max tasks: 1,219,185
max spread:  356,432
total time: 4mn 0.497s
 
-------------------------
 
"rl" algorithm
 
min tasks:   844,462
max tasks: 1,152,143
max spread:  307,681
total time: 4mn 3.794s
 
-------------------------
 
"proportional" algorithm
 
min tasks:   614,354
max tasks: 1,505,756
max spread:  891,402
total time: 3mn 30.135s
Interestingly, the "proportional" algorithm yields the best overall performance, even though it has the greatest imbalance ...