JPPF Issue Tracker
CLOSED  Feature request JPPF-303  -  Let node shutdown and reprovisioning respect executing nodes
Posted Aug 07, 2014 - updated Sep 06, 2014
This issue has been closed with status "Closed" and resolution "RESOLVED".
Issue details
  • Type of issue
    Feature request
  • Status
    Closed
  • Assigned to
     lolo4j
  • Type of bug
    Not triaged
  • Likelihood
    Not triaged
  • Effect
    Not triaged
  • Posted by
     Daniel Widdis
  • Owned by
    Not owned by anyone
  • Category
    Node
  • Resolution
    RESOLVED
  • Priority
    Normal
  • Targeted for
    JPPF 5.0
Issue description
The current implementations of shutdown(), and of provisionSlaveNodes() when the number of requested slave nodes is less than the current number, result in immediately stopping nodes (or some slave nodes), even if they are currently executing tasks. While the driver recognizes this and resubmits the tasks, this loses any progress the nodes had previously made in calculations.

It is possible to work around this limitation from the driver by toggling the node's (or some slave nodes') active state, waiting for it/them to complete a task, and then sending shutdown(). However, this method adds unneeded complexity.

We propose to add a boolean parameter "onTaskComplete" to both provisionSlaveNodes() and shutdown() specifying that the action is intended to be taken when the current task(s) executing on the node complete(s). When the parameter is false (or omitted), the current behavior is maintained. When the parameter is true, the node shutdown (or the stopping of some of the slave nodes) is delayed until tasks are completed on the nodes marked to be stopped. (Or, better, the first n-m slaves to finish tasks after the request are the ones which are stopped, rather than selecting the nodes to stop a priori.) When applied to a master node, shutdown(true) should also be dependent upon all tasks on associated slave nodes completing.
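
As a rough illustration of the "first n-m slaves to finish are the ones stopped" idea, here is a minimal sketch. It is not JPPF code: the class DeferredStopTracker and its method names are hypothetical, invented for this example, and it only models the bookkeeping, not the actual starting or stopping of processes.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch (not part of JPPF): counts how many slaves still have to stop. */
public class DeferredStopTracker {
  /** Number of slaves that must still stop once they finish their current task. */
  private final AtomicInteger pendingStops = new AtomicInteger(0);

  /** Records a request to go from 'current' slaves down to 'requested' slaves. */
  public void requestReduction(int current, int requested) {
    if (requested < current) pendingStops.addAndGet(current - requested);
  }

  /**
   * Called by a slave right after it completes a task.
   * Returns true if this slave should stop instead of accepting new work.
   */
  public boolean claimStopOnTaskComplete() {
    while (true) {
      int n = pendingStops.get();
      if (n <= 0) return false;                              // nothing left to stop
      if (pendingStops.compareAndSet(n, n - 1)) return true; // this slave takes one stop
    }
  }

  /** Number of stops not yet claimed by a slave. */
  public int pending() {
    return pendingStops.get();
  }
}
```

With 4 slaves and a request for 2, the first two slaves to call claimStopOnTaskComplete() get true and stop; later callers get false and keep working. No slave is selected in advance.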

#2
Comment posted by
 lolo4j
Aug 08, 08:12
These are very good ideas. They will require a serious redesign of the provisioning feature, and due to this they cannot be implemented in a maintenance release. As a rule, we never include any new feature in a maintenance release. Sometimes there can be an enhancement if it doesn't break the existing design. I'm thus pushing this feature to the 5.0 release.

The current design of the provisioning is that the master node only knows its slaves as local processes (as in executables) on the same machine: basically, each one is a java.lang.Process object. When a slave node is closed, it is not shut down as with the JMX operation; it is actually killed with a Process.destroy() call, the equivalent of (I think) a kill -9 on Linux. So currently the master doesn't know what its slaves are doing. It can only start them, know when they die, and kill them. One consequence of that is that the master does not know the JMX port of its slaves, which prevents any management operation from occurring within a master/slaves group.

On the other hand, there is a mechanism which creates a TCP connection between a master and each of its slaves. Each slave reads on that connection, but nothing is ever sent by the master. The intent of this mechanism is that, when a master dies, this will cause an IOException in each slave, which will then automatically close itself. This avoids having Java processes hanging on the local machine. We should be able to expand this mechanism to handle a basic protocol, allowing the master to tell the node whether to shut down immediately or to wait until the current tasks are complete. Here, toggling the node's active state cannot be done, because this state is only maintained on the server side: it indicates whether the node is available for job scheduling or not. So we'd need to slightly modify the protocol between node and server, simply adding a flag in the node's response header, to prevent the server from scheduling a new job on that node in the time between the node response being sent and the node actually shutting down.
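
To make the idea of a basic master-to-slave protocol concrete, here is a minimal sketch of what the slave side could look like. Everything in it is an assumption made for illustration: the class SlaveControlProtocol, the one-byte command codes and the Action enum are hypothetical and do not describe the actual JPPF implementation. It reads from a plain InputStream, so the same code would work on the stream of the existing TCP connection.

```java
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical sketch (not part of JPPF) of a one-byte master-to-slave command protocol. */
public class SlaveControlProtocol {
  /** What the slave should do once a command arrives or the connection breaks. */
  public enum Action { STOP_NOW, STOP_ON_TASK_COMPLETE }

  // Assumed command codes, chosen arbitrarily for this example.
  public static final int CMD_STOP_NOW = 1;
  public static final int CMD_STOP_ON_TASK_COMPLETE = 2;

  /**
   * Blocks until the master sends a command or the connection is lost.
   * A lost connection (end of stream or IOException) is treated as "master died",
   * which keeps the existing behavior: the slave closes itself immediately.
   */
  public static Action awaitCommand(InputStream fromMaster) {
    try {
      int b = fromMaster.read();
      if (b == CMD_STOP_ON_TASK_COMPLETE) return Action.STOP_ON_TASK_COMPLETE;
      // CMD_STOP_NOW, end of stream (-1) or an unknown byte: stop right away.
      return Action.STOP_NOW;
    } catch (IOException e) {
      return Action.STOP_NOW; // master process is gone
    }
  }
}
```

Treating every unexpected input as an immediate stop is a deliberate choice in this sketch: it preserves the original goal of never leaving orphaned Java processes on the machine.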

In comparison, it will be much easier to add an "onTaskComplete" flag to the shutdown() JMX call.
#4
Comment posted by
 Daniel Widdis
Aug 11, 10:16, in reply to comment #2
lolo4j wrote:
These are very good ideas. They will require a serious redesign of the
provisioning feature, and due to this they cannot be implemented in a
maintenance release. As a rule, we never include any new feature in a
maintenance release. Sometimes there can be an enhancement if it doesn't
break the existing design. I'm thus pushing this feature to the 5.0
release.

Thanks for considering this.

I didn't think it was appropriate for a maintenance release but the form demanded I enter a target, and 4.3, which I thought it might fit into, was not available. I will expand on the use case of how/why I would find such a feature useful.

In my environment (Rackspace) I pay per minute for my nodes, so it is important to me to structure my network in such a way as to shut down nodes as soon as possible when they complete their work. When my queue gets short I want to restructure some servers on which I have excess nodes, in order to make sure each node is using the full CPU to finish the work more quickly (e.g., I run a master + 3 slave nodes on a 2GB/2CPU machine, but when the queue gets short I want to reduce that "smoothly" to only 2 nodes, letting the 3rd and 4th nodes' work complete without starting new tasks on them).

At present I have a monitoring process that detects when the queue is empty and nodes are idle, and shuts them down. But it would be nice to just do a single loop when the queue empties to tell every node to shut down when it's done. And even better, for nodes to be able to do a final task upon shutting down (sending the API a request to delete themselves, although this can probably be server-side scripted...)
#6
Comment posted by
 lolo4j
Sep 04, 20:16
Implemented the JMX part in trunk revision 3347 (including admin console updates and doc).
#7
Comment posted by
 lolo4j
Sep 06, 15:27
Implemented the provisioning part in trunk revision 3351

Incidentally, this will make it very easy to implement Feature request JPPF-6 - Improvements for nodes in idle mode