Dear list,
I've been working on removing Beaker's dependence on Cobbler for provisioning systems. This mail is to describe the approach I am proposing, and to seek feedback on it.
At present, Beaker requires cobblerd to be running on each lab controller. When a new distro comes along, it must be imported into Cobbler. From there, a script is run to register new Cobbler distros with the Beaker server. (Bill has been doing some work on this side of things separately.)
When it comes time to reboot or power off a system -- either because a user has manually requested it, or the scheduler is starting/stopping a job -- the Beaker server makes a series of XML-RPC calls to cobblerd on the lab controller, requesting that it execute the appropriate power control script. The parameters for the power script are based on the system's power settings stored in Beaker.
Similarly, provisioning a distro on a system means making a series of XML-RPC calls to cobblerd to configure netbooting for the system and then rebooting it.
As a first step towards removing Cobbler, I am tackling the power command side of things. Since version 0.6.14 Beaker already has a per-system queue of power commands, to handle the fact that some systems take a very long time to power on and off. Right now a dedicated thread runs in beakerd (on the Beaker server) processing new power commands, and checking the status of running ones. In each case it has to make an XML-RPC call back to cobblerd on the lab controller.
We can flip this relationship on its head, by making the lab controller poll the Beaker server for new power commands. When it sees one, it executes the power script and reports the result back. This is similar to the way the existing beaker-watchdog daemon works: it polls the server periodically for new and expired watchdog records and acts on them.
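To make the shape of this concrete, here is a minimal sketch of such a polling loop. The server method names (get_queued_commands and so on) are hypothetical stand-ins for whatever RPC interface the patch actually defines:

```python
import time

POLL_INTERVAL = 20  # seconds between polls; an assumed value

def process_queued_commands(server, run_script):
    """Fetch queued power commands from the server, run each one, and
    report the outcome back. Returns (completed, failed) counts.

    `server` stands in for an XML-RPC proxy to the Beaker server; the
    method names here are illustrative, not the real Beaker API.
    """
    completed = failed = 0
    for cmd in server.get_queued_commands():
        try:
            run_script(cmd)   # e.g. exec the configured power script
        except Exception as exc:
            server.mark_command_failed(cmd['id'], str(exc))
            failed += 1
        else:
            server.mark_command_completed(cmd['id'])
            completed += 1
    return completed, failed

def poll_forever(server, run_script):
    """The daemon's main loop: poll, act, report, sleep, repeat."""
    while True:
        process_queued_commands(server, run_script)
        time.sleep(POLL_INTERVAL)
```

The key property is that every connection is initiated by the lab controller; the server never has to reach into the lab.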
The main advantage to this approach is that we no longer need bi-directional communication between lab controllers and the Beaker server. Instead, all requests come from the lab controllers to the server. If a lab controller goes down, or becomes unreachable, errors do not pile up on the Beaker server. The queue for systems in that lab will simply not progress until the lab controller comes back online, at which point it should be able to recover gracefully. It will also allow labs to be behind a NAT. Plus it is more efficient to have the lab controller report back when a power script is finished, rather than having the Beaker server polling cobblerd to check the status of the command.
I have a proof-of-concept patch which implements power command handling. You can view and comment on the patch here:
http://gerrit.beaker-project.org/912
In this patch I have used gevent, which is a library for event-driven asynchronous programming using "greenlets". I wanted to avoid using threads for supervising the power scripts: in a large lab with hundreds of test systems and many power commands running concurrently, having a thread per command (actually at least two, one to read from stdout and one to read from stderr) would waste a lot of memory. I chose gevent over Twisted because a lot of existing code can be used as-is, without porting all of it to Twisted. You can read more about gevent here:
http://www.gevent.org/
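For illustration, here is a rough sketch of the greenlet-per-command idea, assuming a gevent version that ships the cooperative gevent.subprocess module. The script invocations are placeholders (they just print), not real power scripts:

```python
import sys

import gevent
from gevent.subprocess import PIPE, Popen  # cooperative subprocess support

def run_power_command(argv):
    """Supervise one power script inside a greenlet.

    communicate() blocks only this greenlet, not an OS thread, so
    hundreds of concurrent commands cost kilobytes each rather than
    two full thread stacks per command.
    """
    proc = Popen(argv, stdout=PIPE, stderr=PIPE)
    out, _err = proc.communicate()
    return proc.returncode, out.decode()

# Spawn one lightweight greenlet per queued command.
cmds = [[sys.executable, '-c', 'print("powered on %d")' % i]
        for i in range(3)]
greenlets = [gevent.spawn(run_power_command, c) for c in cmds]
gevent.joinall(greenlets)
results = [g.value for g in greenlets]
```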
Still on my todo list for this patch:
* implement timeouts for the power scripts, so that they can't run forever and never return
* add optional support for receiving power commands over AMQP (as an optimisation for polling), like the beaker-watchdog daemon currently has
* make the daemon shut down cleanly: if there are any power scripts running, they should be allowed to complete and report their result before being killed
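For the timeout item, the stdlib's subprocess timeout support illustrates the intended behaviour (the patch itself would presumably use gevent's equivalent, such as gevent.Timeout). The five-minute cap here is an assumed value, not a decided one:

```python
import subprocess
import sys

KILL_TIMEOUT = 300  # assumed cap of five minutes per power script

def run_with_timeout(argv, timeout=KILL_TIMEOUT):
    """Run a power script, killing it if it exceeds the timeout so it
    cannot run forever and never return."""
    proc = subprocess.Popen(argv, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE)
    try:
        out, _err = proc.communicate(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.communicate()  # reap the killed child
        raise RuntimeError('power script exceeded %ds timeout' % timeout)
    return proc.returncode, out.decode()
```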
As a next step towards removing Cobbler, we can expand the command queue to include provisioning commands and have the new beaker-provision daemon process those also. I will be working on this next.
On 01/31/2012 07:06 AM, Dan Callaghan wrote:
> Dear list,
>
> I've been working on removing Beaker's dependence on Cobbler for provisioning systems. This mail is to describe the approach I am proposing, and to seek feedback on it.
>
> At present, Beaker requires cobblerd to be running on each lab controller. When a new distro comes along, it must be imported into Cobbler. From there, a script is run to register new Cobbler distros with the Beaker server. (Bill has been doing some work on this side of things separately.)
>
> When it comes time to reboot or power off a system -- either because a user has manually requested it, or the scheduler is starting/stopping a job -- the Beaker server makes a series of XML-RPC calls to cobblerd on the lab controller, requesting that it execute the appropriate power control script. The parameters for the power script are based on the system's power settings stored in Beaker.
>
> Similarly, provisioning a distro on a system means making a series of XML-RPC calls to cobblerd to configure netbooting for the system and then rebooting it.
>
> As a first step towards removing Cobbler, I am tackling the power command side of things. Since version 0.6.14 Beaker already has a per-system queue of power commands, to handle the fact that some systems take a very long time to power on and off.
Could we squeeze the queued calls (except the pending one) into one?
- Reboot+Reboot -> Reboot
- Reboot+Power-off -> Power-off
- Reboot+Power-on -> Reboot
- Power-off+Power-on -> Reboot
- ...
In some cases we can include the pending call in the game as well:
- [Reboot] + Reboot -> [Reboot]
- [Reboot] + Power-on -> [Reboot]
- ...
Looks like a simple filter would do...
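A sketch of what such a filter might look like, with hypothetical short command names ('on', 'off', 'reboot') standing in for Beaker's real command actions:

```python
# Coalescing rules following the table above: when a new command is
# enqueued behind an existing queued (not yet running) one, the pair
# collapses to a single equivalent command. The last three pairs are
# assumed extensions by the same logic, not from the table.
COALESCE = {
    ('reboot', 'reboot'): 'reboot',
    ('reboot', 'off'): 'off',
    ('reboot', 'on'): 'reboot',
    ('off', 'on'): 'reboot',
    ('on', 'off'): 'off',
    ('off', 'off'): 'off',
    ('on', 'on'): 'on',
}

def enqueue(queue, new_cmd):
    """Add new_cmd to the queue, merging it with the last queued
    command when the pair has a single-command equivalent."""
    if queue:
        merged = COALESCE.get((queue[-1], new_cmd))
        if merged is not None:
            queue[-1] = merged
            return queue
    queue.append(new_cmd)
    return queue
```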
> Right now a dedicated thread runs in beakerd (on the Beaker server) processing new power commands, and checking the status of running ones. In each case it has to make an XML-RPC call back to cobblerd on the lab controller.
>
> We can flip this relationship on its head, by making the lab controller poll the Beaker server for new power commands. When it sees one, it executes the power script and reports the result back. This is similar to the way the existing beaker-watchdog daemon works: it polls the server periodically for new and expired watchdog records and acts on them.
From my side just one thing: is Provisioning in the same queue as Power-commands?
This affects the answers to the following two questions:

- When does Beaker see the machine as free?
- When is PXE boot on the LC set?
Both should happen only after previously queued power commands complete and the one issued by the caller succeeds.
If not, imagine the following scenario:

[Impatient user here]
Reboot
[Watch, nothing's going on]
Reboot
[Watch for a bit longer]
Return the machine to Beaker

Beaker schedules a recipe, sets PXE boot on Cobbler, and now the sequence of reboots would take place instead of the job...
-- Marian
> The main advantage to this approach is that we no longer need bi-directional communication between lab controllers and the Beaker server. Instead, all requests come from the lab controllers to the server. If a lab controller goes down, or becomes unreachable, errors do not pile up on the Beaker server. The queue for systems in that lab will simply not progress until the lab controller comes back online, at which point it should be able to recover gracefully. It will also allow labs to be behind a NAT. Plus it is more efficient to have the lab controller report back when a power script is finished, rather than having the Beaker server polling cobblerd to check the status of the command.
>
> I have a proof-of-concept patch which implements power command handling. You can view and comment on the patch here:
>
> http://gerrit.beaker-project.org/912
>
> In this patch I have used gevent, which is a library for event-driven asynchronous programming using "greenlets". I wanted to avoid using threads for supervising the power scripts, because in a large lab with hundreds of test systems and many power commands running concurrently, having a thread per command (actually at least two, one to read from stdout and one to read from stderr) would waste a lot of memory. I chose gevent over Twisted because a lot of existing code can be used as-is, without porting all of it to Twisted. You can read more about gevent here:
>
> http://www.gevent.org/
>
> Still on my todo list for this patch:
>
> * implement timeouts for the power scripts, so that they can't run forever and never return
> * add optional support for receiving power commands over AMQP (as an optimisation for polling), like the beaker-watchdog daemon currently has
> * make the daemon shut down cleanly: if there are any power scripts running, they should be allowed to complete and report their result before being killed
>
> As a next step towards removing Cobbler, we can expand the command queue to include provisioning commands and have the new beaker-provision daemon process those also. I will be working on this next.
Beaker-devel mailing list
Beaker-devel@lists.fedorahosted.org
https://fedorahosted.org/mailman/listinfo/beaker-devel
Excerpts from Marian Csontos's message of Tue Jan 31 17:40:45 +1000 2012:
> Could we squeeze the queued calls (except the pending one) into one?
>
> - Reboot+Reboot -> Reboot
> - Reboot+Power-off -> Power-off
> - Reboot+Power-on -> Reboot
> - Power-off+Power-on -> Reboot
> - ...
>
> In some cases we can include the pending call in the game as well:
>
> - [Reboot] + Reboot -> [Reboot]
> - [Reboot] + Power-on -> [Reboot]
> - ...
>
> Looks like a simple filter would do...
We could add some logic on the server side to optimise the queue when new commands are added. I don't think the beaker-provision daemon needs to worry about this. However, if we are removing or altering commands on the server side while they are queued, we have to be careful not to introduce a race between the time beaker-provision fetches a command and the time it starts running it.
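One way to avoid that race is for beaker-provision to claim each command with an atomic state transition, so the server only ever rewrites or coalesces commands still in the queued state. A minimal sketch, using SQLite for illustration and hypothetical table/column names (not Beaker's real schema):

```python
import sqlite3

def claim_command(conn, command_id):
    """Atomically move a command from 'queued' to 'running'.

    The conditional UPDATE succeeds for exactly one caller, so the
    daemon that wins the race runs the command and the server knows
    not to touch it any more. Table and column names are hypothetical.
    """
    cur = conn.execute(
        "UPDATE command_queue SET status = 'running' "
        "WHERE id = ? AND status = 'queued'",
        (command_id,))
    conn.commit()
    return cur.rowcount == 1
```

The same conditional-update pattern works against Beaker's real database through whatever ORM layer the server uses.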
I don't think the scheduler will ever enqueue consecutive power commands, except if a system is powered off at the end of a recipe and then immediately rebooted for the next recipe. That becomes the same as power off + power on, which for most power types is equivalent to a reboot. So I don't think there's much to be gained for the scheduler.
However a user could manually queue up lots of power commands using the web UI -- in that case, maybe we should honour them even if they make no sense (such as power on + power off, or reboot five times, or similar)? Is there any reason why someone would want to enqueue multiple power commands, apart from "I am impatient so I will mash this button"? A better approach would be to add some sanity checking in the web UI, or else just assume the user knows what they are doing and honour it.
> From my side just one thing: is Provisioning in the same queue as Power-commands?
>
> This affects the answers to the following two questions:
>
> - When does Beaker see the machine as free?
> - When is PXE boot on the LC set?
>
> Both should happen only after previously queued power commands complete and the one issued by the caller succeeds.
>
> If not, imagine the following scenario:
>
> [Impatient user here]
> Reboot
> [Watch, nothing's going on]
> Reboot
> [Watch for a bit longer]
> Return the machine to Beaker
>
> Beaker schedules a recipe, sets PXE boot on Cobbler, and now the sequence of reboots would take place instead of the job...
Right now configuring netboot is done synchronously, but power commands (including reboot for provisioning) are done asynchronously through the command queue. So the scenario you describe is possible right now.
We've already agreed that this is not ideal, and that provisioning should be done through the command queue to avoid those kinds of problems. That will be happening for the Cobbler removal effort (or even sooner).