Defining a harness API
by Dan Callaghan
An important first step towards supporting alternative harnesses and/or
the mythical Beaker Simple Harness is coming up with a stable,
documented API for the harness to interact with Beaker. So I'm starting
this thread now to get the ball rolling.
My first thoughts:
* It should have the smallest possible surface area -- just enough to
expose all of Beaker's functionality and nothing more.
* It should be defined from the point of view of lab controller <-> test
system, not scheduler <-> test system. Corollary: we might need to
start treating the lab controller as a first-class citizen instead of
just a dumb proxy to the scheduler.
* Just because we use XML-RPC now doesn't mean it's the best choice,
nor that we need to keep using it.
* In particular, the HTTP protocol probably supports everything we need
to build the API (in other words, it can be a "RESTful API" even
though I hate that phrase). For example logs could be uploaded using
HTTP PUT (with Content-Range), which means if we wanted we could
potentially use Apache mod_dav_fs to efficiently write these directly
to the filesystem without any intervening Beaker code.
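To make the Content-Range idea concrete, here is a rough sketch in Python of what a chunked log upload from the harness side could look like. The endpoint path is hypothetical (no harness API endpoints exist yet); only the header mechanics are the point:

```python
# Sketch only: PUT one chunk of a log file at a byte offset, using
# Content-Range so the server (or mod_dav_fs) knows where it belongs.
# The host/path are hypothetical placeholders, not real Beaker URLs.
import http.client

def content_range_header(offset, length, total):
    """Build a Content-Range value for a chunk of `length` bytes
    starting at `offset` within a file of `total` bytes."""
    return "bytes %d-%d/%d" % (offset, offset + length - 1, total)

def upload_log_chunk(host, path, data, offset, total_size):
    """Upload one chunk of a log via HTTP PUT; returns the status code."""
    conn = http.client.HTTPConnection(host)
    headers = {
        "Content-Range": content_range_header(offset, len(data), total_size),
        "Content-Type": "text/plain",
    }
    conn.request("PUT", path, body=data, headers=headers)
    resp = conn.getresponse()
    resp.read()
    conn.close()
    return resp.status
```

The attraction is that the server needs no per-upload state: each chunk says where it belongs.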
There are four areas the API needs to cover (that I can think of). The
harness needs to be able to:
* find out what to run
* report results and upload logs
* extend the watchdog time
* synchronize with other recipes in the recipe set
This last area is new: it's currently handled entirely in beah, and
there is no API on the Beaker side for it. But the dynamic FQDNs in
Beaker 0.10 mean it is now needed, even if it's as simple as having
a call to wait for all other harnesses in the recipe set to check in.
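The four areas above could be sketched as a minimal client interface. All of the names here are hypothetical illustrations of the surface area, not a proposed wire format:

```python
# Sketch only: the smallest plausible surface area for a harness API,
# one method per area listed above. Method names and signatures are
# invented for illustration.
class HarnessAPI:
    """What a harness needs from Beaker, and nothing more."""

    def next_task(self, recipe_id):
        """Find out what to run next (None when the recipe is done)."""
        raise NotImplementedError

    def report_result(self, task_id, result, score=None, log_paths=()):
        """Report a result and upload any associated logs."""
        raise NotImplementedError

    def extend_watchdog(self, recipe_id, seconds):
        """Push the external watchdog out by `seconds`."""
        raise NotImplementedError

    def wait_for_recipeset(self, recipe_set_id, timeout=None):
        """Block until all other harnesses in the recipe set check in
        (the new synchronization area needed by dynamic FQDNs)."""
        raise NotImplementedError
```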
Thoughts?
--
Dan Callaghan <dcallagh(a)redhat.com>
Software Engineer, Infrastructure Engineering and Development
Red Hat, Inc.
Scheduling recipe sets rather than recipes
by Nick Coghlan
The current scheduler works almost purely at the recipe level. The
extent to which it pays attention to recipesets pretty much amounts to
ensuring all recipes in a recipe set are scheduled on the same lab
controller.
This creates some interesting problems with multi-host testing:
- a recipe set with strict host requirements for only some systems may
hold on to common systems for a long time while waiting for rare ones
(the addition of dynamic virt support opens the door for a recipe set to
hold on to dynamic virt resources while waiting for physical hardware
for other recipes)
- recipe sets scheduled for unique systems may deadlock if a high
priority job is competing with a previously queued low priority job
which has already claimed some resources
To better explain the latter problem, consider a lab with only 2
systems, A and B, containing a particular piece of hardware, and a
multi-host recipe set that needs both of them. Queue a low priority
version of that job (Job 1) while a test is running on system A: Job 1
will immediately claim system B for one recipe, while the other remains
in the queue. If a high priority copy of the job (Job 2) is added before
the test running on system A completes, then system A will be claimed by
Job 2. This leaves the two jobs in a classic ABBA deadlock: Job 1 holds
System B and is waiting for System A, while Job 2 holds System A and is
waiting for System B.
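The scenario can be reproduced with a toy model of the current per-recipe greedy assignment. This is only a sketch; the function and data shapes are invented for illustration:

```python
# Sketch only: a toy version of per-recipe greedy scheduling,
# enough to reproduce the ABBA deadlock described above.
def assign_per_recipe(queue, free_systems):
    """queue: list of (job, systems-the-job-needs), sorted by priority.
    Each job claims whatever free systems it can, one recipe at a time,
    even if it can't get them all. Returns {system: job} of new claims."""
    claims = {}
    for job, needed in queue:
        for system in needed:
            if system in free_systems and system not in claims:
                claims[system] = job
    return claims

# Lab with systems A and B; A is busy running another test.
# Job1 (low priority) is queued and needs both systems.
claims = assign_per_recipe([("Job1", ["A", "B"])], free_systems={"B"})
# Job1 grabs B immediately; its other recipe stays queued.

# A frees up, but by then high-priority Job2 (same requirements)
# is at the front of the queue.
claims2 = assign_per_recipe(
    [("Job2", ["A", "B"]), ("Job1", ["A", "B"])], free_systems={"A"})
# Job2 grabs A: Job1 holds B waiting for A, Job2 holds A waiting
# for B -- the classic ABBA deadlock.
```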
Some of the metrics support being added in 0.11 is actually about
measuring the overall impact of the first problem (by seeing what
proportion of their time systems spend in the Scheduled state).
For other reasons to do with being able to effectively partition the
scheduling task between multiple schedulers each handling the systems
managed by a particular lab controller, I've been considering proposing
the inclusion of a "Claimed" state in the recipe lifecycle. The
"Claimed" state would fit between "Queued" and "Scheduled", and indicate
that the recipe had been assigned to a specific lab controller, but not
yet assigned to a specific system (at the moment, this state change is
handled implicitly through setting "recipe.recipeset.lab_controller"
when the first recipe in the recipeset is scheduled).
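A rough sketch of where the explicit state could sit. The transition rule below is this proposal, not current Beaker behaviour, and the surrounding state names are an approximation of the existing lifecycle:

```python
# Sketch only: making the implicit lab-controller binding an explicit
# "Claimed" state between Queued and Scheduled. Plain dicts stand in
# for the SQLAlchemy model.
RECIPE_STATES = ["New", "Processed", "Queued", "Claimed",
                 "Scheduled", "Running", "Completed"]

def claim(recipe, lab_controller):
    """Queued -> Claimed: bind the recipe to a specific lab controller
    without yet assigning a specific system (today this happens
    implicitly via recipe.recipeset.lab_controller)."""
    assert recipe["state"] == "Queued"
    recipe["state"] = "Claimed"
    recipe["lab_controller"] = lab_controller
    return recipe
```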
Furthermore, the scheduler would be updated to work on a *cached* copy
of the System status data. This is needed to avoid the current problem
where there's a race condition with system status changes occurring
during a scheduling pass leading to recipes jumping the queue (I'm
interested in hearing about relatively clean ways to do this with
SQLAlchemy, though:
http://stackoverflow.com/questions/13983067/cached-reads-immediate-writes...)
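In outline, a snapshot-based pass could look something like this (a sketch only, with plain dicts standing in for the SQLAlchemy model):

```python
# Sketch only: take a cached copy of system statuses at the start of a
# scheduling pass, make every decision against that frozen snapshot,
# and only apply the decisions afterwards -- so a status change mid-pass
# can't let a later recipe jump the queue.
def scheduling_pass(live_status, queue):
    """live_status: {system: status}; queue: ordered (recipe, system)
    candidate pairs. Returns the assignments decided this pass."""
    snapshot = dict(live_status)  # cached copy, frozen for this pass
    decisions = []
    for recipe, system in queue:
        if snapshot.get(system) == "Idle":
            snapshot[system] = "Scheduled"  # decide against the cache...
            decisions.append((recipe, system))
    return decisions  # ...then write back to the real model in one step
```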
In combination, these two would allow Claimed recipes to be given
priority over Queued recipes on subsequent passes, preventing the
deadlock problem and theoretically also improving system utilization.
One social challenge with addressing this is that we don't want to
enable/encourage queue jumping for rare systems by scheduling them in a
recipe set with a job that will be scheduled quickly, but I'm not sure
we can solve that at the technical level.
Cheers,
Nick.
--
Nick Coghlan
Red Hat Infrastructure Engineering & Development, Brisbane
Python Applications Team Lead
Beaker Development Lead (http://beaker-project.org/)
GlobalSync Development Lead (http://pulpdist.readthedocs.org)
Design sketch: RPM task for running beah/beakerlib tests from Git repos
by Nick Coghlan
I was pondering the question of running tests from Git repos recently,
and Dan's recent efforts in resurrecting patchbot (to sanity check
patches on Gerrit), which kinda does exactly that for our own dogfood
tests, prompted me to post my ideas for people to poke holes in :)
Goal:
Allow developers to run tests based on the existing results reporting
infrastructure directly from Git, without the need to build a test RPM
Benefits:
- eliminates a step in the test development workflow (bkr task-add)
- avoids versioning issues when updating tests
- potentially helps with VM-image-library-based testing (since the
runtest.sh in a "run from Git" task gives us another location for
harness code execution, independent of kickstart %post snippets)
(Deliberate) Limitations:
- retains the dependency on beah/beakerlib for setting up the
environment and reporting results
- thus doesn't help with the cross-platform testing problem (and, to be
frank, I don't think we *should* ever try to solve that problem directly
- instead, we eventually need to figure out how to integrate STAF,
autotest or both as alternatives to beah for the components that run on
the system under test. That's a much harder problem than simply allowing
ordinary beah/beakerlib tests to be executed from Git, though, since the
differences in execution and reporting models would need to be aligned
somehow)
Design Details:
The proposal is fairly simple:
- create a new standard task (maintained in the main beaker repo) that
accepts parameters defining:
- a Git URL to clone
- a test execution command to be run from the base directory of the
clone
- the task's runtest.sh would take care of any setup-and-teardown needed
to clone the repo, run the test and then delete the repo again
- the exact Git changeset id for the checked out repo would be reported
as part of the test details (for cases where the submitted URL doesn't
specify a particular tag or revision)
- (What else would we need in the task parameters to make up for the
lack of per-test-case tasks? Probably everything that would otherwise be
set in runtest.sh)
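The clone-run-cleanup cycle the task's runtest.sh would perform can be sketched as follows. This is written in Python purely for illustration; the parameter names are the ones proposed above, and result reporting is left out since it would go through beah/beakerlib:

```python
# Sketch only: what the proposed "run from Git" task would do -- clone
# the repo, record the exact changeset id, run the test command, then
# delete the repo again.
import shutil
import subprocess
import tempfile

def clone_command(git_url, workdir):
    """The git invocation used to fetch the repo under test."""
    return ["git", "clone", git_url, workdir]

def run_git_task(git_url, test_command):
    workdir = tempfile.mkdtemp(prefix="git-task-")
    try:
        subprocess.check_call(clone_command(git_url, workdir))
        # Report the exact changeset id, since the submitted URL may
        # not pin a particular tag or revision.
        changeset = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], cwd=workdir).strip()
        status = subprocess.call(test_command, shell=True, cwd=workdir)
        return changeset, status
    finally:
        shutil.rmtree(workdir, ignore_errors=True)  # teardown
```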
Possible additions:
- while the standard version of the task would permit arbitrary Git
URLs, it likely wouldn't be hard to create a modified version that only
allowed URLs from a defined subset of hosts.
Actually implementing this isn't high on my priority list at the moment,
if the above idea seems workable, then it should be a lot easier to make
happen than an approach that requires server side changes.
Cheers,
Nick.
--
Nick Coghlan
Red Hat Infrastructure Engineering & Development, Brisbane
Python Applications Team Lead
Beaker Development Lead (http://beaker-project.org/)
GlobalSync Development Lead (http://pulpdist.readthedocs.org)
Support for extracting metrics data via raw SQL
by Nick Coghlan
Something we're focusing on at the moment is improving the ability to
extract metrics data from a Beaker installation without relying on the
main server (either via the web UI or the XML-RPC interface). The
problem with relying on either of those is that some of the more
interesting queries can be quite resource intensive, and end up
interfering with the operation of the job scheduler.
For the more volatile metrics (like current system utilisation and the
state of the recipe queue), Beaker 0.11 will be sending several
additional signals to Graphite. We likely won't have a nice dashboard
for those in this release (instead relying on a few direct links to
appropriately designed graphs in Graphite web UI), but creating a
"Beaker Dashboard" for an installation is definitely on the cards for
the subsequent release.
For the more resource intensive queries though, I'd like to be able to
rely on data aggregation systems like Teiid and business reporting tools
like Jasper Reports. That means:
1. Identifying the Beaker metrics which we think are interesting (and
aren't covered by the Graphite data)
2. Figuring out how to extract those from the database schema
3. Figuring out how to publish them to Beaker users in a way that allows
them to be used in a reporting system like Jasper, but won't be a
nightmare for us to maintain
Amit's patch at http://gerrit.beaker-project.org/#/c/1546 is a decent
attempt at 1 and 2, but ultimately fails 3.
Amit, Dan and I spent some time discussing this on IRC, so here's what
we're currently thinking:
1. Create a new location in the Beaker source repo for metrics .sql files
2. Have a section in the admin guide on metrics extraction (as Amit's
patch does), but:
- drop the "Job Congestion Measurement" use case (already covered by
the Graphite metrics)
- drop the "Hardware Utilization and Coverage" use case (this part
of the schema is seriously complicated, so it's better to use the Search
UI on the main server. Perhaps mention using that UI to craft a query
and then adding "&tg_format=atom&list_tgp_limit=0" to get the results as
a machine readable list of links as per
http://beaker-project.org/server-api/http.html#system-inventory-information)
- modify the User Accounting section to focus on specific
architectures, rather than distro versions
- adjust the remaining sections to reference the appropriate .sql
file instead of including the SQL in line
- add a caveat noting that the details of these queries may change
between releases, but this should always be mentioned in the release notes
3. For each .sql file, have a test case which runs that SQL and checks
it gives the same answer as the SQLAlchemy model
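Point 3 could look roughly like this. The schema and metric below are an invented miniature (sqlite3 standing in for the real database, a plain function standing in for the SQLAlchemy model query); only the shape of the test is the point:

```python
# Sketch only: a test that runs one of the proposed .sql files directly
# and checks it matches what the model-level query reports, so schema
# changes that break a published query fail in CI.
import sqlite3

# In the real layout this would be read from the metrics .sql file.
METRIC_SQL = "SELECT user, COUNT(*) FROM recipe GROUP BY user ORDER BY user"

def model_metric(rows):
    """The same metric computed the 'model' way, from (user, arch) rows."""
    counts = {}
    for user, _arch in rows:
        counts[user] = counts.get(user, 0) + 1
    return sorted(counts.items())

def test_metric_sql_matches_model():
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE recipe (user TEXT, arch TEXT)")
    rows = [("alice", "x86_64"), ("bob", "ppc64"), ("alice", "s390x")]
    db.executemany("INSERT INTO recipe VALUES (?, ?)", rows)
    assert db.execute(METRIC_SQL).fetchall() == model_metric(rows)
```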
Cheers,
Nick.
--
Nick Coghlan
Red Hat Infrastructure Engineering & Development, Brisbane
Python Applications Team Lead
Beaker Development Lead (http://beaker-project.org/)
GlobalSync Development Lead (http://pulpdist.readthedocs.org)