Update testing policy: how to use Bodhi

Adam Williamson awilliam at redhat.com
Fri Mar 26 22:49:28 UTC 2010


Hi, folks. At the last QA meeting, I volunteered (dumb of me!) to draft
a policy for testing updates - basically, a policy for what kind of
feedback should be posted in Bodhi for candidate updates.

This turns out to be pretty hard. =) Thinking about it from a
high-level perspective like this, I think it becomes pretty clear that
the current system is just broken.

The major problem is that it attempts to balance things that don't
really balance. It lets you say 'works for me' or 'doesn't work',
counts each kind of feedback, and subtracts the 'doesn't work' count
from the 'works for me' count to give you a 'rating' for the update.
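
To make that concrete, here's a rough sketch of the calculation in
Python; the names are mine, not Bodhi's actual code:

    # Hypothetical sketch of the current 'karma' rating; names are
    # illustrative, not taken from Bodhi's code.
    def karma(feedback):
        works = sum(1 for f in feedback if f == "works for me")
        fails = sum(1 for f in feedback if f == "doesn't work")
        return works - fails

Ten 'works for me' votes and one 'doesn't work' still come out at +9,
even if that single failure is a release-blocking regression.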

This doesn't really mean anything. As has been rehashed many times,
there are situations where an update with a positive rating shouldn't go
out, and situations where an update with a negative rating should. So
the current system isn't really that great.

I can't think of a way to draft a policy to guide the use of the current
system in such a way that it will be really reliable. I think it'd be
much more productive to revise the Bodhi feedback system alongside
producing a policy.

So, here's a summary of what the new system should aim for. 

At a high level, what is this system for? It serves three
purposes:

1) to provide maintainers with information they can use in deciding
whether to push updates.

2) to provide a mechanism for mandating a certain minimum level of
manual testing for 'important' packages, under Bill Nottingham's current
update acceptance criteria proposal.

3) to provide an 'audit trail' we can use to look back on how the
release of a particular update was handled, in the case where there are
problems.

Given the above, we need to capture the following types of feedback,
as far as I can tell (there's a rough code sketch of this taxonomy
after the list). I don't think there is any sensible way to assign
numeric values to any of this feedback. I think we have to trust
people to make sensible decisions as long as it's provided, in
accordance with whatever policy we adopt on what character updates
should have.

1. I have tried this update in my regular day-to-day use and seen no
regressions.

2. I have tried this update in my regular day-to-day use and seen a
regression: bug #XXXXXX.

3. (Where the update claims to fix bug #XXXXXX) I have tried this update
and found that it does fix bug #XXXXXX.

4. (Where the update claims to fix bug #XXXXXX) I have tried this update
and found that it does not fix bug #XXXXXX.

5. I have performed the following planned testing on the update: (link
to test case / test plan) and it passes.

6. I have performed the following planned testing on the update: (link
to test case / test plan) and it fails: bug #XXXXXX.
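
To make the taxonomy concrete, here's how those six types might be
modelled; every name here is hypothetical, none of it is Bodhi's
actual code:

    # Hypothetical model of the six feedback types described above.
    from dataclasses import dataclass
    from enum import Enum
    from typing import Optional

    class FeedbackType(Enum):
        DAY_TO_DAY_OK = 1          # type 1: no regressions in daily use
        DAY_TO_DAY_REGRESSION = 2  # type 2: regression, needs a bug number
        CLAIMED_FIX_CONFIRMED = 3  # type 3: claimed bug fix works
        CLAIMED_FIX_FAILED = 4     # type 4: claimed bug fix doesn't work
        PLANNED_TEST_PASSED = 5    # type 5: test case / plan passes
        PLANNED_TEST_FAILED = 6    # type 6: test case / plan fails

    @dataclass
    class Feedback:
        type: FeedbackType
        bug: Optional[int] = None        # bug number, for types 2, 4 and 6
        test_link: Optional[str] = None  # test case / plan, for types 5 and 6
        comment: str = ""                # the freeform element
        proven_tester: bool = False      # whether the tester is 'proven'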

Testers should be able to file multiple types of feedback in one
operation - for instance, 4+1 (the update didn't fix the bug it
claimed to, but doesn't seem to cause any regressions either).
Ideally, feedback entry should be 'guided' but include a freeform
element, so there's a space to enter bug numbers, for instance.
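
Continuing the sketch above, a combined 4+1 report might look like
this (the bug number and comment are made up):

    # One tester filing types 4 and 1 in a single operation; the bug
    # number and comment are hypothetical.
    report = [
        Feedback(FeedbackType.CLAIMED_FIX_FAILED, bug=123456,
                 comment="still crashes on startup for me"),
        Feedback(FeedbackType.DAY_TO_DAY_OK),
    ]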

There is one type of feedback we don't really want or need to capture:
"I have tried this update and it doesn't fix bug #XXXXXX", where the
update doesn't claim to fix that bug. This is quite a common '-1' in
the current system, and one we should eliminate.

I think Bill's proposed policy can be modified quite easily to fit this.
All it would need to say is that for 'important' updates to be accepted,
they would need to have one 'type 1' feedback from a proven tester, and
no 'type 2' feedback from anyone (or something along those lines; this
isn't the main thrust of my post, please don't sidetrack it too
much :>).
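
In terms of the sketch above, that check might look something like
this (again, just an illustration of the idea, not a concrete
implementation):

    # Rough sketch of the modified acceptance check for 'important'
    # updates, reusing the hypothetical Feedback model above.
    def acceptable(feedback):
        clean_proven = any(f.type is FeedbackType.DAY_TO_DAY_OK
                           and f.proven_tester
                           for f in feedback)
        regression = any(f.type is FeedbackType.DAY_TO_DAY_REGRESSION
                         for f in feedback)
        return clean_proven and not regression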

The system could count how many of each type of feedback any given
update has received, but I don't think there's any way we can sensibly
do some kind of mathematical operation on those numbers to produce a
'rating' for the update. Any such system would, I think, give odd or
undesirable results in some cases (just as the current one does). I
believe the system described above would be sufficiently clear that
there would be no need for such a number, and we would be able to
evaluate updates properly based just on the information listed.
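
A per-type tally, again continuing the hypothetical sketch above, is
as far as the arithmetic needs to go:

    # Show a count per feedback type instead of a single score.
    from collections import Counter

    counts = Counter(f.type for f in report)
    for ftype, n in counts.items():
        print(f"{ftype.name}: {n}")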

What are everyone's thoughts on this? Thanks!
-- 
Adam Williamson
Fedora QA Community Monkey
IRC: adamw | Fedora Talk: adamwill AT fedoraproject DOT org
http://www.happyassassin.net


