Review of Fedora 18 Release Criteria

Tue Oct 9 23:07:54 UTC 2012

On Tue, 2012-10-09 at 16:48 -0400, David Cantrell wrote:
> On Tue, Oct 09, 2012 at 07:19:25AM -0600, Tim Flink wrote:
> > As we're getting closer to the scheduled time for beta freeze, we'd
> > like to find out now if any of the current criteria or proposed criteria
> > changes are unreasonable to expect for beta. There may be more changes
> > for final as we get closer to that but I think that we're pretty close
> > to being done with the release requirements for beta.
> > 
> > The current (as of this writing) release criteria are available at:
> >  - http://fedoraproject.org/wiki/Fedora_18_Alpha_Release_Criteria
> >  - http://fedoraproject.org/wiki/Fedora_18_Beta_Release_Criteria
> >  - http://fedoraproject.org/wiki/Fedora_18_Final_Release_Criteria

Thanks David! Some thoughts follow.

> I would like to see changes to the blocker criteria for each release.  The
> first item on each release criteria is that all blockers must be CLOSED.
> Blockers are determined by criteria defined below which always group
> anaconda in because we cannot address those problems in a later update
> release.  This gets us on the bug fixing treadmill as we edge closer to each
> release because every anaconda bug more or less becomes a blocker.  

This paragraph was a bit tricky to read, but now that I've given it a
few tries, it seems to be more or less a preamble, yes? I'm not sure if
you're suggesting that "Blockers are determined by criteria defined
below which always group anaconda in because we cannot address those
problems in a later update release.  This gets us on the bug fixing
treadmill as we edge closer to each release because every anaconda bug
more or less becomes a blocker." is a problem, or just mentioning it as
background. It's perfectly true as background, but I don't see it as a
problem: it's just an innate characteristic of the software you write.
The installer is something that cannot be updated (for practical
purposes), so it must work to a high standard as shipped, because if it
doesn't, that's a much bigger problem than a component which _can_ be
updated not working. I agree with your assessment, but I see it as an
inherent characteristic of an operating system installer, not any kind
of problem in the process.

> What I
> would like to see:
> 
> 1) Installer blocker criteria needs to be more and more restrictive the
> closer we get to a final release.

This wasn't entirely clear to me, but I'm going to take a guess at what
I think you mean and reply to that. I think you're looking at the
situation where we get late into the validation process - say, we just
built RC2 and it's two days to go/no-go - and we find five bugs and mark
them as blockers. I'm guessing you're saying it'd be preferable to
identify blockers early and we should only add issues to the blocker
list late if they're _really bad_, because otherwise you just keep
fixing blockers. On that basis...

I appreciate that the 'blocker treadmill', as you describe it, can be
frustrating. But I don't think 'let's just not count bad issues as
blockers late in the process to give the developers a break' is the
answer to anything (except possibly 'how can we stop Will depleting the
U.S. strategic gin reserve?', but that's not the question this post was
trying to answer :>). What we're trying to do with the release
validation process as a whole is provide a clear framework for defining
the standards our releases should meet and a clear process for building
releases that meet those standards and verifying that they meet those
standards. I don't see that adding an element of time sensitivity to the
blocker evaluation process - 'issues of the type X are blockers if we
find them four weeks before release, but not if we find them one week
before release' - is a good way to achieve this. 

'Blocker bugs' are just the 'release quality' question inverted: they
are the ways in which our releases must not be broken in order to meet
the minimum quality standards we've decided on. An issue which causes us
to fall below our minimum quality standards is a problem no matter when
it's discovered. I absolutely understand that it makes things easier for
the developers if we catch blocker bugs early. We agree this is an
important goal, and we have made and will continue to make efforts to
improve our ability to catch blockers as early as possible. I know it
sucks when we're on RC3 and we suddenly discover a major bug. But it's
still a major bug, and 'say it's not a blocker because we're late in the
process' doesn't sound like a good response to that suckage, to me. I
don't want to do that.

I believe we should set realistic minimum standards - those that are
achievable with the level of development resources we have in place, on
the release schedules Fedora is committed to. What this thread is about,
essentially, is checking that we are not currently setting that bar too
high, and demanding from you more than you have the resources to
possibly provide in the time available. We certainly believe that we
need input from the development teams to know where the bar should be
set. But I do believe the bar should be a bar, not a fuzzy field that
can be adjusted with excessive pragmatism. We should set realistic
standards, but they have to be solid ones that we don't compromise just
because time is short or the developers are getting tired of fixing
bugs.

What we (QA) as a team do try and do in those cases is look at the
situation and think what we could do in future to ensure the blocker
would get caught earlier. For instance, in the last few releases we've
been making a more concerted effort to complete testing even on TC/RC
builds that have obvious showstoppers - to catch the other bugs 'behind'
the showstoppers, rather than just catching the showstoppers, focusing
work on getting them fixed, and only then continuing with testing of
other functionality.

I don't mean to start a finger-pointing match, but I do think it's worth
bearing in mind that the 'blocker treadmill' is much more likely to
happen when there are major changes to anaconda, because these massively
increase the surface area of code that's prone to causing blocker bugs.
When we do a release, we can say with a reasonable degree of certainty
that the code in that release probably contains very few blocker bugs -
only ones we didn't catch in the validation process it just went
through. 

If we then do another release in which that code isn't changed very
much, well, we aren't likely to have two hundred new blocker bugs. 

But if we (Fedora) do, oh, let's just say as _entirely theoretical
examples_, rewrite the entire storage backend, or replace the entire
first stage of the whole installer, or rewrite the entire user
interface...we've just thrown out all the code that's relatively well
known to be 'blocker free', and replaced it with an entirely new chunk
of code about which we know just about nothing from a quality
perspective. Statistically speaking, no matter how awesome the person or
people writing it, that new chunk of code is very likely to contain more
blockers than the code it replaced. Major changes to the code inevitably
result in more blockers being present, and thus more blocker treadmill,
than light-touch maintenance of a mature codebase does. We (QA) are
always going to be able to find ten blockers in a well-known codebase
much faster than we can find two hundred blockers in a heavily revised
codebase.

Certainly QA has some responsibility for the 'blocker treadmill': as I
noted above, it's our responsibility to try and identify blocker bugs as
early and as quickly as possible, and this is something we can and
should always look to improve. But developers also have responsibility
for it. If you're stuck on a 'blocker treadmill' it could be an
indication that QA could and should have discovered the blocker bugs
faster, but it could also be an indication that you have been too
ambitious in your planning in terms of what amount of new or revised
code of acceptable quality you expected to be able to implement in what
time frame, and consequently you have delivered code that is heavily
bugged, at a late enough point in the development cycle that you
immediately wind up on a 'blocker treadmill' just fixing all the bugs in
the code you just delivered. I don't think it's controversial to say
this has been known to be a problem in the world of software development
before :)

> 2) Installer blockers should only be granted when there is no other way to
> accomplish the same task during installation.  For example, if FCoE
> configuration does not always work in the UI but does work when passed boot
> parameters or via kickstart, we shouldn't consider it a blocker.  It's an
> unfortunate bug, but as described there is an install path for those users.

In practice we do and always have considered workarounds in evaluating
blocker status for bugs. This isn't brilliantly called out in the
criteria pages, I admit, and we should improve that. The section
'(release) Blocker Bugs', right below the criteria, could really do with
some adjustment.

It's hard to be more precise than this because workaround evaluation is
one part of the blocker review process that more or less inevitably
continues to involve subjectivity, and it's very much a bug-by-bug
thing. But obviously, the more severe and more commonly-encountered the
issue, the less likely we usually are to accept 'there's a workaround'
as a reason not to take it as a blocker. The ease of the workaround and
the likelihood of a user thinking of it themselves - or at least
figuring that there _might_ be a workaround, and they should go and look
for one - are also taken into consideration.

So...we do consider workarounds. And yes, this should be explained more
clearly in the process documentation, we'll address that. I don't think
we should accept your principle - "Installer blockers should only be
granted when there is no other way to accomplish the same task during
installation." - as solidly as it's stated, though, as it removes too
much flexibility in the evaluation process. 

To give a competing example, in anaconda 18.13 there is a bug in the new
partitioning process - I call it 'guided partitioning', the dialog which
attempts to help you free up space on a full disk, by deleting or
shrinking partitions - which causes it to crash when trying to delete
partitions. But if you go into the 'custom partitioning' interface you
can successfully delete partitions. So by your principle, we would not
take that bug as a release blocker. I don't think that would be a good
decision: we should not release an installer which crashes when you try
to follow the path you're guided to, for freeing up space to install the
operating system. 'Don't do what the installer recommends you do,
instead go into this advanced process that's supposed to be for experts'
is a workaround, and hence satisfies your requirement for not-a-blocker,
but I really don't think it's a good story to tell people in the case of
such a critical bit of functionality.

It also has a clear negative effect on the very problem we're discussing
here: the broken code cannot get any testing. All we can know about a
codepath that's broken, but for which we accepted a workaround that
dodges the broken codepath, is 'it's completely broken'. If the broken
codepath is not treated as a blocker and fixed rapidly, we cannot test
it 'beyond' the blocker bug. There might be five further blockers behind
that bug, or just _regular_ bugs ('the UI sucks', 'it doesn't offer to
let me resize a partition it should have done', 'it prints a bogus error
message when I delete a partition'...all those kinds of perfectly normal
bugs), but if we take a workaround and call it 'not a blocker' it gets
dropped in priority, likely doesn't get fixed for weeks, and when it
turns out there are five other bugs 'behind' the showstopper...well, they
go on your treadmill. =) Accepting workarounds too readily actually
_impedes_ our ability to find other bugs swiftly.

> 3) Ultimately we want the number of granted blockers to be lower and lower
> from alpha to beta to rc.

I understand the motivation behind this, and I think it's a goal we can
attempt to address by ensuring comprehensive testing is done early (and
a goal you can help to address by ensuring major code changes land early
enough to be tested, and budgeting time and resources for fixing the
bugs that will _inevitably be present_ in any large chunk of new code).
But I don't think 'make it harder for a bug to qualify as a blocker the
later we get in the release process' is a good thing to do, even though
it would help to achieve this goal. To me it looks like a process hack
which would ultimately damage the quality of our releases. It's actually
something that we've specifically tried *not* to do in the blocker
review process, since it was implemented. We have very intentionally
attempted to review bugs 'impartially', treating blocker status as
something a bug either should have or should not have on its own merits,
and attempting not to take into account things that strictly should not
be taken into account, like 'is there a fix already?' or 'how close are
we to release?'

Thanks again for your thoughts!
-- 
Adam Williamson
Fedora QA Community Monkey
IRC: adamw | Twitter: AdamW_Fedora | identi.ca: adamwfedora
http://www.happyassassin.net