Fedora 19 status is ALIVE, GA on July 02, 2013

Fri Jun 28 22:20:09 UTC 2013

On Fri, 2013-06-28 at 15:42 -0600, Chris Murphy wrote:

> But the final release series of RC's happen very quickly, and any
> allowed change is by definition significant (i.e. necessary) or it
> simply wouldn't happen, but that also makes the change higher risk
> than other changes. So I think more time padding is needed between an
> RC and go/nogo.

I think you may be labouring under a bit of a misapprehension about what
should be tested, here. The distinction between a TC and an RC is not
large. An RC can only happen after freeze and must have all blockers
fixed: if a build after freeze doesn't have all blockers addressed, we
call it a TC.

We have gotten better at finding blockers earlier in recent releases.
What this means is that we're doing fewer but _better_ RCs. Back around
F14 or F15, our first 'RC' build was pretty much a joke; there was never
any chance it was actually going to get released. We'd find five
blockers in it straight away. I think in the last few release cycles,
though, we've actually released RC1 or RC2 several times.

I don't really see there being some big distinction between TCs and RCs.
If you want to make sure some workflow that's important to you is going
to work, it's really a good idea to follow the process from TC1; there is
no mileage in jumping in at RC1, which is too late. (Never mind the people
who, every release, seem to jump in and start testing on the day we do
go/no-go and then kick up a fuss about whatever they find...not you, but
it does seem to happen).

That doesn't quite apply to this specific case, as it happens, but it's
an important point to make.

Getting down to specifics: the change that we believe broke this -
trying to re-use an existing EFI system partition if one is present
instead of always creating a new one - went into anaconda 19.30.10. TC6
had 19.30.9, RC1 had 19.30.11; 19.30.10 probably only went into smoke
test builds and we found some problem which necessitated 19.30.11. RC1
came out very early Tuesday morning (06-25) (2am Eastern time). If we
assume this had been a blocker bug (which I still think it probably
wasn't), that gave us about...62 hours to catch it before the sign-off
happened.
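
As a quick illustration of which composes carried the change, here is a minimal Python sketch using only the version numbers quoted above. The helper names (parse_version, has_change) are made up for this example and are not part of any Fedora or anaconda tooling. Versions are compared as integer tuples, since comparing them as plain strings would sort 19.30.9 after 19.30.10.

```python
def parse_version(version):
    """Turn a dotted version such as '19.30.10' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))

# The ESP re-use change went into anaconda 19.30.10 (per the mail above).
CHANGE_LANDED = parse_version("19.30.10")

# anaconda versions in each compose, as quoted in the mail above.
composes = {"TC6": "19.30.9", "RC1": "19.30.11"}

def has_change(anaconda_version):
    """True if this anaconda version is at or after the one with the change."""
    return parse_version(anaconda_version) >= CHANGE_LANDED

carried = {name: has_change(version) for name, version in composes.items()}
for name, present in carried.items():
    print(name, "includes the change" if present else "predates the change")
```

So RC1 was the first compose available for general testing that included the new behaviour.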

That is a pretty short timeframe, indeed. If we want to identify one
specific Thing That Went Wrong here, I would say it's that we probably
shouldn't have taken a moderately significant behaviour change as late
as that. So let's look at that in a bit more detail:

https://bugzilla.redhat.com/show_bug.cgi?id=974543 is the bug that
prompted this change. It was filed on 06-14 (though we'd been aware of
the behaviour for rather longer). It was proposed as a freeze exception
issue by bcl (anaconda developer) on 06-17: that effectively means the
anaconda team was of the opinion that they wanted this change to go in.

It was reviewed for freeze exception status on 06-19. The log of the
review meeting is at
http://meetbot.fedoraproject.org/fedora-blocker-review/2013-06-19/f19final-blocker-review-7.2013-06-19-16.01.log.txt . Here are the relevant bits extracted, since it's very short:

18:53:38 <adamw> https://bugzilla.redhat.com/show_bug.cgi?id=974543
18:54:20 <Viking-Ice> dances on the limit of blocker
18:56:37 <kparal> but we should definitely vote on 974543
18:56:48 <kparal> it's proposed and patches are ready
18:57:09 <adamw> +1 on 974543
18:57:20 <jreznik> +1 FE for 974543, seems like bcl wants this one
18:57:59 <tflink> #topic (974543) Anaconda is always creating new efi system partition
18:58:02 <tflink> #link https://bugzilla.redhat.com/show_bug.cgi?id=974543
18:58:04 <tflink> #info Proposed Freeze Exceptions, anaconda, NEW
18:58:10 <adamw> tflink: the patches are not sent to anaconda-devel-list so technically not 'post'ed
18:58:11 <adamw> +1
18:58:20 <kparal> +1 FE
18:58:20 <adamw> this is completely wrong behaviour and ought to be fixed
18:58:27 <nirik> +1 FE
18:58:31 <Viking-Ice> +1
18:58:35 <dgilmore> +1 FE
18:58:36 <jreznik> +1 FE
18:59:12 <adamw> shame to put it in this late, but otoh our 'multiboot uefi' story has never worked very well so unlikely to maek things worse
18:59:32 <tflink> proposed #agreed 974543 - AcceptedFreezeException - This behavior of creating new EFI partitions is not correct and should be fixed. A tested fix would be considered past freeze
18:59:33 <adamw> at some point we're going to run into the problem of what to do if there isn't enough space in the esp but we'll burn that bridge when we get to it
18:59:33 <adamw> ack
18:59:49 <jreznik> ack
18:59:57 <nirik> ack
18:59:59 <Viking-Ice> ack
18:59:59 <tflink> #agreed 974543 - AcceptedFreezeException - This behavior of creating new EFI partitions is not correct and should be fixed. A tested fix would be considered past freeze
19:00:00 <handsome_pirate> ack
19:00:20 <kparal> adamw: the files are very small, currently

So it sailed through review with +1s from four QA folks (myself, Kamil,
Tim (implied) and Johann), one from releng (dgilmore) and one from the
program manager (jreznik). As has often been the case lately, no-one
outside of those groups bothered to show up for the meeting. It had an
implied +1 from the anaconda developers due to the fact that they had
proposed it in the first place: we put a fairly high weight on that fact
during review.

I very perfunctorily mentioned that it was somewhat dangerous to poke it
this late, but incorrectly (as it transpired) thought it was unlikely to
make things any worse; I'm pretty sure at the time I just did not think
of the possibility of anything like this bug arising.

The fix was committed to anaconda git one day later:

https://git.fedorahosted.org/cgit/anaconda.git/commit/?h=f19-branch&id=03be63fabad9aa52c7a19c68f289b248aa793bcc

committer David Lehman <dlehman at redhat.com> 2013-06-20 20:01:39 (GMT)

We usually have a second 'gate' on FE issues at the point of composing
an actual release image: the person requesting the compose (which is
usually me) and the person doing it (which is usually dgilmore) tend to
have a chat if either feels that it might now be too late to take one of
the FE fixes safely. But this unfortunately doesn't really apply to
anaconda changes, because they're basically a package deal. In a really
extraordinary case we can go to the anaconda devs and ask them to back
out a change we really don't want, but that's pretty rare and we didn't
consider it in this case. So effectively we were committed to taking
this change the moment it was committed to f19-branch in git.

It's kind of interesting that we didn't get a compose that included the
change for four days after that. Koji tells us that anaconda 19.30.10
was built Mon, 24 Jun 2013 21:50:49 UTC and 19.30.11 Mon, 24 Jun 2013
23:27:23 UTC, so a delay between .10 and .11 wasn't the problem. Instead
the delay was between the change being committed to git and a new
anaconda build happening at all - four days, which is quite a lot for
this late in the release cycle. But most of the explanation for that is
fairly mundane: the weekend. The commit was done in the middle of the
day on a Thursday. I don't recall why a build wasn't done on the Friday:
I think it may have been that we felt TC6 was a build we wanted to get a
full validation run done on, and we wanted the next build to be an RC,
and we had other blocker bugs to fix, so nobody felt any urgency to get
a new compose out for testing. Whatever the reason, no build was done on
the Friday. The anaconda team does not work weekends, so the build happened
on the next work day, Monday, and the RC1 compose happened shortly
after.
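
For anyone who wants to check the arithmetic, here is a small Python sketch that works out the two gaps from the timestamps quoted above (the git commit time and the two Koji build times). It assumes the GMT time on the commit can be treated as UTC; the utc() helper is just something made up for this example.

```python
from datetime import datetime, timezone

def utc(text):
    """Parse a 'YYYY-MM-DD HH:MM:SS' timestamp and mark it as UTC."""
    parsed = datetime.strptime(text, "%Y-%m-%d %H:%M:%S")
    return parsed.replace(tzinfo=timezone.utc)

# Timestamps as quoted in the mail above.
commit = utc("2013-06-20 20:01:39")    # fix committed to f19-branch
build_10 = utc("2013-06-24 21:50:49")  # anaconda 19.30.10 built in Koji
build_11 = utc("2013-06-24 23:27:23")  # anaconda 19.30.11 built in Koji

commit_to_build = build_10 - commit
between_builds = build_11 - build_10

print("commit day:", commit.strftime("%A"))
print("first build day:", build_10.strftime("%A"))
print("commit -> 19.30.10:", commit_to_build)
print("19.30.10 -> 19.30.11:", between_builds)
```

The gap from commit to first build comes out at a little over four days, spanning the weekend, while the gap between the two builds is well under two hours.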

So it's a bit hard to say that this or that party clearly made a
mistake, but I think it'd be reasonable to say that ultimately 'we' -
QA, releng, anaconda devs - may not have made the best call in deciding
to take that change at a point when we were pretty far along in
pre-release stabilization. But anaconda FEs are always a somewhat tricky
call, and there was a clear and substantial upside to this one (it's
really not right at all to go around creating new ESPs on systems that
already have them, and we were aware of cases where this just messed up
boot, possibly even of *other* OSes). So I don't think it was
egregiously the wrong decision. In practice it does not seem to have
turned out for the best, though.
-- 
Adam Williamson
Fedora QA Community Monkey
IRC: adamw | Twitter: AdamW_Fedora | identi.ca: adamwfedora
http://www.happyassassin.net