Non-image blocker process change proposal

Fri Nov 20 12:16:23 UTC 2015

> > My suggestion would be that we make sure 'blockerbugs' includes
> > lists of each type of blocker. Ahead of and at Go/No-Go meetings,
> > we would want to have a formal assurance from the person
> > responsible for fixing the bug that the fix would be provided by a
> > certain time - say, one day or two days ahead of the release date -
> > and it would be QA's responsibility to ensure the updates are
> > tested promptly, and releng's responsibility to ensure they are
> > pushed on time after being tested. I would suggest the Program
> > Manager ought to have overall responsibility for keeping an eye on
> > the 0Day and Stable blocker lists and making sure the maintainer,
> > QA, and releng all did their jobs on time.
> 
> The biggest issue is this, I think. We probably need to encode
> "Special Blockers" into the Go/No-Go process. I don't think that
> assurance that it will be fixed on time is necessarily good enough.
> Particularly given the time that it takes stable updates to make it to
> the mirrors, I'd say that we probably want to say that any such
> special blockers have to be queued for stable before the Go/No-Go
> decision is made. (This may in some cases mean *during* the Go/No-Go
> meeting, of course.)

Well, here's our latest mess-up:
https://bodhi.fedoraproject.org/updates/FEDORA-2015-e00b75e39f
dnf-plugin-system-upgrade-0.7.0-1.fc22 had enough karma for stable on Oct 29, which was Go/No-Go day, so it was considered "resolved". However, it was pushed to testing only on Nov 2 (4 days later) and to stable on Nov 3 (5 days later!), which was the public release day. Since mirrormanager is configured to treat mirrors serving even the last-but-one metadata as up to date (i.e. metadata up to 1-2 days old; relengs can provide a more precise value), many of our users upgraded on Nov 3 and Nov 4 using an older version of system-upgrade, which broke their systems. Just read the comments:
https://fedoramagazine.org/upgrading-from-fedora-22-to-fedora-23/#comments
I was very unhappy. We had solved most of the issues, with a lot of work, and yet a large group of people was hit by those old, long-resolved problems just because of bad timing and slow repo pushes (for whatever reason).

So, that update was "queued for stable before the Go/No-Go" as you proposed, and yet we still failed to deliver it. If we really want to avoid such problems in the future, we either need to insist on "pushed to stable by Go/No-Go, no exceptions", or we need another check on release day that verifies all required builds were pushed to stable at least 2 days earlier; if not, we do not announce the release and wait a few more days. The first approach is slightly impractical (we don't want to slip a whole week when the fix might land in 2 more days; and do we lift the final freeze or not?). The second approach is confusing for the media (the media announce we're Go, and then nothing happens on the proclaimed release day).
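The release-day check from the second approach could look roughly like this. It is only a sketch: `BlockerUpdate`, `late_blockers`, `release_day_check`, the 2-day `MIRROR_BUFFER`, the package names and the dates are all hypothetical, and in practice the stable push dates would have to be pulled from Bodhi, whose API is not modelled here.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional

# Assumed time for a stable push to reach (nearly) all mirrors.
MIRROR_BUFFER = timedelta(days=2)


@dataclass
class BlockerUpdate:
    nvr: str                     # name-version-release of the fixed build
    stable_date: Optional[date]  # day it was pushed to stable; None if not yet


def late_blockers(updates: List[BlockerUpdate], release_day: date,
                  buffer: timedelta = MIRROR_BUFFER) -> List[str]:
    """Return builds that were NOT in stable at least `buffer` before release_day."""
    cutoff = release_day - buffer
    return [u.nvr for u in updates
            if u.stable_date is None or u.stable_date > cutoff]


def release_day_check(updates: List[BlockerUpdate], release_day: date) -> bool:
    """True if the release may be announced; otherwise report what holds it."""
    late = late_blockers(updates, release_day)
    if late:
        print("Do not announce yet; not in stable long enough:", ", ".join(late))
        return False
    print("All 0day blockers have had time to reach the mirrors.")
    return True


if __name__ == "__main__":
    release_day = date(2016, 5, 17)  # hypothetical release date
    updates = [
        BlockerUpdate("foo-1.0-2.fc24", date(2016, 5, 14)),  # early enough
        BlockerUpdate("bar-2.3-1.fc24", release_day),       # pushed on release day
        BlockerUpdate("baz-0.9-3.fc24", None),               # still in testing
    ]
    release_day_check(updates, release_day)
```

An update pushed on release day itself, like the system-upgrade case above, is flagged as late, so the announcement would be held instead of going out while mirrors still serve the old build.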

What I see as a potential solution here is decoupling the tasks that need to wait for the 0day blockers from those which don't. So, at the Go/No-Go meeting, we could decide that it is No-Go in general, but that the composes are now final and can be uploaded to the proper locations for mirrors to pick them up. I don't know exactly what else relengs need to do, but I guess there will be other tasks that can be done at that point. Then, 2-3 days later, we hold another Go/No-Go, where we confirm that even the 0day blockers have been addressed and pushed to stable, declare the whole release Go, and publish the announcement immediately, the next day, or whenever is appropriate (bearing in mind that there should be a 2-day period after the 0day blockers are pushed to stable).
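A rough sketch of the two-stage decision, assuming hypothetical task names, package names and dates: tasks are split by whether they depend on the 0day blockers, and the earliest announcement day is the later of the second meeting and the last stable push plus an assumed 2-day mirror buffer.

```python
from datetime import date, timedelta
from typing import Dict, List, Optional, Tuple

MIRROR_BUFFER = timedelta(days=2)  # assumed mirror propagation time

# Hypothetical release tasks; True means "must wait for the 0day blockers".
TASKS: Dict[str, bool] = {
    "mark composes final": False,
    "upload composes for mirrors to pick up": False,
    "declare the whole release Go": True,
    "publish the release announcement": True,
}


def split_tasks(tasks: Dict[str, bool]) -> Tuple[List[str], List[str]]:
    """Split into (can run after the first No-Go, must wait for 0day fixes)."""
    now = [name for name, waits in tasks.items() if not waits]
    later = [name for name, waits in tasks.items() if waits]
    return now, later


def earliest_announcement(meeting_day: date,
                          stable_dates: Dict[str, Optional[date]],
                          buffer: timedelta = MIRROR_BUFFER) -> Optional[date]:
    """Earliest day the release can be announced after the second Go/No-Go.

    stable_dates maps each 0day blocker build to the day it reached stable,
    or None if it has not been pushed yet; in that case there is no Go.
    """
    if any(d is None for d in stable_dates.values()):
        return None
    latest_push = max(stable_dates.values(), default=None)
    if latest_push is None:  # no 0day blockers at all
        return meeting_day
    return max(meeting_day, latest_push + buffer)


if __name__ == "__main__":
    now, later = split_tasks(TASKS)
    print("After first Go/No-Go:", now)
    print("After second Go/No-Go:", later)
    pushes = {"foo-1.0-2.fc24": date(2016, 5, 11),
              "bar-2.3-1.fc24": date(2016, 5, 13)}
    print(earliest_announcement(date(2016, 5, 12), pushes))  # 2016-05-15
```

Returning `None` while any blocker is still unpushed mirrors the idea that the second meeting simply stays No-Go and is repeated later, rather than picking a date in advance.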

WDYT? Reasonable? Complicated? Bonkers? Off the mark?