pkgdb2 post-mortem and strategy for future deployments
Toshio Kuratomi
a.badger at gmail.com
Wed Jun 4 18:44:54 UTC 2014
This came up in a different venue and pingou and I have continued to talk
about it. It seemed that this was the right place to bring the discussion,
though.
Some observations:
* Pkgdb2 and a call for testing in staging was announced well in advance of
the deployment to production (good) but not everyone understood that we
were going to be breaking API (bad).
* There were people inside of fedora infrastructure and outside of
infrastructure who were surprised by the API break. There were also some
community members and infrastructure members who heeded the call for
testing and both gave feedback and ported before the deployment.
* There was a FAS2 update that pkgdb2 depended upon. That was also pending
in stg for a long time and also had some minor API changes (IIRC, all
unintentional. I hotfixed one of them that was simply a bug last week).
These also caused issues for some scripts.
* Unexpected problems: we had things that we didn't know used the pkgdb API,
things that weren't tested in stg because stg couldn't replicate that part
of production, and things that were ported but mistakes caused the ported
scripts to not be deployed or to point at stg instead of production.
I saw that we had the right people on IRC throughout the day working on
analyzing and patching all of the broken things. However, this was
somewhat by accident, and some of those people were surprised that they
spent their day doing this.
Some ideas for doing major deployments in the future:
1: We have to make people aware when a new deployment means API breaks.
* Be clear that the new deployment means API breaks in every call for
testing. Send announcements to infrastructure list and depending on the
service to devel list.
* Have a separate announcement besides the standard outage notification
that says that an API breaking update is planned for $date
* When we set a date for the new deployment, discuss it at least once in
a weekly infrastructure meeting.
* See also the solution in #3 below
2: It would be really nice for people to do more testing in stg.
* Increase rube coverage. rube does end-to-end testing, so it's better at
catching cross-app issues caused by API changes than unittests, which
try to be small and self-contained.
- A flock session where every developer in infra gets to write one rube
test so we get to know the framework
* Run rube daily
- Could we run rube in an Xvfb on an infrastructure host?
* Continue to work towards a complete replica of production in the stg
environment.
3: "Mean time to repair is more important than mean time between failure."
It seems like any time there's a major update, unexpected things break.
Let's anticipate the unexpected happening.
* Explicitly plan for everyone to spend their day firefighting when we
make a major new deployment. If you've already found all the places
your code is affected and pre-ported it and the deployment goes smoothly
then hey, you've got 6 extra working hours to shift back to doing other
things. If it's not smooth, then we've planned to have the attention of
the right people for the unexpected difficulties that arise.
* As part of this, we need to identify people outside of infrastructure
that should also be ready for breakage. Reach out to rel-eng, docs, qa,
cvsadmins, etc if there's a chance that they will be affected.
4: Related to the FAS release: Buggy code happens. How can we make it
happen less?
* More unittests would be good; however, we know from experience with bodhi
that unittests don't catch a lot of things that are changes in behaviour
rather than true "bugs". An unexpected API change that causes porting
pain can be as simple as returning None instead of an empty list, which
makes a no-op iteration in running code fail while the unittests
survive because they're only checking that "no results were returned".
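To make that failure mode concrete, here is a small sketch (the function
name and the "no matches" scenario are hypothetical, not taken from any
real app):

```python
def search_packages(pattern):
    """Hypothetical API call: it used to return [] when nothing matched,
    but a behaviour change now makes it return None instead."""
    matches = []  # imagine a query that found nothing
    return matches or None  # the accidental API change: None, not []

# A unittest in this style still passes -- it only checks "no results":
assert not search_packages("no-such-package")

# But running code that iterates over the result now blows up:
try:
    for pkg in search_packages("no-such-package"):
        print(pkg)
except TypeError as err:
    # Iterating None raises TypeError; previously this loop was a no-op.
    print("caller broke:", err)
```

The unittest and the caller disagree about the contract, and only the
caller notices, which is exactly why these bugs surface on deployment day.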
* Pingou has championed making API calls and WebUI calls into separate
URL endpoints. I think that coding style makes it easier to control
bugs related to updating the webui while trying to preserve the API so
we probably want to move to that model as we move onto the next major
version of our apps.
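A framework-free sketch of that separation (all names here are made up
for illustration; a real app would use its web framework's routing):

```python
# Shared query layer used by both faces of the app.
def _get_package(name):
    # Stand-in for a real database lookup.
    return {"name": name, "summary": "An example package"}

# API endpoint: returns the documented JSON-able contract that
# scripts depend on. Changes here are deliberate API changes.
def api_package(name):
    pkg = _get_package(name)
    return {"name": pkg["name"], "summary": pkg["summary"]}

# WebUI endpoint: free to rework its markup at any time without
# touching api_package() or the contract it exposes.
def web_package(name):
    pkg = _get_package(name)
    return "<h1>%(name)s</h1><p>%(summary)s</p>" % pkg
```

Because the two endpoints only share the query layer, a webui-driven
refactor can't silently change what API consumers receive.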
* Not returning json-ified versions of internal data structures (like
database tables) but instead parsing the results and returning
a specific structure would also help divorce internal changes from
external API.
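In code, that might look like the following (the column names are
hypothetical):

```python
def to_public(row):
    """Map an internal database row to the documented API structure.

    Renaming an internal column or adding new ones doesn't change what
    callers see; only an explicit edit to this mapping does.
    """
    return {
        "name": row["name"],
        "point_of_contact": row["point_of_contact"],
    }

# Imagine a raw row with internal fields callers should never see:
row = {"id": 42, "name": "guake",
       "_internal_flags": 7, "point_of_contact": "pingou"}
print(to_public(row))
```

Returning `to_public(row)` instead of the row itself keeps schema churn
out of the external API.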
What should we apply this to?
* Probably can skip if:
- Things that we don't think have API breaks
- Things that are minor releases (hopefully these would correlate with not having API breaks :-)
- Leaf services that are not essential to releasing Fedora.
+ ask, nuancier, elections, easyfix, badges, paste
+ There are a lot of borderline cases too -- is fedocal essential enough
to warrant being under this policy? Since the wiki is used via its
API, should that fall under this as well?
Comments, thoughts, other ideas?
Do we need to "ratify" something like this at a meeting?
What's the next app deploy where we'll want to enact this?
Maybe bodhi2 ;-)?
-Toshio