pkgdb2 post-mortem and strategy for future deployments

Wed Jun 4 18:44:54 UTC 2014

This came up in a different venue and pingou and I have continued to talk
about it.  This list seemed like the right place to bring the discussion,
though.

Some observations:

* Pkgdb2 and a call for testing in staging were announced well in advance of
  the deployment to production (good) but not everyone understood that we
  were going to be breaking API (bad).

* There were people inside of fedora infrastructure and outside of
  infrastructure who were surprised by the API break.  There were also some
  community members and infrastructure members who heeded the call for
  testing and both gave feedback and ported before the deployment.

* There was a FAS2 update that pkgdb2 depended upon.  That was also pending
  in stg for a long time and also had some minor API changes (IIRC, all
  unintentional.  I hotfixed one of them that was simply a bug last week).
  These also caused issues for some scripts.

* Unexpected problems: we had things that we didn't know used the pkgdb API,
  things that weren't tested in stg because stg couldn't replicate that part
  of production, and things that were ported but mistakes caused the ported
  scripts to not be deployed or to point at stg instead of production.
  I saw that we had the right people on IRC throughout the day working on
  analyzing and patching all of the broken things.  However, this was
  somewhat by accident and some of those people were surprised that they
  spent their day doing this.

Some ideas for doing major deployments in the future:

1: We have to make people aware when a new deployment means API breaks.
  * Be clear that the new deployment means API breaks in every call for
    testing.  Send announcements to the infrastructure list and, depending
    on the service, to the devel list.
  * Have a separate announcement besides the standard outage notification
    that says that an API breaking update is planned for $date.
  * When we set a date for the new deployment, discuss it at least once in
    a weekly infrastructure meeting.
  * See also the solution in #3 below

2: It would be really nice for people to do more testing in stg.
  * Increase rube coverage.  rube does end-to-end testing so it's better than
    unittests, which try to be small and self-contained, at catching
    cross-app issues caused by API changes.
    - A flock session where everyone (or every dev) in infra gets to write
      one rube test so we get to know the framework
  * Run rube daily
    - Could we run rube in an Xvfb on an infrastructure host?
  * Continue to work towards a complete replica of production in the stg
    environment.

3: "Mean time to repair is more important than mean time between failure."
   It seems like anytime there's a major update there are unexpected things that
   break.  Let's anticipate the unexpected happening.
  * Explicitly plan for everyone to spend their day firefighting when we
    make a major new deployment.  If you've already found all the places
    your code is affected and pre-ported it and the deployment goes smoothly
    then hey, you've got 6 extra working hours to shift back to doing other
    things.  If it's not smooth, then we've planned to have the attention of
    the right people for the unexpected difficulties that arise.
  * As part of this, we need to identify people outside of infrastructure
    that should also be ready for breakage.  Reach out to rel-eng, docs, qa,
    cvsadmins, etc if there's a chance that they will be affected.

4: Related to the FAS release: Buggy code happens.  How can we make it
   happen less?
  * More unittests would be good; however, we know from experience with bodhi
    that unittests don't catch a lot of things that are changes in behaviour
    rather than true "bugs".  Unexpected API changes that cause people
    porting pain can be as simple as returning None instead of an empty list
    which causes a no-op iteration in running code to fail while the
    unittests survive because they're checking that "no results were
    returned".
  * Pingou has championed making API calls and WebUI calls into separate
    URL endpoints. I think that coding style makes it easier to control
    bugs related to updating the webui while trying to preserve the API so
    we probably want to move to that model as we move onto the next major
    version of our apps.
  * Not returning json-ified versions of internal data structures (like
    database tables) but instead parsing the results and returning
    a specific structure would also help divorce internal changes from
    external API.
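
A minimal sketch of the None-versus-empty-list case and of returning a
specific structure, in plain Python.  Everything here is hypothetical and
made up for illustration (the DB rows, packages_old, packages_new,
api_packages); none of it is taken from pkgdb2 or FAS code.

```python
# Hypothetical in-memory stand-in for a database table; the names and
# columns are invented for illustration only.
DB = [
    {"id": 1, "name": "example-pkg", "owner": "alice", "internal_flags": 3},
]


def packages_old(owner):
    """Old behaviour: always a list, empty when nothing matches."""
    return [row for row in DB if row["owner"] == owner]


def packages_new(owner):
    """After a refactor: None when nothing matches (the accidental change)."""
    rows = [row for row in DB if row["owner"] == owner]
    return rows or None


def weak_test(func):
    """Only checks that "no results were returned"; true for [] and None."""
    return not func("nobody")


def contract_test(func):
    """Pins the type that callers depend on; false for None."""
    return func("nobody") == []


def caller_script(func):
    """A typical consumer: a no-op loop over [], a TypeError over None."""
    names = []
    for row in func("nobody"):
        names.append(row["name"])
    return names


def api_packages(owner):
    """Explicit response: fixed keys, always a list, no internal columns."""
    rows = packages_new(owner) or []
    return {"packages": [{"name": r["name"], "owner": r["owner"]} for r in rows]}
```

weak_test() is true for both versions, so a unittest written that way
survives the change, while caller_script(packages_new) raises TypeError.
contract_test() tells the two apart.  api_packages() builds its response
by hand, so a new internal column or a None from the lower layer does not
leak out to API consumers.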

What should we apply this to?
* Probably can skip:
  - Things that we don't think have API breaks
  - Things that are minor releases (hopefully these would correlate with
    not having API breaks :-)
  - Leaf services that are not essential to releasing Fedora.
    + ask, nuancier, elections, easyfix, badges, paste
    + There are a lot of borderline cases too -- is fedocal essential enough
      to warrant being under this policy?  Since the wiki is used via its
      API should that fall under this as well?

Comments, thoughts, other ideas?

Do we need to "ratify" something like this at a meeting?

What's the next app deploy where we'll want to enact this?
Maybe bodhi2 ;-)?

-Toshio