MirrorManager outage root cause
Matt_Domsch at dell.com
Thu Jan 31 05:41:39 UTC 2013
At approximately 01:00 UTC today, clients requesting the mirror list
started getting timeouts, then HTTP 503 errors generated by the
MirrorManager mirror list processes. On the Fedora Infrastructure
application servers, the loads spiked, the out-of-memory killer
started firing, and chaos ensued.
The proximate cause of this failure appears to be invalid data in
the MirrorManager database - specifically, the bandwidth value for
several servers was NULL, which should not be possible. I say
proximate, not root, because I have not yet been able to reproduce the
failure using the invalid data; that remains to be done. However,
since fixing the invalid data, we have not seen further failures.
There are fixes in the MirrorManager 1.4 (unreleased) branch to
prevent invalid data from being stored, but these were not present in
the 1.3 version currently in production. Additional fixes have been
put into the MM 1.4 branch tonight to further ensure this type of
invalid data cannot affect the mirrorlist_server process.
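To illustrate the general idea of guarding a consumer against a NULL
bandwidth, here is a minimal sketch. It is not the code from the MM 1.4
branch; the function names, the (host_id, bandwidth) row shape, and the
fallback value of 1 are all assumptions made up for this example.

```python
# Hypothetical fallback value; not taken from MirrorManager itself.
DEFAULT_BANDWIDTH = 1


def sanitize_bandwidth(value, default=DEFAULT_BANDWIDTH):
    """Return a positive integer bandwidth.

    Falls back to `default` when the value is NULL (None in Python),
    non-numeric, or not a positive number.
    """
    try:
        bandwidth = int(value)
    except (TypeError, ValueError, OverflowError):
        # int(None) raises TypeError, int("abc") raises ValueError,
        # int(float("inf")) raises OverflowError.
        return default
    return bandwidth if bandwidth > 0 else default


def sanitize_hosts(rows):
    """Clean an iterable of (host_id, bandwidth) tuples.

    The tuple shape is an assumption standing in for whatever a real
    database query would return.
    """
    return [(host_id, sanitize_bandwidth(bw)) for host_id, bw in rows]
```

With this in place, sanitize_hosts([(1, None), (2, 1000)]) yields a
list in which host 1 carries the fallback bandwidth instead of None, so
later arithmetic on the value cannot trip over a NULL.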
Thanks to Stephen Smoogen and Kevin Fenzi for their quick work to
identify the failing systems and minimize the impact to other Fedora