Unplanned Proxy Outage - 2011-08-19 16:30 UTC

Sat Aug 20 02:45:45 UTC 2011

Summary of Event
================

Tonight there was an unplanned outage of two proxy servers (proxy01 and
proxy02).  The proxies were unresponsive and needed to be rebooted in order
to come back online.  Proxy01 being down caused a cascade of other issues
that should have had very little end-user impact.  As far as we know, the
applications on admin.fp.o stayed up but appeared very slow, and the wiki
stayed up for reading but logins failed.  Explanation follows.

Proxy01 is the only proxy server that is used by app servers (web apps,
cronjobs, etc) in phx2 that need to talk to our web applications in phx2.
This was set up because the router that handles traffic into and out of phx2
does not allow us to "hairpin", that is, to send a request from phx2 to an
external ip address that then routes back to a server in phx2.  As
currently implemented, we have an /etc/hosts entry that points
admin.fedoraproject.org at the internal ip address of proxy01 in phx2.
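To make the single point of failure concrete, here is a minimal sketch of
how a hosts-file override behaves from a client's point of view.  It is an
illustration only: the address below is a made-up placeholder (the real
internal ip of proxy01 is not given in this report), and the function names
are hypothetical.

```python
# Placeholder address standing in for proxy01's internal ip (an assumption).
HOSTS_OVERRIDE = {
    "admin.fedoraproject.org": "10.0.0.1",
}


def resolve(hostname, hosts=HOSTS_OVERRIDE, public_lookup=None):
    """Return the address a phx2 client would connect to.

    A hosts-table entry wins over any other lookup, which is how an
    /etc/hosts entry behaves when files are consulted before DNS.
    """
    if hostname in hosts:
        return hosts[hostname]
    if public_lookup is not None:
        return public_lookup(hostname)
    raise LookupError(hostname)


def reachable_targets(hostname, down, hosts=HOSTS_OVERRIDE):
    """List the addresses still usable for hostname, given a set of dead ones."""
    addr = resolve(hostname, hosts=hosts)
    return [] if addr in down else [addr]
```

With exactly one address pinned per name, marking that address as down
leaves the client with nothing to fall back to, which matches what the
phx2 hosts saw during the outage.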

When proxy01 went down, things in PHX2 that needed to talk to
admin.fedoraproject.org were no longer able to get the data they needed.
For the wiki, this meant that login attempts during the outage could not
verify the password in fas.  For the TurboGears apps on
admin.fedoraproject.org the situation was worse.  TG1 apps' identity
management depends on visit tracking to work.  Visit tracking hits fas for
every request.  This means that no page could be served for the TG1 apps
from the phx2 app servers.

We have two app servers that reside outside of phx2.  Because of network
latency between these servers and the database server in phx2, these servers
are configured as backups for the servers in phx2 and do not handle requests
unless phx2 is unable to.  The remaining proxy servers detected that the app
servers within phx2 were down and properly switched over to the app servers
outside of phx2, so there was no apparent outage for people trying to use
admin.fedoraproject.org, although responses would have been drastically
slower.

Looking at the haproxy status page for proxy03 during the outage we noticed
that only one of the two app servers outside of phx2 (app05 at ibiblio) was
handling traffic.  app06 (at telia) was not.  We are not sure why this is.
One possibility is that telia's network latency is just too high so haproxy
decided that app06 was also down and did not pass traffic to it.
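The failover described above can be modelled with a small sketch.  This is
a simplified model and not our actual haproxy configuration: the phx2
server names are placeholders, and the all_backups flag mirrors haproxy's
documented "option allbackups".  The haproxy manual says that by default
only the first operational backup server is used once all non-backup
servers are down.  Whether that explains app06 sitting idle has not been
verified against our config; it is only something worth checking.

```python
def pick_servers(servers, all_backups=False):
    """Return the names of the servers that would receive traffic.

    servers is an ordered list of dicts with keys 'name', 'up' and 'backup'.
    Backups are only considered when no regular server is up.
    """
    primaries = [s["name"] for s in servers if s["up"] and not s["backup"]]
    if primaries:
        return primaries
    backups = [s["name"] for s in servers if s["up"] and s["backup"]]
    # Without all_backups, only the first live backup in config order is used.
    return backups if all_backups else backups[:1]


# app01/app02 are placeholder names for the phx2 app servers.
during_outage = [
    {"name": "app01", "up": False, "backup": False},
    {"name": "app02", "up": False, "backup": False},
    {"name": "app05", "up": True, "backup": True},
    {"name": "app06", "up": True, "backup": True},
]
print(pick_servers(during_outage))
print(pick_servers(during_outage, all_backups=True))
```

In this model app06 stays idle even though it is healthy, unless all
backups are explicitly enabled.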

Action Items
============

There are some open questions to try to resolve:

* Why did proxy01 and proxy02 die?  A brief look at the logs has not
  revealed a cause for this.
* Why didn't app06 take up any of the slack when haproxy started passing
  traffic to the backups?

We have identified one means of mitigating this in the future:

If we ran internal DNS for phx2 then we could have admin.fedoraproject.org
resolve to different proxy servers (using internal ip addresses for the
proxies inside of PHX2).  This should remove the single point of failure
(SPOF) on proxy01.  We have not yet determined whether we'd need to run more
proxy servers inside of PHX2 or whether hairpinning would stop being an issue
if we used proxy servers outside of phx2.
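As a rough sketch of what the DNS approach buys us: if the name resolves
to several proxy addresses, a client can move on to the next one when the
first is dead.  The function name, port and addresses here are
hypothetical, and the sketch assumes the client retries across addresses,
which not every HTTP client does.

```python
import socket


def connect_first_available(addresses, port=443, timeout=2.0,
                            connect=socket.create_connection):
    """Try each proxy address in order and return (address, connection).

    connect is injectable so the logic can be exercised without a network.
    """
    last_error = None
    for addr in addresses:
        try:
            return addr, connect((addr, port), timeout=timeout)
        except OSError as exc:
            # This proxy is unreachable; remember why and try the next one.
            last_error = exc
    raise ConnectionError(f"no proxy reachable: {last_error}")
```

With a single pinned address the loop has nothing to fall back to, which
is the situation the /etc/hosts entry puts us in today.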

-Toshio