Background: Shortly after we started scaling out to remote sites, we noticed that parts of some of our applications had issues over the VPN link. The initial ticket was a bit off as to what we were looking at, but:
https://fedorahosted.org/fedora-infrastructure/ticket/281
To try to keep this short: after lots of tests we discovered that applications that run lots of queries are the core of our issue. We have OK bandwidth to all of our sites, but latency is high enough that it's become too expensive to actively run applications at these sites. Every query that gets run at a remote site seems to take a minimum of 0.3 to 0.5 seconds for the complete round trip. As we mature and as features land, lots of our apps need more queries. We can and should go through and make these more efficient, but that's going to happen over a long time. We just don't have the number of people we need to track trends for each page of each application and convert all the SQL to its most efficient form.
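To put rough numbers on that, here is a quick back-of-the-envelope sketch. It assumes a page runs its queries one after another, so each one pays the full round trip. The function name and the query counts are made up for illustration; only the 0.3 to 0.5 second range comes from our tests.

```python
def page_wait_seconds(num_queries, round_trip_seconds):
    """Estimate how long a page waits on the data layer, assuming the
    queries run sequentially and each pays one full round trip."""
    return num_queries * round_trip_seconds


# Per-query round-trip range measured from the remote sites.
LOW_RTT = 0.3
HIGH_RTT = 0.5

# Hypothetical query counts per page, not measured values.
for num_queries in (1, 10, 50):
    low = page_wait_seconds(num_queries, LOW_RTT)
    high = page_wait_seconds(num_queries, HIGH_RTT)
    print(f"{num_queries:>3} queries: {low:.1f}s to {high:.1f}s waiting on the data layer")
```

The point is just that the cost grows linearly with the query count, so bandwidth doesn't help: a page that is fine with a handful of queries becomes unusable once it needs dozens.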
Instead, we're going to convert all of our remote application servers to passive/backup servers. Up until now we've generally been using our remote sites to scale load. Now, though, we can't really do that. The remote sites still play an important role in keeping us fairly HA (our SPOF is still our data layer). Having a multi-master data layer of PostgreSQL and MySQL just won't be a win for us at our size at this time.
So what does this mean for the future? Scaling at our app layer will just have to happen in a centralized location. This won't scale forever, but I think for the near and middle term of Fedora's future it's what we're going to have to bank on. We can continue to focus on better caching at our proxy layer, which will continue to be active at each remote site.
All of what I've written here is probably very obvious to most of you, and it is. The difference now is we have some much better data concerning the interaction between our app servers and the data layer and better metrics for how long those interactions take. So darnit, I'm not going to spend all the time I did with testing and metrics and not write some long summary about it! :)
-Mike