Jeffrey/All:
I've been corresponding off-list with Elliot with regard to upgrading
db1. At his suggestion, I'm going to run my proposed process by
everyone for feedback. I went through a production DB server upgrade a
while back and this process reflects what I did then and lessons learned
during the process.
I also had Slony in this mix, which resulted in a few more steps. From
what I've read on the wiki pages regarding the current Fedora
infrastructure setup, there is only 1 DB server, a SPOF. It may be
worth the time to start a separate conversation about some sort of
replication or clustered setup, which a beast unto itself.
I have a master/slave replication setup with Slony which works
quite well. You still have the SPOF with the master but at least you
can load balance between the slaves and still have read access going if
the master goes down. With all of the research I've done, I've not
found anything which is production ready and allows a multi master
scenario for Postgres, only master/slave scenario with replication, like
Slony. The closest thing I found was PGCluster but it was very flaky,
at least on 8.x, the last time I played with it late last year.
PGCluster was also very hardware intensive, meaning it takes like 4
machines to run a setup without a SPOF.
The following assumes the desire for as minimal as possible downtime and
as a result is a little more involved than if more downtime were
tolerable. I welcome comments and refinements to the process.
Now, onto the upgrade process.
*** Make a DB and filesystem backup ***
Ensure that all necessary config files are archived so they can be
quickly reinstalled via kickstart later
1) Setup all apps to connect to DB using a host alias in /etc/hosts
2) Setup another server as a temporary DB server
Note: An option here is to copy data to the temp DB server and do an
immediate cut over to the temp server after the data has been copied but
you run the risk of data being out of sync between the master and temp
server. Danger, Will Robinson!
3) Temporarily disable apps hitting DB
4) pg_dump DBs and template0, users, groups, etc over to temp server
5) Test to ensure DB is up and accepting connections
6) Change the alias entry in /etc/hosts on hosts running apps to the
record for the temp DB server
7) Re-enable apps
8) Re-install and configure OS on old DB server via kickstart
9) Temporarily disable apps hitting DB
10) pg_dump DBs and template0, users, groups, etc over to master server
11) Test to ensure master DB is up and accepting connections
12) Change the alias entry in /etc/hosts on servers running apps back to
the record for the master DB server
There, my 12 step program for a DB upgrade. :-)
I had a few more Slony related steps which required shuffling the app
servers between Slony slaves and the master but the above it basically
the process.
I made a few decisions/assumptions during my process.
In my case since 99% of DB hits were reads,I still had my Slony slaves
accepting read requests. I was able to do this since I maintain both
read and write DB handles in the SoftSwitch. Some apps allow a
"degraded" mode where data is "read only" just for this sort of thing,
I'm not sure if this is the case here or not.
During the time I disabled the apps to copy over the DB to the temp
server, users just couldn't login to the web portal to change their
settings, which are written to the master, for a few minutes. I had to
ask, would this really be an issue for 5 min at 3am CST on a Sunday
morning? (assuming that the DBs can be copied in the span of 5 min, I
have a VLAN'd GigE management network for this sort of thing) The same
downtime was experienced when I switched back over to the master DB.
Since I'm assuming we're upgrading from Postgres 7.x we'll definitely
want to do a pg_dump/pg_restore since there are some inherent
differences in the data structures on the disk. In the past I've just
stopped postgres, copied over /var/lib/pgsql to the new server and
started postgres there, and called it good but you can't do that when
upgrading from 7.x to 8.x. Since the data itself has to be migrated
from the 7.x format to the 8.x format at some point, there is probably
going to have to be some measure of downtime.
Otherwise you get into having to compare data in 2 different databases
to see if anything changed and manually replicating those changes to the
master copy. For me, the 5 min of partial downtime was _much_ cheaper.
I look forward to discussion on this.
Cheers,
-Curt