Lessons learned: Initial check/step list for updates
Stephen John Smoogen
smooge at gmail.com
Fri Sep 10 01:47:22 UTC 2010
0. Plan time on infrastructure at lists.fedoraproject.org
1. Open a ticket on infrastructure for the downtime.
   Updates will occur during the day.
   Reboots will occur during the evening.
2. Send email to devel-announce, announce, infrastructure
3. Update servers during working hours and work out issues in ticket.
   ** releng updates the following boxes:
      cvs01, pkgs01, nfs01, bnfs01, bxen*,
      x86-*, ppc*, koji*, db03, xb-01,
      compose-*, sign-vault01
4. Change DNS to turn off proxy on bodhost01 (or similar external
   proxy server).
5. Reboot bodhost01
6. Confirm proxy is working on bodhost/fix issues.
7. Change proxy DNS to point only at bodhost01
8. Turn off nagios for servers.
9. Turn off nagios-external for services.
10. Reboot order matters.
11. releng deals with the boxes listed above unless told otherwise.
12. Reboot the database servers first
    xen15: db02
    xen12: db01
13. reboot PHX2 boxes
    xen03:
    xen04:
    xen06:
    xen07:
    xen09:
    xen10:
    xen11:
    xen13:
    backup01:
14. reboot Outside boxes (can be in parallel to PHX2)
    cnode01:
    cnode02:
    cnode03:
    ibiblio01:
    internetx01:
    osuosl01:
    people01:
    serverbeach1:
    serverbeach2:
    serverbeach3:
    serverbeach4:
    serverbeach5:
    telia1:
    tummy1:
15. reboot bastion.fedoraproject.org
    log into bastion01 from an outside system
    log into bastion02 from an outside system
    log into xen05 from bastion01
    bastion01:
      sudo /usr/sbin/puppetd --disable
      sudo /sbin/service openvpn start
    bastion02:
      sudo /sbin/service openvpn start
    xen05:
      sudo /sbin/shutdown -r now
    once the xen05/bastion02 server is back up, on bastion01:
      sudo /sbin/service openvpn stop
      sudo /usr/sbin/puppetd --enable
16. reboot puppet01
    log into bastion02 from an outside system
    ssh xen14
    sudo /sbin/shutdown -r now
17. re-enable DNS for proxy servers
    test proxy servers from puppet01
    edit dns in git puppet
    make ns1
18. re-enable nagios on internal/external
19. Set up transifex agent on app servers: app01 app02 app03 app04 app07
    sudo -u transifex /var/lib/transifex/ssh-add.sh -f
20. Log and report problems to list.
21. Close ticket.
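
The reboot ordering in steps 12 through 14 could be sketched as a small
shell helper along these lines. This is only a sketch: reboot_host,
reboot_group and plan_reboots are hypothetical names, not existing
infrastructure scripts, and it assumes the xen hosts named in step 12 are
the machines to reboot and that ssh plus sudo work from wherever it runs.
It defaults to a dry run that only prints what it would do.

```shell
#!/bin/sh
# Sketch only. Helper names are hypothetical; host names are taken from
# the checklist. With DRY_RUN=1 (the default) nothing is rebooted.
DRY_RUN="${DRY_RUN:-1}"

# Reboot one host, or print the command when in dry-run mode.
reboot_host() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: ssh $1 sudo /sbin/shutdown -r now"
    else
        ssh "$1" sudo /sbin/shutdown -r now
    fi
}

# Reboot each host given as an argument, in the order given.
reboot_group() {
    for h in "$@"; do
        reboot_host "$h"
    done
}

# Walk the checklist order: database hosts, then PHX2, then outside boxes.
plan_reboots() {
    # Step 12: hosts carrying db02 and db01 (assumption: reboot the xen host).
    reboot_group xen15 xen12
    # Step 13: PHX2 boxes.
    reboot_group xen03 xen04 xen06 xen07 xen09 xen10 xen11 xen13 backup01
    # Step 14: outside boxes (run sequentially here for simplicity).
    reboot_group cnode01 cnode02 cnode03 ibiblio01 internetx01 osuosl01 \
        people01 serverbeach1 serverbeach2 serverbeach3 serverbeach4 \
        serverbeach5 telia1 tummy1
}
```

Calling plan_reboots with DRY_RUN left at 1 prints the planned commands
in order so they can be pasted into the ticket for review. The bastion and
puppet01 reboots in steps 15 and 16 are left out on purpose, since they
need the manual openvpn and puppetd handling described above.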
--
Stephen J Smoogen.
“The core skill of innovators is error recovery, not failure avoidance.”
Randy Nelson, President of Pixar University.
"We have a strategic plan. It's called doing things."
— Herb Kelleher, founder Southwest Airlines