Lessons learned: Initial check/step list for updates

Stephen John Smoogen smooge at gmail.com
Fri Sep 10 01:47:22 UTC 2010


 0. Plan time in infrastructure at lists.fedoraproject.org
 1. Open ticket on infrastructure for downtime.
      Updates will occur during the day
      Reboots will occur during the evening
 2. Send email to devel-announce, announce, infrastructure
 3. Update servers during working hours and work out issues in ticket.
   ** releng updates the following boxes:
       cvs01, pkgs01, nfs01, bnfs01, bxen*,
       x86-*, ppc*, koji*, db03, xb-01,
       compose-*, sign-vault01
 4. Change DNS to turn off proxy on bodhost01 (or similar external
      proxy server).
 5. Reboot bodhost01
 6. Confirm proxy is working on bodhost01; fix any issues.
 7. Change proxy DNS to point only to bodhost01
 8. Turn off nagios for servers.
 9. Turn off nagios-external for services.

10. Reboot order matters:
11.   releng deals with the boxes listed above unless told otherwise.
12.   reboot database servers first
        xen15: db02
        xen12: db01
13.   reboot PHX2 boxes
        xen03:
        xen04:
        xen06:
        xen07:
        xen09:
        xen10:
        xen11:
        xen13:
        backup01:
14.   reboot Outside boxes (can be in parallel to PHX2)
        cnode01:
        cnode02:
        cnode03:
        ibiblio01:
        internetx01:
        osuosl01:
        people01:
        serverbeach1:
        serverbeach2:
        serverbeach3:
        serverbeach4:
        serverbeach5:
        telia1:
        tummy1:
15.   reboot bastion.fedoraproject.org
        log into bastion01 from an outside system
        log into bastion02 from an outside system
        log into xen05 from bastion01
        bastion01:
          sudo /usr/sbin/puppetd --disable
          sudo /sbin/service openvpn start
        bastion02:
          sudo /sbin/service openvpn start
        xen05:
          sudo /sbin/shutdown -r now
        once xen05/bastion02 is back up, undo the temporary changes on
        bastion01:
          sudo /sbin/service openvpn stop
          sudo /usr/sbin/puppetd --enable
16.   reboot puppet01
        log into bastion02 from an outside system
        ssh xen14
           sudo /sbin/shutdown -r now
17.   re-enable DNS for proxy servers
        test proxy servers from puppet01
        edit dns in git puppet
        make ns1
18.   re-enable nagios on internal/external
19. Set up transifex agent on app servers: app01 app02 app03 app04 app07
     sudo -u transifex /var/lib/transifex/ssh-add.sh -f
20. Log and report problems to list.
21. Close ticket.
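
The reboot ordering in steps 12 to 16 could be rehearsed with a small dry-run
script along these lines. This is only a sketch: it prints commands and runs
nothing, and it assumes that an entry such as "xen15: db02" means host xen15
carries guest db02, and that each host would be reached over ssh. The function
name reboot_plan and the variable names are made up for illustration. The
outside boxes from step 14 are left out because they can be done in parallel,
and the openvpn/puppetd handling around xen05 from step 15 is not covered.

```shell
#!/bin/sh
# Dry-run sketch: print the reboot commands in the order from the checklist.
# Nothing here contacts a host; the ssh wrapping is an assumption.

# Host groups copied from steps 12, 13, 15 and 16 of the checklist.
DB_HOSTS="xen15 xen12"
PHX2_HOSTS="xen03 xen04 xen06 xen07 xen09 xen10 xen11 xen13 backup01"
LAST_HOSTS="xen05 xen14"   # bastion host first, then the puppet01 host

# reboot_plan (hypothetical helper): emit one command line per host,
# database hosts first, then PHX2, then the bastion and puppet hosts.
reboot_plan() {
    for host in $DB_HOSTS $PHX2_HOSTS $LAST_HOSTS; do
        echo "ssh $host sudo /sbin/shutdown -r now"
    done
}

reboot_plan
```

Reviewing the printed list against the ticket before the evening window is
one way to catch a host that was added or retired since the last update.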

-- 
Stephen J Smoogen.
“The core skill of innovators is error recovery, not failure avoidance.”
Randy Nelson, President of Pixar University.
"We have a strategic plan. It's called doing things."
— Herb Kelleher, founder Southwest Airlines