Outage notes

seth vidal skvidal at fedoraproject.org
Wed Jan 12 15:06:58 UTC 2011


Hi Everyone,
 I took some notes while we were rebooting boxes and wanted to share them
with everyone for future outages.

Ordering of the bounces:
1. xen14: puppet is on there, and if that is back up first we have a
place to stand for pushing out any changes (dns changes via puppet, for
example) - xen14 takes about 4 minutes to restart/POST

2. xen15: bastion01, db02 are on there - same 4 minute restart window.
   Once this is up you'll want to log out of bastion02 and into bastion01
so you have a firm place to do the xen05 reboots from, which will take
out bastion02.

3. edit dns on puppet to remove proxy01 from the wildcard/roundrobin and
push that to the ns* servers and verify. 
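
For the verify step, something like this works - the ns hostnames and
the name being queried are just stand-ins here, since the round robin
hangs off a wildcard record:

    # ask each authoritative server for the round-robin A records
    for ns in ns01.fedoraproject.org ns02.fedoraproject.org; do
        echo "== $ns =="
        dig +short @"$ns" anything.fedoraproject.org A
    done
    # proxy01's address should no longer be in that list
    dig +short proxy01.fedoraproject.org A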

4. xen05: bastion02 (openvpn), proxy01 - 4-5 minutes for this machine to
restart.

once xen05 is completely up, log back in and verify the vpn is back
online

5. edit dns: remove all the other proxy hosts and put proxy01 back in.
Push and verify.

6. virthost01 - I had to halt each of the kvm guests from a login - virsh
shutdown didn't work. 4 minute restart time on the hw.
Note: make sure virthost01 is completely up - especially fas03 - since
taking down virthost02 next will take out fas02, and you don't want to
leave fas01 all by itself.
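
For reference, when virsh shutdown just hangs, a sequence like this is
worth trying before logging into every guest (the guest name below is
only a placeholder):

    virsh list --all                    # see what's actually running
    virsh shutdown someguest            # ask for a clean shutdown
    sleep 60
    virsh list --all | grep someguest   # still running?
    virsh destroy someguest             # hard stop - last resort only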

7. virthost02 - fas02 was not setup to autostart - that's now fixed.
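
The fix there is just flagging the guest in libvirt, roughly:

    virsh autostart fas02
    virsh dominfo fas02 | grep -i autostart   # should now say 'enable'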

8. virthost13 - uneventful

9. xen03 - spin01 spewed lots of umount issues - those are from the spin
creation paths - they can be safely ignored
         - fas01.stg was running on xen03 according to the logs, but
there's no definition for it on the system - so I'm not sure what the
story is there.
         - neither of the other staging hosts was set to autostart

10. xen04: we apparently have a number of hosts w/ only one dns record
internally and they point to ns03 only - b/c when ns03 went away, lots
of things got VERY VERY SLOW trying to resolve names. This is on my list
to address. You must wait for xen04 to be completely up and ns03 running
before you can take down xen07. Otherwise we'll be w/o dns internally to
phx.
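
Until that's addressed, if the problem is hosts with a single nameserver
in resolv.conf, one stopgap is listing both internal servers and
shortening the resolver timeout - the addresses below are placeholders:

    # /etc/resolv.conf
    options timeout:1 attempts:2 rotate
    nameserver <ns03 address>
    nameserver <ns04 address>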

11. xen07: iscsi disks didn't come up right away - this kept ns04 from
coming up immediately - needed to run /etc/init.d/iscsi start and they
showed up.
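
After starting iscsi it's worth confirming the sessions and block devices
really came back before moving on:

    /etc/init.d/iscsi start
    iscsiadm -m session                       # sessions should be logged in
    ls -l /dev/disk/by-path/ | grep -i iscsi  # the disks should show up here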

12. xen09: uneventful

13. xen10: log01 needed an fsck b/c of the time since last mount - this
took a long time.
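
We can see these coming next time - tune2fs shows the mount count and
check interval that trigger the boot-time fsck (the device path below is
just an example):

    tune2fs -l /dev/mapper/somevg-log01 \
        | grep -Ei 'mount count|last checked|check interval|next check'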

14. xen11: secondary1 needed an fsck. also a 5-6 minute hw reboot time.

15. xen12: the db1->db01 naming change kept it from coming up at boot b/c
the 'auto' symlink still pointed to db1. db01 also had to fsck.
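
For the record, the xen autostart symlinks live in /etc/xen/auto
(assuming the standard layout), so after a rename the stale link has to
be swapped out by hand:

    ls -l /etc/xen/auto/           # spot the dangling db1 link
    rm /etc/xen/auto/db1
    ln -s /etc/xen/db01 /etc/xen/auto/db01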

16. cnode01 - 6-7 minute reboot time - nothing was set to autostart in
xen - this is now fixed - autoqa01 and dhcp02 are set to autostart

17. db03: fsck took FOREVER to complete and this takes a lot of things
down - for the future, move the db03 reboot higher up the stack, just in
case. This machine's restart/POST time is REALLY high - like 7-10
minutes. The console for it is less than forthcoming, too.

18. backup01: uneventful

At this point internal was back online - except for the build xen
systems and servers.

External hosts:

19. - bodhost01: 5-6 minute machine reboot time
   - people01 - uneventful.
   - ibiblio01 - 5-7 minute machine reboot time. uneventful
20. - internetx01: uneventful
   - osuosl01: uneventful
21. - sb2 - must wait for ibiblio01 to be up b/c of not having any
external name servers
   - sb3 - uneventful
   - sb4 - hosted1 listed more 'maxmem' in its config than sb4 had
available - so that had to be edited down. Not sure how that EVER
started. (A quick way to check is sketched after this list.)
   - sb5 - uneventful
22. telia01 - proxy5 did not restart on its own - unknown as to WHY yet
- but it did start manually.
           - retrace01 was not set to autostart
   - tummy1 - uneventful
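
For the sb4/hosted1 case above, the memory cap lives in the xen guest
config and the dom0 headroom is visible from xm info - the config path
here follows the example and may differ:

    grep -E '^(memory|maxmem)' /etc/xen/hosted1
    xm info | grep -E 'total_memory|free_memory'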

Now all the proxy* rebooting is over so we can:

23.  edit dns: put the other proxy hosts in the wildcard/RR - push and
verify


Build boxes: 
  - bxen03 had koji2 listed in its set of hosts - but it wasn't running.
This led to some confusion as to how to start the hosts on bxen03 b/c
of insufficient memory for all guests. Eventually I realized bxen04 is
where koji02 was running and that the leftover guest file was never
cleaned up on bxen03.
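
Cleaning that up should just be a matter of dropping the stale definition
on bxen03 once you're sure the guest really lives on bxen04 - roughly
(the config path is a guess and may differ):

    xm list | grep -i koji      # make sure it isn't running here
    rm /etc/xen/koji2           # drop the leftover guest file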
 

Things to think about post-outage:
 - check all the raid arrays for lost disks - we saw this a couple of
times - it's not pleasant.
 - check for downed vpns and/or broken resolution - we need to get a
firm handle on why this is a hassle so often.
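
A quick loop over our host list catches both of those at once (the file
name is just a placeholder):

    while read h; do
        host "$h" > /dev/null 2>&1 || echo "no resolution: $h"
        ping -c1 -W2 "$h" > /dev/null 2>&1 || echo "no ping: $h"
    done < hosts.txt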
 

Overall things to think about for the future:
1. dumping a complete virsh list - including how much memory is actually
being used per vm per server - before we start reboots (a rough sketch
covering this and items 3 and 4 follows this list)
2. checking what disks need fscks because of mounted time and doing
those earlier or separately.
3. verifying that all running vms:
   a. are intended to be running
   b. have a config file
   c. are set to autostart
4. verifying that all NOT running vms:
   a. are intended to be off
   b. are NOT set to autostart
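
A rough sketch of 1/3/4 for the libvirt hosts - paths assume the stock
/etc/libvirt/qemu layout, and the xen boxes would want the same thing
done against xm and /etc/xen/auto:

    #!/bin/bash
    echo "== guests and state on $(hostname) =="
    virsh list --all
    echo "== memory actually in use per running guest =="
    for dom in $(virsh list | awk 'NR>2 && $2 {print $2}'); do
        printf '%-20s' "$dom"
        virsh dominfo "$dom" | awk -F: '/Used memory/ {print $2}'
    done
    echo "== guests with a config file =="
    ls /etc/libvirt/qemu/*.xml 2>/dev/null
    echo "== guests set to autostart =="
    ls /etc/libvirt/qemu/autostart/ 2>/dev/null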

thoughts welcome.
-sv





