Greetings.
I think we have some lessons learned and things we could improve based
on the issues we ran into yesterday on the mass reboot/updates. ;)
Issues/Observations:
* We can't seem to complete everything in a 2 hour window. We should
block out more time, and/or have things more organized.
* The build system /mnt/koji issues are due to a guest that was moved
from one machine to a new one, and then somehow started on both after
reboots. ;(
* There were a few cases of too many cooks doing things at once to a
machine.
* Some physical machines were poorly labeled, or not labeled at all, in
things like PDUs and serial consoles.
* We need to be better about retiring machines. Sometimes it's hard to
tell what should be up and what shouldn't.
Ideas/improvements:
* I'd like to look at splitting all our hosts into 3 groups (based on
who we need to notify about reboots or outages):
a) End users will see/notice an outage if this machine is down/not
working.
b) Fedora package maintainers or contributors will notice if this
machine/service is not working/down.
c) Everything else. This includes things that would fit in the groups
above if they were single instances, but that are spread out, so they
can be rebooted/updated one at a time (e.g., app servers).
I've made a tentative list with all our hosts in these groups:
~kevin/mass-reboot-list on puppet01. Please look and see if you see
anything that looks wrong or needs adjusting.
With this split out, we can do any machines in 'c' as we like, as long
as we are careful. We can do 'b' machines if we announce to
devel-announce and schedule a window, and 'a' machines if we announce to
the main fedora announce list and schedule a window. All the windows
should be shorter than what we saw yesterday.
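
To make the split concrete, here is a rough sketch of how the a/b/c
rules could be written down. It is only an illustration: the
RebootPolicy type, the function names and the host names in the usage
note are all made up, and it does not read the real mass-reboot-list.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class RebootPolicy:
    announce_to: Optional[str]  # list to notify, or None if no announcement
    needs_window: bool          # whether an outage window must be scheduled


# Group letters follow the a/b/c split described above.
POLICIES: Dict[str, RebootPolicy] = {
    "a": RebootPolicy("main fedora announce list", True),
    "b": RebootPolicy("devel-announce", True),
    "c": RebootPolicy(None, False),
}


def plan_reboots(hosts: Dict[str, str]) -> Dict[str, List[str]]:
    """Sort hosts into their groups. `hosts` maps hostname -> group letter."""
    plan: Dict[str, List[str]] = {group: [] for group in POLICIES}
    for host, group in sorted(hosts.items()):
        if group not in POLICIES:
            raise ValueError(f"{host}: unknown group {group!r}")
        plan[group].append(host)
    return plan


def describe(group: str) -> str:
    """Say in one line what has to happen before touching this group."""
    policy = POLICIES[group]
    if policy.needs_window:
        return (f"group {group}: announce to {policy.announce_to} "
                "and schedule a window")
    return f"group {group}: no announcement, reboot one at a time, carefully"
```

For example, plan_reboots({"host-x": "a", "host-y": "c"}) puts host-x
under 'a' and host-y under 'c', and describe("b") reminds us that 'b'
machines need a devel-announce mail and a window.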
* We might look at having an updates meister (czar? :) who would be the
only one allowed to touch machines in a read/write way. By default
everyone else is hands off unless the updates meister asks them to
work on something. This would allow us to not interfere with each
other or duplicate effort.
* Seth is working on tooling to tell us any time we have a virtual
machine that's set to start on boot but not started now, or not set
to start on boot but started now.
* We need to go and label things in all the PDUs etc. I can look at
doing that and writing up a file somewhere with all the places a
particular machine is. Then, the ones we can't find, we will fill in
when smooge and I are out at phx2.
* I have started a SOP for retiring machines. It needs a lot of work:
https://fedoraproject.org/wiki/Infrastructure_retire_machine_SOP
Please modify and clean it up. The goal should be making it very clear
when a machine has been retired so we don't confuse it with anything
active.
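
As an illustration of the virtual machine check mentioned above (this
is not Seth's tool, just a minimal sketch of the comparison), the logic
boils down to two set differences, plus a check for a guest running on
more than one host, which is what bit us with /mnt/koji. How the guest
lists get collected from each host is left out, and the function names
are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def find_mismatches(autostart: Iterable[str],
                    running: Iterable[str]) -> Dict[str, List[str]]:
    """Compare guests set to start on boot with guests running now (one host)."""
    autostart_set, running_set = set(autostart), set(running)
    return {
        # set to start on boot, but not started now
        "autostart_but_stopped": sorted(autostart_set - running_set),
        # started now, but not set to start on boot
        "running_without_autostart": sorted(running_set - autostart_set),
    }


def find_duplicates(running_by_host: Dict[str, Iterable[str]]) -> Dict[str, List[str]]:
    """Return guests that are running on more than one host at the same time."""
    hosts_for = defaultdict(list)
    for host, guests in sorted(running_by_host.items()):
        for guest in guests:
            hosts_for[guest].append(host)
    return {guest: hosts for guest, hosts in hosts_for.items() if len(hosts) > 1}
```

An empty result from both functions would mean every guest is in the
state we expect after a reboot; anything else is worth a look before
we call the window done.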
There is a new rhel5 kernel out (yes, right after we applied the last
one yesterday), so I would suggest we look at implementing some or all
of these that make sense for those updates. ;)
Thoughts? Rants? More suggestions?
kevin