Mass reboots/updates improvements

Wednesday, 1 June 2011

Greetings. 

I think we have some lessons learned and things we could improve based
on the issues we ran into yesterday on the mass reboot/updates. ;) 

Issues/Observations: 

* We can't seem to complete everything in a 2 hour window. We should
  block out more time, and/or have things more organized. 

* The build system /mnt/koji issues were due to a guest that was moved
  from one machine to a new one, and then somehow got started on both
  after the reboots. ;( 

* There were a few cases of too many cooks doing things at once to a
  machine. 

* Some physical machines were poorly labeled, or not labeled at all, in
  things like PDUs and serial consoles. 

* We need to be better about retiring machines. Sometimes it's hard to
  see what should be up and what shouldn't. 

Ideas/improvements: 

* I'd like to look at splitting all our hosts into 3 groups (based on
  who we need to notify about reboots or outages): 

a) End users will see/notice an outage if this machine is down/not
working.

b) Fedora package maintainers or contributors will notice if this
machine/service is not working/down. 

c) Everything else. This includes things that would fit in the above if
they were single instances, but that are spread across several machines
and so can be rebooted/updated one at a time (e.g., app servers). 

I've made a tentative list with all our hosts in these groups: 
~kevin/mass-reboot-list on puppet01. Please take a look and let me know
if anything looks wrong or needs adjusting. 

With this split out, we can do any machines in 'c' as we like, as long
as we are careful. We can do 'b' machines if we announce to
devel-announce and schedule a window, and 'a' machines if we announce to
the main fedora announce list and schedule a window. All the windows
should be shorter than what we saw yesterday. 
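
As a rough illustration of the a/b/c rules above, here is a minimal
Python sketch that works out which announcements and whether a
scheduled window are needed for a given batch of hosts. The function
name, the data layout, and the host names in the usage note are all
made up for the example; they are not taken from the real
mass-reboot-list.

```python
# Rules per group, following the a/b/c split described above.
# The dictionary layout itself is an assumption for illustration.
GROUP_RULES = {
    "a": {"announce": "main fedora announce list", "window": True},
    "b": {"announce": "devel-announce", "window": True},
    "c": {"announce": None, "window": False},
}


def reboot_requirements(host_groups, hosts):
    """Return (lists to announce to, whether a window must be scheduled).

    host_groups maps a host name to its group letter ("a", "b" or "c").
    hosts is the batch of hosts we want to reboot/update.
    """
    announce = set()
    needs_window = False
    for host in hosts:
        rule = GROUP_RULES[host_groups[host]]
        if rule["announce"]:
            announce.add(rule["announce"])
        needs_window = needs_window or rule["window"]
    return sorted(announce), needs_window
```

For example, with a hypothetical mapping of {"web01": "a", "build01":
"b", "app01": "c"}, asking about ["app01"] alone needs no announcement
and no window, while any batch containing "build01" needs a
devel-announce mail and a scheduled window.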

* We might look at having an updates meister (czar? :) who would be the
  only one allowed to touch machines in a read/write way. By default
  everyone else is hands off unless the updates meister asks them to
  work on something. This would keep us from interfering with each
  other or duplicating effort. 

* Seth is working on tooling to tell us any time we have a virtual
  machine that's set to start on boot but not started now, or not set
  to start on boot but started now. 

* We need to go and label things in all the PDUs, etc. I can look at
  doing that and writing up a file somewhere listing all the places a
  particular machine shows up. Then, the ones we can't find, we will
  fill in when smooge and I are out at phx2. 

* I have started an SOP for retiring machines. It needs a lot of work: 
https://fedoraproject.org/wiki/Infrastructure_retire_machine_SOP
Please modify and clean it up. The goal should be making it very clear
when a machine has been retired so we don't confuse it with anything
active. 
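
To make the virtual machine check mentioned above a bit more concrete,
here is a minimal sketch in Python. It is not Seth's tooling, just an
illustration of the comparison involved; the function names are
invented, and how the guest lists get collected from each host is
deliberately left out. The second function is an extra that is not in
the list above: it flags a guest running on more than one host, which
is the situation that bit us with /mnt/koji.

```python
def autostart_mismatches(autostart, running):
    """Compare guests set to start on boot against guests running now."""
    autostart, running = set(autostart), set(running)
    return {
        # Set to start on boot, but not started now.
        "autostart_but_stopped": sorted(autostart - running),
        # Started now, but not set to start on boot.
        "running_without_autostart": sorted(running - autostart),
    }


def guests_on_multiple_hosts(running_by_host):
    """Return {guest: [hosts]} for guests running on more than one host."""
    seen = {}
    for host, guests in running_by_host.items():
        for guest in guests:
            seen.setdefault(guest, []).append(host)
    return {g: sorted(h) for g, h in seen.items() if len(h) > 1}
```

Both functions only take plain lists of names, so they can be fed from
whatever inventory or virtualization tool is handy.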

There is a new rhel5 kernel out (yes, right after we applied the last
one yesterday), so I would suggest we look at implementing whichever of
these ideas make sense in time for those updates. ;) 

Thoughts? Rants? More suggestions?

kevin
