On Tue, 20 Mar 2012 14:44:07 -0400
seth vidal <skvidal(a)fedoraproject.org> wrote:
The discussion on devel list about ARM and my work last week on
reinstalling builders quickly and commonly has raised a number of
issues with how we manage our builders and how we should manage them
in the future.
It is apparent that if we add arm builders they will be lots of
physical systems (probably in a very small space) but physical,
none-the-less. So we need a sensible way to manage and reinstall these
hosts commonly and quickly.
Today there is not a way to do an anaconda install on any arm system.
though hopefully we will have that for deployment.
Additionally, we need to consider what the introduction of a largish
number of arm builders (and other arm infrastructure) would do to our
existing puppet setup: specifically, it would overload it pretty badly
and make it not very manageable.
probably we would be adding 100-300 systems. not only do we need to
consider overloading of puppet, but also logging and monitoring. I
guess it's more a question of how we scale our infrastructure from, at
a guess, ~100 nodes today to 3 to 4 times that.
I'm making certain assumptions here and I'd like to be clear
what those are:
1. the builders need to be kept pristine
2. that currently our builders are not freshly installed frequently
3. that the builders are relatively static in their
configuration and most changes are done with pkg additions
4. that builder setups require at least two manual-ish steps of a koji
admin who can disable/enable/register the builder with the kojihub.
5. that the builders are fairly different networking and setup-wise to
the rest of our systems.
So I am proposing that we consider the following as a general process
for maintaining our builders:
1. disable the builder in koji
2. make sure all jobs are finished
3. add installer entries into grub (or run the undefine, reinstall
process if the builder is virt-based)
4. reinstall the system
5. monitor for ssh to return
6. connect in and force our post-install configuration:
identification, network, mount-point setup, ssl certs/keys for koji,
etc
7. reboot
8. re-enable host in koji
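
The eight steps above could be driven by a small orchestrator along these lines. This is only a sketch under assumptions: the `BuilderActions` hooks, the `reinstall_builder` function, and the timeout values are hypothetical names invented for illustration, and each hook would need to wrap whatever real tooling (koji admin commands, ssh, the virt reinstall process) ends up being used.

```python
import time
from typing import Callable, Protocol


class BuilderActions(Protocol):
    """Hypothetical hooks; each one would wrap a real tool in practice."""

    def disable_in_koji(self, host: str) -> None: ...
    def running_jobs(self, host: str) -> int: ...
    def trigger_reinstall(self, host: str) -> None: ...
    def ssh_is_up(self, host: str) -> bool: ...
    def apply_post_install(self, host: str) -> None: ...
    def reboot(self, host: str) -> None: ...
    def enable_in_koji(self, host: str) -> None: ...


def wait_until(check: Callable[[], bool], timeout: float, interval: float,
               sleep: Callable[[float], None] = time.sleep) -> None:
    """Poll check() until it returns True, or raise once timeout has passed."""
    deadline = time.monotonic() + timeout
    while not check():
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met in time")
        sleep(interval)


def reinstall_builder(host: str, actions: BuilderActions,
                      timeout: float = 3600.0, interval: float = 30.0,
                      sleep: Callable[[float], None] = time.sleep) -> None:
    """Run the proposed steps in order for a single builder."""
    actions.disable_in_koji(host)                           # step 1
    wait_until(lambda: actions.running_jobs(host) == 0,     # step 2
               timeout, interval, sleep)
    actions.trigger_reinstall(host)                         # steps 3 and 4
    wait_until(lambda: actions.ssh_is_up(host),             # step 5
               timeout, interval, sleep)
    actions.apply_post_install(host)                        # step 6
    actions.reboot(host)                                    # step 7
    # A real run would probably wait for the host to come back here;
    # that wait is not part of the list above.
    actions.enable_in_koji(host)                            # step 8
```

Because any failure raises before step 8, a builder that does not finish the process stays disabled in koji rather than coming back half-configured.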
We would do this with frequency and regularity. Perhaps even having
some percentage of our builders doing this at all times. I.e., 1/10th of
the boxes reinstalling at any given moment, so that after 10 times the
time one batch takes, all of them have been reinstalled.
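
To make the 1/10th arithmetic concrete, here is a rough sketch of splitting the builders into rolling batches. The function names, the default of 10 groups, and the example hostnames are assumptions for illustration only, not anything that exists today.

```python
from typing import List


def reinstall_batches(builders: List[str], groups: int = 10) -> List[List[str]]:
    """Split builders into at most `groups` consecutive batches.

    Batch size is the ceiling of len(builders) / groups, done in integer
    arithmetic so there is no float rounding to worry about.
    """
    size = max(1, -(-len(builders) // groups))
    return [builders[i:i + size] for i in range(0, len(builders), size)]


def full_cycle_hours(builders: List[str], hours_per_batch: float,
                     groups: int = 10) -> float:
    """Time for every builder to be reinstalled once, one batch at a time."""
    return len(reinstall_batches(builders, groups)) * hours_per_batch
```

With 20 hypothetical builders and the default of 10 groups this gives 10 batches of 2, so only two boxes are out of the pool at any moment and the whole farm is refreshed after 10 batch periods.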
honestly we could do this instead of the monthly updates. just rebuild
Additionally, this would mean these systems would NOT have a puppet
management piece at all. Package updates would still be handled
by pushes as we do now, if things were security critical, but barring
the need for significant changes we could rely on the boxes simply
being refreshed frequently enough that it wouldn't need to be pushed.
I'm ok with that, I'm pretty sure FAS will scale to the extra boxes. do
we drop monitoring of the builders? what about collectd etc.?
What do folks think about this idea? It would dramatically reduce
node entries in our puppet config, it would drop the number of hosts
connecting to puppet, too. It will mean more systems being reinstalled
and more often. It will also require some work to make the steps I
mention above be automated. I think I can achieve that without too
much difficulty, actually. I think, in general, it will increase our
ability to scale up to more and more builders.
main issue is that today we are not 100% sure of how we will install
arm boxes. how do we deal with all the non puppet related systems? also
need to look into how we can better scale koji itself. when we go from
20 to 200+ builders we need to make sure that load doesn't cause koji
to fall over.
all the arm boxes will have management consoles, but today I'm not 100%
sure how access to those would work. we would also need to deploy fedora
for any arm based systems. another thing we need to reconsider is
networking: today the storage network and the builder networks are /24's,
so we could use 253 nodes. I suspect we will go over that on the build
network. we could leave the storage network off the arm builders, since it
is really only needed for createrepo. but we may need to look at expanding
kojipkgs to more nodes, or increasing its network throughput with multiple
bonded gig network ports. think mass rebuild and 100 or 200 buildroots
initialising at once. it will stress our resources on all levels. but
the flexibility of so many nodes could allow us to deploy solid
solutions to scale and show that fedora is still the leader in open
infrastructure and sets industry best practices.
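
For the /24 question, the address arithmetic can be checked with a short sketch like the one below. The 10.0.0.0/24 network in the usage note is a made-up example, the helper names are hypothetical, and the single reserved address (e.g. a gateway) is an assumption chosen because it reproduces the 253 figure.

```python
import ipaddress


def usable_hosts(cidr: str, reserved: int = 0) -> int:
    """Usable IPv4 host addresses in a network, minus any reserved ones.

    Subtracts the network and broadcast addresses, so this is only
    meaningful for prefixes of /30 or shorter.
    """
    net = ipaddress.ip_network(cidr)
    return net.num_addresses - 2 - reserved


def smallest_prefix_for(nodes: int, reserved: int = 1) -> int:
    """Longest IPv4 prefix (smallest network) that still fits `nodes` hosts."""
    for prefix in range(30, 0, -1):
        if 2 ** (32 - prefix) - 2 - reserved >= nodes:
            return prefix
    raise ValueError("no IPv4 prefix is large enough")
```

Here `usable_hosts("10.0.0.0/24", reserved=1)` gives 253, and `smallest_prefix_for(300)` gives 23, so a build network of 300 or so nodes would need at least a /23 (or a second /24).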