builders of the future!!!!!
by Seth Vidal
The discussion on the devel list about ARM, and my work last week on
reinstalling builders quickly and routinely, have raised a number of
issues with how we manage our builders and how we should manage them in
the future.
It is apparent that if we add ARM builders there will be lots of
physical systems (probably in a very small space), but physical
nonetheless. So we need a sensible way to manage and reinstall these
hosts routinely and quickly.
Additionally, we need to consider what the introduction of a largish
number of ARM builders (and other ARM infrastructure) would do to our
existing puppet setup; specifically, it would overload it pretty badly
and make it hard to manage.
I'm making certain assumptions here and I'd like to be clear about what
those are:
1. the builders need to be kept pristine
2. our builders are not currently reinstalled from scratch frequently
enough
3. the builders are relatively static in their configuration, and most
changes are package additions
4. builder setup requires at least two manual-ish steps by a koji admin,
who must disable/enable/register the builder with the kojihub
5. the builders are fairly different, networking- and setup-wise, from
the rest of our systems
So I am proposing that we consider the following as a general process
for maintaining our builders:
1. disable the builder in koji
2. make sure all jobs are finished
3. add installer entries into grub (or run the undefine, reinstall
process if the builder is virt-based)
4. reinstall the system
5. monitor for ssh to return
6. connect in and force our post-install configuration: identification,
network, mount-point setup, ssl certs/keys for koji, etc
7. reboot
8. re-enable host in koji
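A minimal sketch of how steps 1-8 might be scripted for a single builder,
assuming koji admin credentials and root ssh access; the trigger-reinstall
and builder-postconfig helpers, and the exact task query, are hypothetical
pieces we would still have to write:

    #!/bin/bash
    builder=$1
    # 1/2. stop handing the builder new jobs, then let running tasks drain
    koji disable-host "$builder"
    while koji list-tasks | grep -q "$builder"; do   # exact task query is an assumption
        sleep 60
    done
    # 3/4. kick off the reinstall (grub installer entry, or undefine/reinstall if virt)
    ssh "root@$builder" /usr/local/sbin/trigger-reinstall    # hypothetical helper
    # 5. wait for ssh to return on the freshly installed box
    until ssh -o ConnectTimeout=5 "root@$builder" true 2>/dev/null; do
        sleep 30
    done
    # 6. force post-install configuration: identity, network, mounts, koji certs/keys
    ssh "root@$builder" /usr/local/sbin/builder-postconfig   # hypothetical helper
    # 7/8. reboot, wait for it to come back, then re-enable the host in koji
    ssh "root@$builder" reboot
    sleep 120
    until ssh -o ConnectTimeout=5 "root@$builder" true 2>/dev/null; do
        sleep 30
    done
    koji enable-host "$builder"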
We would do this frequently and regularly, perhaps even having some
percentage of our builders doing it at all times; e.g., with 1/10th of
the boxes reinstalling at any given moment, every builder gets
reinstalled within ten reinstall windows.
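For the rotation itself, something very simple would do; a sketch, assuming
a flat file of builder hostnames at /srv/builders.txt (a hypothetical
inventory) and a ten-day cycle:

    #!/bin/bash
    cycle=10
    today=$(( 10#$(date +%j) % cycle ))   # day-of-year, forced to base 10
    i=0
    while read -r builder; do
        # every builder lands in exactly one of the ten daily buckets
        if (( i % cycle == today )); then
            echo "reinstall today: $builder"
        fi
        i=$(( i + 1 ))
    done < /srv/builders.txt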
Additionally, this would mean these systems would NOT have a puppet
management piece at all. Package updates would still be handled by
pushes, as we do now, if things were security critical; but barring
the need for significant changes, we could rely on the boxes simply being
refreshed frequently enough that changes wouldn't need to be pushed at all.
What do folks think about this idea? It would dramatically reduce the
node entries in our puppet config and drop the number of hosts
connecting to puppet. It would mean more systems being reinstalled,
more often, and it will require some work to automate the steps I
mention above; I think I can achieve that without too much
difficulty, actually. In general, I think it will increase our ability
to scale up to more and more builders.
I'd like constructive input, please.
Thanks,
-sv
qa machine management
by Kevin Fenzi
Greetings.
Just had a talk with tflink on IRC about the management of the qa
network machines. Long ago, when we set up those machines, we were
thinking we could use them as a testbed for bcfg2, to see if we wanted
to start using it or if it worked ok, etc. I set up a bcfg2 server to
try this with, but sadly I have never found the time to even start
configuring it.
Machines involved:
virthost-comm01.qa (real hardware)
autoqa01.qa (guest)
autoqa-stg01.qa (guest)
lockbox-comm01.qa (guest)
bastion-comm01.qa (guest)
(someday we may add a sign-bridge-comm01 and sign-vault-comm01 to allow
secondary archs like ppc and arm to sign packages).
Options:
- Try to push forward with a bcfg2 setup on lockbox-comm01.qa and
evaluate it. This would be nice, but I'm really not sure anyone has
the time to do it.
- Just add all the above machines to our puppet repo, configure them
there, and call it done. This would mean they wouldn't be separate
from us; we would just update, configure, and monitor them like any
other machine.
- Try to work out some setup with ansible or the like to see if it
could manage them (a rough sketch follows this list). Again, there would
be a learning and tweaking curve, so I'm not sure we have the time.
- We could set up a new puppet instance for them on lockbox-comm01.qa
and use that to manage them. We could reuse a lot of our current puppet
setup, but it would still be a fair bit of work to get it all
configured.
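For what it's worth, the ansible option would not need much to get
started; a minimal sketch (the inventory location and group name are
assumptions, not anything that exists today):

    # /etc/ansible/hosts -- put the qa boxes in one group:
    #   [qa]
    #   virthost-comm01.qa
    #   autoqa01.qa
    #   autoqa-stg01.qa
    #   lockbox-comm01.qa
    #   bastion-comm01.qa
    #
    # ad-hoc sanity check and package update across the group:
    ansible qa -m ping
    ansible qa -m yum -a "name=* state=latest"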
Thoughts? Brilliant ideas?
kevin
List of needed package reviews for fedmsg to move forwards
by Ralph Bean
Here is the list of needed package reviews for fedmsg.
- 810335
- 720818
- 811689
- 811732
- 811739
- 811750
- 811759
- 811769
- 811782
- 812030
- 812059
These are already done:
- 810033
- 810382
- 810386
These two are not package reviews, but are tickets that need to be
resolved in order to move forwards with fedmsg in stg:
- 813925
- 813915
All of the above are dependencies of the latest major version bump of
Moksha. I haven't yet submitted the review request for python-fedmsg itself,
but it's coming soon.
Seeking tasks - Program for GSoC returning students
by Buddhike Kurera
Hello
The GSoC 2012 selection process is over! We had high demand from
students, and since we are offering a limited number of slots we
couldn't accept all of the good students.
Therefore we are planning to launch a program for returning students
(who were not selected for GSoC with Fedora).
The structure of the program is not yet finalized, but we certainly need
some tasks for students to work on. If you are interested in adding a
task to the list[1], please feel free.
Please note the number of hours needed to complete the task and the
contact details of the person who should be contacted for more
information.
Nothing is finalized yet, so please join us if you are interested in
shaping the program.
Thanks
[1] https://fedoraproject.org/wiki/User:Bckurera/soc/task_list
--
Regards,
Buddhike Chandradeepa Kurera(bckurera)
Fedora Ambassador - APAC region
Event Liaison - Design Team
Email: bckurera@fedoraproject.org | IRC: bckurera
Meeting Agenda Item: Introduction Keith McGrellis
by Keith McGrellis
Hi,
My name is Keith McGrellis and I live in Belfast, Northern Ireland.
I've worked with Linux both at work and at home for about 15 years now.
I've used a mixture of distributions, including Red Hat, Debian,
Ubuntu and Fedora.
My main background is sys admin and scripting (mainly bash and perl).
I have experience with:
Apache
MySQL
Bind
DHCP
OpenLDAP
NIS
Nagios
OpenNMS
and other general Linux admin.
I don't know Python, but I am willing and would like to learn.
I would like to be able to help out in whatever way I can.
My irc nick is kmcgrell
Regards,
Keith
attempted hosted migration to gluster back end post-mortem
by Seth Vidal
As most/all people know, we attempted to migrate hosted to using a
gluster backend across two systems on Wednesday evening. Thursday we
awoke to a host of problems and tackled solving them. Thursday evening
we migrated back to our previous configuration.
Thanks for the patience on Thursday, everyone.
-sv
Below is an explanation of what happened:
The hosted migration started on Wednesday afternoon.
The plan was to move to glusterfs from a single node/drbd failover
configuration.
hosted01 and hosted02 would become 'hosted' - both serving files
from /srv (our glusterfs share).
Both systems were clients and servers (in the gluster sense):
- both systems exporting a brick of the same replica.
- both systems mounting that replicated share.
When mounting with fuse we started seeing pretty serious performance
issues, to the point that users were complaining it was not working. It
would take 20-30s to render a single ticket from trac.
We switched to nfs mounts and performance improved, but we saw an
enormous number of db locking issues on the servers.
At this point we contacted the gluster upstream developers who were
outrageously helpful in tracking down the problems.
After some research it was determined that:
- gluster 3.2 over nfs doesn't support any remote locking at all
- if we brought things down to 1 node and local_lock=all then things
would work and perform 'ok' but would not allow us to access from the
other client
- this meant we could replicate the fs but not use it from both hosts
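For reference, the single-node workaround above boils down to mounting the
gluster nfs export with all locking kept local to the client; roughly
(hostname and export name here are illustrative):

    mount -t nfs -o vers=3,local_lock=all hosted01:/hosted /srv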
After moving to gluster over nfs we ran into a new problem:
gluster's nfs server does not support --manage-gids, so we were
restricted to 16 gids per user. There is no solution outside of new code
for this one; investigation into doing that for gluster is ongoing.
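A quick diagnostic for how many accounts actually trip that limit (just a
sketch; it only assumes local account data is visible via getent):

    # list users with more than 16 supplementary groups, i.e. the ones
    # the 16-gid limit would break
    for u in $(getent passwd | cut -d: -f1); do
        n=$(id -G "$u" 2>/dev/null | wc -w)
        [ "$n" -gt 16 ] && echo "$u: $n groups"
    done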
jdarcy and pranithk were given sysadmin-hosted access to look at logs
directly on hosted01/02 and follow up on the split-brain reports we were
seeing.
jdarcy and pranithk tracked the self-heal/split-brain problems back to
dirs with out-of-sync fattrs. The only way to solve this was to
manually remove the out-of-sync fattrs after verifying that ONLY the
fattrs were out of sync and not any data.
This involved looking at all dirs with self-heal problems and running:
> setfattr -x trusted.afr.hosted-client-0 /glusterfs/hosted/$dir
> setfattr -x trusted.afr.hosted-client-1 /glusterfs/hosted/$dir
to clear those settings then reaccessing the dir at:
/srv/$dir
to force the self-heal to complete correctly.
At this point we did not appear to be having self-heal issues, but we
still had the group ids limited to 16 under the nfs clients.
The only option to resolve that is to patch the gluster nfs server to
do the equivalent of --manage-gids.
We attempted to see if we could optimize the fuse mounts to work around
the nfs limitations. We set up the fuse mount on hosted02 and did
performance tests - they were 'okay' but not really acceptable.
Additionally, after testing the fuse enhancements we were informed that
fuse suffers from the same 16 gid limitation that nfs does, so we were
completely dead in the water.
We punted back to hosted03 - re-rsyncing everything back.
We also set up a new host: hosted-list01.fedoraproject.org at internetx.
This will allow us to move the hosted mailing lists OFF of
fedorahosted.org, which gains us a lot of latitude in how we move
projects around that we did not have before.
We will start on the gluster migration + testing if/when we get a patch
for 3.3 from jdarcy to handle > 16 gids via nfs.
If that occurs, these are the tests to run once we get 3.3 and the
> 16 gid patch in place:
1. that nfs locking actually works (test with local_lock=none) and a
sqlite3 .dump:
   rm -f /srv/trac/projects/fedora-infrastructure/db/fixed.db
   sqlite3 /srv/trac/projects/fedora-infrastructure/db/trac.db | sqlite3 /srv/trac/projects/fedora-infrastructure/db/fixed.db
2. that writes with a gid beyond 16 work
3. that performance is palatable: cloning git repos
4. test trac with both systems
5. look for self-healing issues
6. failover testing. Kill one node and confirm other works with limited
problems.
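For test 3, a crude but sufficient check is timing a clone of one of the
larger hosted repos from the gluster-backed /srv (the repo path below is
illustrative, not a real project):

    time git clone /srv/git/some-large-project.git /tmp/clone-perf-test
    rm -rf /tmp/clone-perf-test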
Things to do before production of gluster:
- MOVE GITWEB CACHING OFF OF /srv
Many thanks to the gluster dev team for helping us track down where the
problems were coming from and for trying to help us fix them. Their
help was indispensable.
Plan for tomorrow's Fedora Infrastructure meeting (2012-04-26)
by Kevin Fenzi
The infrastructure team will be having its weekly meeting tomorrow,
2012-04-26 at 18:00 UTC in #fedora-meeting on the freenode network.
Suggested topics:
#topic New folks introductions and Apprentice tasks.
If any new folks want to give a quick one line bio or any apprentices
would like to ask general questions, they can do so here.
#topic two factor auth status
#topic Staging re-work status
#topic Applications status / discussion
Check in on status of our applications: pkgdb, fas, bodhi, koji,
community, voting, tagger, packager, dpsearch, etc.
If there are new releases, bugs we need to work around, or things to note.
#topic Meeting time
#topic Upcoming Tasks/Items
#info 2012-04-29 to 2012-05-03 - Kevin out in the wilds of NM
#info 2012-05-01 to 2012-05-15 - F17 Final Freeze.
#info 2012-05-01 - nag fi-apprentices.
#info 2012-05-03 - gitweb-cache removal day.
#info 2012-05-09 - Check if puppet works on f17 yet.
#info 2012-05-10 - drop inactive fi-apprentices
#info 2012-05-15 - F17 release
#topic Meeting tagged tickets:
https://fedorahosted.org/fedora-infrastructure/report/10
#topic Open Floor
Submit your agenda items as tickets in the trac instance and send a
note replying to this thread.
More info here:
https://fedoraproject.org/wiki/Infrastructure/Meetings#Meetings
Thanks
kevin
Meeting Agenda Item: Introduction cuiguozheng
by walter
My IRC handle: walter_fedora
These are my skills:
C programming
C++ programming
Linux
I want to learn more C/C++ and Linux development
I would like to work on: #308 Design and deploy mod_security
Meeting Agenda Item: Introduction Nelson Pereira
by Nelson Pereira
Hello Team,
Information as requested:
IRC Handle:
Neldogz
Skills:
I have 8 months of previous experience maintaining 9 RHEL 5.5 servers
within a small corporate data center (user administration, rpm updates /
package management, kernel upgrades, HP PSP updates). I also possess
roughly 10 years of general IT experience (Windows server/client) and
hold various certifications, including the Cisco CCNP.
I completed the RHEL Essentials training and most recently took the
LPIC-1 course from CBT Nuggets. I really enjoy supporting Linux systems
and would like to contribute where possible while continuing to learn
general Linux server/desktop administration.
-Nelson
Hello, I'm new here.
by Domício Medeiros
Hello, people. I hope you are ok.
These are my skills:
- Java Programming (Desktop / Web) - Intermediate
- C Programming - Intermediate
- C++ Programming - Intermediate
- C# Programming (Beginner)
- Shell Script
- SQL
- Linux
- Flex
- Computer Networks
- Software Engineering
and I want to learn:
- Python
and I want to learn more C, C++, Linux Development and what will come. =)