Strange week last week: many of you noticed a bunch of outages reported by nagios, so I
thought I'd send a roundup of what happened.
1) The big one seems to have been a corrupt database table. For some
reason, running a vacuum on a table (which was only 66M) was taking a
long time, and even after it finished the disks would sometimes thrash for
another 10 minutes. This caused outages in lots of our systems,
like the account system, on which other systems depend. The job runs
hourly, so that's why it kept happening.
We were able to reproduce this on another host but never quite figured out
what was going on. A dump, drop, and restore fixed the issue, and we
haven't had time to revisit it since; it just hasn't happened again.
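For reference, the fix was roughly along these lines (a sketch assuming this
is a PostgreSQL database; the table and database names below are made up,
not the real ones):

    pg_dump -t accounts fedora_db > accounts.sql   # dump just the suspect table
    psql fedora_db -c 'DROP TABLE accounts'        # drop the (possibly corrupt) table
    psql fedora_db < accounts.sql                  # recreate it from the dump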
2) Strange network issues towards the end of the week. It seems our round-trip
time to Server Beach went up, causing nagios to flag some hosts as dead.
I haven't had time to look into this yet either. The network seems fine now
and I don't think we're seeing any functional issues from it, but it was
different.
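For context, nagios flags a host as dead when its ping check crosses the
critical round-trip-time or packet-loss threshold; something like this (a
sketch of a standard check_ping invocation with a made-up host name and
thresholds, not our actual config):

    check_ping -H sb1.example.com -w 250.0,20% -c 500.0,60% -p 5
    # WARNING above 250 ms RTT or 20% loss, CRITICAL above 500 ms or 60% loss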
3) pkgdb's home page started taking longer to load, which caused our balancer
to flag it as dead and start throwing 503s. We only recently moved it to
haproxy, so this could be normal behavior that we just hadn't seen before.
I've raised the allowed response time for the front page from 2 seconds to 5.
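For those curious, the change was roughly of this shape (a sketch of an
haproxy health-check timeout, assuming that's the knob involved; the backend
and server names are made up):

    backend pkgdb
        option httpchk GET /
        timeout check 5s      # was 2s; how long a check may take before the server is marked down
        server pkgdb1 10.0.0.10:80 check inter 10s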
-Mike