Apache webserver outage - need help with forensics

Fri Apr 15 08:06:30 UTC 2005

On 4/14/05, Bob Brennan <rbrennan96 at gmail.com> wrote:
> On 4/14/05, replies-lists-redhat at listmail.innovate.net
> <replies-lists-redhat at listmail.innovate.net> wrote:
> >
> > i haven't been following this in great detail, so this may have been
> > mentioned already.
> >
> > if there are issues with high machine load http connections won't
> > close. when that happens you'll hit the maxclients level and your
> > http server will stop accepting connections.
> >
> > if you (as appears to be the case) aren't monitoring things like your
> > machine load (yet) you can look in your /var/log/maillog file for
> > high load hints during this incident. sendmail (but not postfix) will
> > stop accepting mail when the load gets above a certain point (default
> > is 12 i believe). when this happens it writes that to the maillog
> > file.
> 
> I suspected this too since it is a somewhat RAM-poor machine and I had
> just started up SpamAssassin, which is a known resource hog.
> 
> I can see all emails that went through plus those that got
> spam-bucketed, not a lot (single digits per hour) and no problems
> recorded. Mail volume was down for those 20 minutes only because most
> users depend on squirrelmail which was obviously down at the time.
> Test emails from dnsCheck (instigated by me) did get recorded during
> the outage.
> 
> If sendmail does in fact log a high load condition then that rules it
> out since there is no record.
> 
> > that will give you info on whether the issue was high load. if it
> > was, then you should set up some scripts that do monitoring so that
> > you can pin point the underlying issue(s).
> >
> > for monitoring, vmstat, uptime (for the load numbers) and top (in
> > batchmode - kicking in only when the load hits some threshold) are
> > all very useful.
> >
> >    - Rick
> >
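
For reference, here is a rough sketch of the kind of monitoring Rick
describes: check the load and, only when it is above a threshold, save
the output of uptime, vmstat and top in batch mode. The threshold, the
log path and the function names are my own assumptions, not anything
from a real setup, so treat it as a starting point only.

```python
"""Append a process snapshot to a log when the 1-minute load is high."""
import os
import subprocess
import time

LOAD_THRESHOLD = 4.0                      # assumed value; tune for the machine
LOG_PATH = "/var/tmp/load-snapshots.log"  # hypothetical location

# Tools mentioned in the thread, with standard procps options.
SNAPSHOT_COMMANDS = [
    ["uptime"],
    ["vmstat", "1", "3"],      # three samples, one second apart
    ["top", "-b", "-n", "1"],  # batch mode, single iteration
]


def load_exceeded(load_1min, threshold=LOAD_THRESHOLD):
    """Return True when the 1-minute load average is above the threshold."""
    return load_1min > threshold


def take_snapshot(commands=SNAPSHOT_COMMANDS):
    """Run each command and return the collected output as one text block."""
    parts = [time.strftime("=== %Y-%m-%d %H:%M:%S ===")]
    for cmd in commands:
        header = "$ " + " ".join(cmd)
        try:
            result = subprocess.run(
                cmd, capture_output=True, text=True, timeout=30
            )
            parts.append(header + "\n" + result.stdout)
        except (OSError, subprocess.TimeoutExpired) as exc:
            # A missing or hung tool should not stop the other commands.
            parts.append(header + "\n(failed: %s)" % exc)
    return "\n".join(parts) + "\n"


def check_once(log_path=LOG_PATH, threshold=LOAD_THRESHOLD):
    """Take one reading; write a snapshot only if the load is too high."""
    load_1min = os.getloadavg()[0]
    if load_exceeded(load_1min, threshold):
        with open(log_path, "a") as log:
            log.write(take_snapshot())
        return True
    return False
```

Calling check_once() from a cron job every minute would leave a record
of what was running the next time the load climbs, without filling the
disk during normal operation.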

Update - my daily logwatch recorded a typical number of TCP/IP packets
that day, around 800-900, which is a normal load, and no large numbers
from any single IP. So it is not likely that this was any kind of DoS
attack.

We've ruled out a lot of things here, but still no clue as to what
actually happened. I would lean towards a hardware failure but the
machine has been on 24/7 for many months without a hiccup, other
services were running normally, and httpd recovered 100% after 20
minutes all by itself. Or maybe it never really went away and was just
unreachable? Could Firestarter be a possibility (it has been installed
for about 1 month)?
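
To double-check the two log-based theories from this thread (httpd
hitting maxclients, and sendmail noting high load in the maillog),
something like the sketch below could scan the logs for the outage
window. The patterns are assumptions on my part: I am only matching on
"MaxClients" and on "load average" in sendmail lines, since the exact
wording of the messages may differ between versions.

```python
"""Scan log lines for hints of MaxClients exhaustion or high load."""
import re

# Loose patterns; the exact log wording is an assumption.
HINTS = {
    "apache_maxclients": re.compile(r"MaxClients", re.IGNORECASE),
    "sendmail_high_load": re.compile(r"sendmail.*load average", re.IGNORECASE),
}


def find_hints(lines, hints=HINTS):
    """Return (hint_name, line) pairs for every line matching a pattern."""
    matches = []
    for line in lines:
        for name, pattern in hints.items():
            if pattern.search(line):
                matches.append((name, line.rstrip("\n")))
    return matches


def scan_file(path, hints=HINTS):
    """Scan one log file, replacing any bytes that fail to decode."""
    with open(path, errors="replace") as handle:
        return find_hints(handle, hints)
```

For example, scan_file("/var/log/maillog") and the same call on the
Apache error log (its location depends on the install) would list any
matching lines; an empty result for the outage window would support
ruling both theories out.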

Thanks in advance for any more ideas,
bob