PROBLEM alert - Host fas03 is DOWN

Sun Sep 12 16:21:15 UTC 2010

On Sun, 2010-09-12 at 10:12 -0600, Stephen John Smoogen wrote:
> On Sun, Sep 12, 2010 at 09:46, Jon Masters <jonathan at jonmasters.org> wrote:
> > On Sat, 2010-09-11 at 11:40 -0500, Mike McGrath wrote:
> >> On Sat, 11 Sep 2010, Jon Masters wrote:
> >>
> >> > On Sat, 2010-09-11 at 02:51 -0400, Jon Masters wrote:
> >> > > On Fri, 2010-09-10 at 19:24 -0600, Stephen John Smoogen wrote:
> >> > >
> >> > > > Sep 11 01:10:23 fas03 kernel: WARNING: at block/blk-core.c:338
> >> > >
> >> > > > Sep 11 01:10:23 fas03 kernel: [<c044fc97>] ? warn_slowpath_common+0x77/0xb0
> >> > > > Sep 11 01:10:23 fas03 kernel: [<c05ca5dc>] ? blk_start_queue+0x6c/0x70
> >> > > > Sep 11 01:10:23 fas03 kernel: [<c044fce3>] ? warn_slowpath_null+0x13/0x20
> >> > > > Sep 11 01:10:23 fas03 kernel: [<c05ca5dc>] ? blk_start_queue+0x6c/0x70
> >> > > > Sep 11 01:10:23 fas03 kernel: [<ed63896b>] ? kick_pending_request_queues+0x1b/0x30 [xen_blkfront]
> >> > > > Sep 11 01:10:23 fas03 kernel: [<ed638b80>] ? blkif_interrupt+0x200/0x220 [xen_blkfront]
> >> > > > Sep 11 01:10:23 fas03 kernel: [<c04ad7c5>] ? handle_IRQ_event+0x45/0x140
> >> > >
> >> > > The code at block/blk-core.c:338 contains an explicit check to ensure that
> >> > > interrupts have been disabled, but this is not true here, since
> >> > > blkif_interrupt is not registered with IRQF_DISABLED set at the time of
> >> > > the setup in bind_evtchn_to_irqhandler. Thus it might be that interrupts
> >> > > are still on when we get to kick_pending_request_queues. Does this
> >> > > always happen?
> >> > >
> >> > > This perhaps happened because upstream removed IRQF_DISABLED and now
> >> > > runs with interrupts disabled in handle_IRQ_event, so Xen won't see
> >> > > this. But on 2.6.32 this change had not yet happened. It's also 2:50am
> >> > > and I might be reading this wrong, but I at least suggest you open a
> >> > > RHEL6 bug and try a more recent kernel build.
> >> >
> >> > Ah, of course I shouldn't email before bed. There's an obvious giant
> >> > spin_lock_irqsave/restore there, but as noted on xen-devel (when they
> >> > were mulling over moving all of the blkif_interrupt bits into a tasklet
> >> > just a couple of weeks ago): "It looks like __blk_end_request_all...is
> >> > returning with interrupts enabled sometimes". I pinged some folks.
> >> >
> >>
> >> Just so everyone else knows, I've set kernel.panic to 10 on these hosts so
> >> at least they'll reboot when they panic.  Hopefully we can avoid a few
> >> wake-and-reboot issues like we had last night :-/
> >
> > Mike, is there any chance you could boot the -debug kernel on one of
> > these affected systems? Also, can you let us know about the host?
> >
> 
> kernel.panic set to 10 did not reboot the systems. What and where is a
> debug kernel?

I'm not sure where you get them externally. But internally, if you go to
brewweb.devel you will see for the kernel package that there are
variants like "kernel-debug". Please install that one, since it has lots
of extra debugging options turned on. It'll run more slowly, but I doubt
it will be noticeable (and the system is already crashing, so...).

Then make sure you have all of the logs going somewhere useful. Do you
have a (virtual) serial console set up to capture the panic output, from
which you could also capture kernel messages if you set the console
loglevel appropriately? Do you have the ability to install another guest
on the host system that could be used for debugging this problem
(assuming it is always reproducible)?
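
As a rough illustration of what "set the console loglevel appropriately"
means in practice, here is a minimal sketch that reads the two relevant
sysctls from /proc on the guest. It assumes Python is available there; the
helper names (parse_printk, console_shows, report) are made up for this
example and are not part of any existing tool.

```python
from pathlib import Path

# Order of the four integers in /proc/sys/kernel/printk.
PRINTK_FIELDS = (
    "console_loglevel",
    "default_message_loglevel",
    "minimum_console_loglevel",
    "default_console_loglevel",
)

KERN_WARNING = 4  # priority at which kernel WARNING traces are logged


def parse_printk(text):
    """Turn the contents of /proc/sys/kernel/printk into a dict."""
    values = [int(v) for v in text.split()]
    if len(values) != len(PRINTK_FIELDS):
        raise ValueError("expected 4 integers, got %d" % len(values))
    return dict(zip(PRINTK_FIELDS, values))


def console_shows(priority, printk):
    """A message reaches the console only if its priority number is
    strictly lower than the current console_loglevel."""
    return priority < printk["console_loglevel"]


def read_sysctl(name):
    """Read a sysctl such as 'kernel.panic' through /proc/sys."""
    return Path("/proc/sys").joinpath(*name.split(".")).read_text()


def report():
    """Print the current settings (hypothetical helper, Linux only)."""
    printk = parse_printk(read_sysctl("kernel.printk"))
    panic_timeout = int(read_sysctl("kernel.panic"))
    print("console_loglevel:", printk["console_loglevel"])
    print("warnings reach console:", console_shows(KERN_WARNING, printk))
    # 0 means the kernel does not reboot on its own after a panic.
    print("kernel.panic (seconds before reboot):", panic_timeout)
```

Calling report() on the guest prints the current values. If warnings are
not reaching the console, raising the first kernel.printk value (for
example to 8) lets everything through to the serial console.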

Also, please do give me some info on the host system, etc. I am not
necessarily going to have time to fix this myself, but I am attempting
to ensure that all of the necessary data is at least available tomorrow.

Jon.