PROBLEM alert - Host fas03 is DOWN
jonathan at jonmasters.org
Sun Sep 12 00:14:58 UTC 2010
On Sat, 2010-09-11 at 17:09 -0600, Stephen John Smoogen wrote:
> On Sat, Sep 11, 2010 at 11:12, Jon Masters <jcm at redhat.com> wrote:
> > On Sat, 2010-09-11 at 11:40 -0500, Mike McGrath wrote:
> >> On Sat, 11 Sep 2010, Jon Masters wrote:
> >> > > The code in block/blk-core:338 contains an explicit check to ensure that
> >> > > interrupts have been disabled, but this is not true since blkif_interrupt
> >> > > is not registered with IRQF_DISABLED set at the time of the setup in
> >> > > bind_evtchn_to_irqhandler. Thus it might be that interrupts are still on
> >> > > when we get to kick_pending_request_queues. Does this always happen?
> >> > >
> >> > > This perhaps happened because upstream removed IRQF_DISABLED and now
> >> > > runs with interrupts disabled in handle_IRQ_event, so Xen won't see
> >> > > this. But on 2.6.32 this change had not yet happened. It's also 2:50am
> >> > > and I might be reading this wrong, but I at least suggest you open a
> >> > > RHEL6 bug and try a more recent kernel build.
> >> > Ah, of course I shouldn't email before bed. There's an obvious giant
> >> > spin_lock_irqsave/restore there, but as noted on xen-devel (when they
> >> > were mulling over moving all of the blkif_interrupt bits into a tasklet
> >> > just a couple of weeks ago): "It looks like __blk_end_request_all...is
> >> > returning with interrupts enabled sometimes". I pinged some folks.
> >> Just so everyone else knows, I've set kernel.panic to 10 on these hosts so
> >> at least they'll reboot when they panic. Hopefully we can avoid a few
> >> wake-and-reboot issues like we had last night :-/
> > I pinged some folks about it last night. I would hope there will be a
> > fix for that soon. I suspect it's reproducible on the 70+ kernels, but
> > can you check that for us and update the BZ?
> I have fas03 on a .71 kernel. Since they seem to occur at the same time
> I have kept the others at older versions to see if it fixes or misses.
> fas02 will reboot into a .71 if it needs to. I haven't done anything
> to fas01 to keep it prime test grounds.
Well, it makes sense that they'd fire at the same time. There's clearly
some underlying IO path that causes the return with interrupts still on
- perhaps an error path, who knows, I will let others poke or find some
time to dig perhaps next week ;)