Kernel-3.1 Crash

Thu Oct 27 19:25:43 UTC 2011

On Thu, Oct 27, 2011 at 03:20:51PM -0400, Jeff Moyer wrote:
> Don Zickus <dzickus at redhat.com> writes:
> 
> > On Thu, Oct 27, 2011 at 02:43:22PM -0400, Jeff Moyer wrote:
> >> >> This doesn't look like the same problem.  Here we've got BUG: scheduling
> >> >> while atomic.  If it was the bug fixed by the above commits, then you
> >> >> would hit a BUG_ON.  I would start looking at the btrfs bits to see if
> >> >> they're holding any locks in this code path.
> >> >
> >> > Ignore that one and move to IMG_0350.IMG.  'scheduling while atomic' is
> >> > just noise.  Besides Mike and Vivek told me to blame you for not pushing
> >> > Jens harder on these fixes. :-)))))
> >> 
> >> I'm looking at 0355, which shows the very top of the trace, and that
> >> says BUG: scheduling while atomic.  So the problem reported here *is*
> >> different from the one fixed by the above two commits.  In fact, I don't
> >> see evidence of the multipath + flush issue in any of these pictures.
> >
> > You have to ignore the 'schedule while atomic' thing it is just a
> >
> > printk("BUG: scheduling while atomic"), it is _not_ a BUG().  :-)
> > (hint read kernel/sched.c::__schedule_bug)
> >
> > I see those messages all the time, it really should be a WARN and not a
> > misleading BUG, but whatever. 
> >
> > His machine died because the NMI watchdog detected a lockup.  The lockup
> > was because in blk_insert_cloned_request(), spin_lock_irqsave disabled
> > interrupts and spun forever waiting on the q->queue_lock (IMG_0350.JPG).
> >
> > Mike and Vivek both said that is what you fixed for 3.2.  They also said
> > the only caller of blk_insert_cloned_request() is multipath, hence that
> > argument.  I'll cc them.  Or maybe I can have them walk over to your cube.
> > :-)
> 
> Well then they know more than I do.  The bug I fixed would not result in
> infinite spinning on the queue lock.  It resulted in a BUG_ON in
> blk_insert_flush, since req->bio was NULL.  So again, I really don't see
> how this is related.  We could put this all to rest by asking the victim
> to try out those two patches.

Sorry for the confusion here. We saw blk_insert_cloned_request() in the
trace and thought it could be related to your fixes. We did not think
about the exact symptom of the problem in your case. So you are right:
here we are spinning on the spinlock indefinitely, whereas your patch
fixed the BUG_ON(). So maybe it is a different issue.

Thanks
Vivek