On Thu, Oct 27, 2011 at 02:43:22PM -0400, Jeff Moyer wrote:
>>> This doesn't look like the same problem. Here we've got BUG: scheduling
>>> while atomic. If it was the bug fixed by the above commits, then you
>>> would hit a BUG_ON. I would start looking at the btrfs bits to see if
>>> they're holding any locks in this code path.
>>
>> Ignore that one and move to IMG_0350.IMG. 'scheduling while atomic' is
>> just noise. Besides Mike and Vivek told me to blame you for not pushing
>> Jens harder on these fixes. :-)))))
>
> I'm looking at 0355, which shows the very top of the trace, and that
> says BUG: scheduling while atomic. So the problem reported here *is*
> different from the one fixed by the above two commits. In fact, I don't
> see evidence of the multipath + flush issue in any of these pictures.

You have to ignore the 'scheduling while atomic' thing. It is just a
printk("BUG: scheduling while atomic"); it is _not_ a BUG(). :-)
(Hint: read __schedule_bug() in kernel/sched.c.)

I see those messages all the time. It really should be a WARN and not a
misleading BUG, but whatever.
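
To make the distinction concrete, here is a rough userspace sketch. It
is not kernel code, and both function names are made up for
illustration. One function only logs a line that starts with "BUG:" and
returns, the other stops execution the way a real BUG()/BUG_ON() would.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for a report such as __schedule_bug(): it only
 * logs a line that starts with "BUG:" and then returns to the caller. */
static int report_sched_bug(void)
{
    fprintf(stderr, "BUG: scheduling while atomic\n");
    return 0;               /* execution continues after the message */
}

/* Hypothetical stand-in for BUG_ON(): if cond is true it never returns. */
static void fatal_bug_on(int cond)
{
    if (cond) {
        fprintf(stderr, "fatal BUG, stopping here\n");
        abort();
    }
}
```

A caller of report_sched_bug() keeps running after the message is
printed, whereas fatal_bug_on(1) ends the program on the spot. That is
the difference between the message in the trace and an actual BUG_ON.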

His machine died because the NMI watchdog detected a lockup. The lockup
happened because, in blk_insert_cloned_request(), spin_lock_irqsave()
disabled interrupts and then spun forever waiting on q->queue_lock
(IMG_0350.JPG).

Mike and Vivek both said that is what you fixed for 3.2. They also said
the only caller of blk_insert_cloned_request() is multipath, hence that
argument. I'll cc them. Or maybe I can have them walk over to your cube.
:-)
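
For anyone who wants to see the shape of that lockup, below is a
minimal userspace model using C11 atomics. It is only a sketch: the
toy_lock type and its helpers are made up and are not the kernel's
spinlock_t, and it says nothing about who was actually holding
q->queue_lock. The spin is bounded so the example terminates; giving up
after max_spins plays the role of the NMI watchdog noticing that the
CPU is stuck.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace model of a spinlock. */
struct toy_lock {
    atomic_flag held;
};

/* Try to take the lock, spinning at most max_spins times.
 * Returns true if the lock was acquired, false if we gave up. */
static bool toy_lock_bounded(struct toy_lock *l, unsigned long max_spins)
{
    for (unsigned long i = 0; i < max_spins; i++) {
        /* test_and_set returns the previous value: false means
         * the lock was free and is now ours. */
        if (!atomic_flag_test_and_set(&l->held))
            return true;
    }
    return false;   /* still held: an unbounded spin would hang here */
}

/* Release the lock. */
static void toy_unlock(struct toy_lock *l)
{
    atomic_flag_clear(&l->held);
}
```

If the lock is already held and nothing ever calls toy_unlock(), every
attempt fails no matter how large max_spins is. With interrupts off on
that CPU, as after spin_lock_irqsave(), nothing else gets to run there
either, so only the NMI watchdog is left to report it.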

Cheers,
Don