Re: raid5:md3: kernel BUG , followed by , Silent halt .

"Dan Williams" <dan.j.williams@xxxxxxxxx> · Mon, 27 Aug 2007 15:11:00 -0700

On 8/25/07, Mr. James W. Laferriere <babydr@xxxxxxxxxxxxxxxx> wrote:
>         Hello Dan ,
>
> On Mon, 20 Aug 2007, Dan Williams wrote:
> > On 8/18/07, Mr. James W. Laferriere <babydr@xxxxxxxxxxxxxxxx> wrote:
> >>         Hello All ,  Here we go again .  Again attempting to do bonnie++ testing
> >> on a small array .
> >>         Kernel 2.6.22.1
> >>         Patches involved ,
> >>         IOP1 ,  2.6.22.1-iop1 for improved sequential write performance
> >> (stripe-queue) ,  Dan Williams <dan.j.williams@xxxxxxxxx>
> >
> > Hello James,
> >
> > Thanks for the report.
> >
> > I tried to reproduce this on my system, no luck.
>         Possibly because there are significant hardware differences ?
>         See 'lspci -v' below .sig .
>
> > However it looks
> > like there is a potential race between 'handle_queue' and
> > 'add_queue_bio'.  The attached patch moves these critical sections
> > under spin_lock(&sq->lock), and adds some debugging output if this BUG
> > triggers.  It also includes a fix for retry_aligned_read which is
> > unrelated to this debug.
> > --
> > Dan
>         Applied your patch .  The same 'kernel BUG at drivers/md/raid5.c:3689!'
> messages appear (see attached) .  The system is still responsive with your
> patch ,  the kernel crashed last time .  Tho the bonnie++ run is stuck in 'D' .
> And doing a '> /md3/asdf'  stays hung even after passing the parent process a
> 'kill -9' .
>         Any further info You can think of that I can/should gather ,  I will try to acquire
> .  But I'll have to repeat these steps to attempt to get the same results .
> I'll be shutting the system down after sending this off .
>         Fyi ,  the previous 'BUG' without your patch was quite repeatable .
>         I might have time over the next couple of weeks to be able to see if it
> is as repeatable as the last one .
>
>         Contents of /proc/mdstat for md3 .
>
> md3 : active raid6 sdx1[3] sdw1[2] sdv1[1] sdu1[0] sdt1[7](S) sds1[6] sdr1[5] sdq1[4]
>        717378560 blocks level 6, 1024k chunk, algorithm 2 [7/7] [UUUUUUU]
>        bitmap: 2/137 pages [8KB], 512KB chunk
>
>         Commands I ran that led to the 'BUG' .
>
> bonniemd3() { /root/bonnie++-1.03a/bonnie++  -u0:0  -d /md3  -s 131072  -f; }
> bonniemd3 > 131072MB-bonnie++-run-md3-xfs.log-20070825 2>&1 &
>
Ok, the 'bitmap' and 'raid6' details were the missing pieces of my
testing.  I can now reproduce this bug in handle_queue.  I'll keep you
posted on what I find.

Thank you for tracking this.

Regards,
Dan