Re: raid5:md3: kernel BUG , followed by , Silent halt .

"Mr. James W. Laferriere" <babydr@xxxxxxxxxxxxxxxx> · Wed, 5 Sep 2007 21:45:57 -0700 (PDT)

	Hello Dan ,
On Mon, 27 Aug 2007, Dan Williams wrote:
On 8/25/07, Mr. James W. Laferriere <babydr@xxxxxxxxxxxxxxxx> wrote:
On Mon, 20 Aug 2007, Dan Williams wrote:
On 8/18/07, Mr. James W. Laferriere <babydr@xxxxxxxxxxxxxxxx> wrote:
        Hello All ,  Here we go again .  Again attempting to do bonnie++ testing
on a small array .
        Kernel 2.6.22.1
        Patches involved ,
        IOP1 ,  2.6.22.1-iop1 for improved sequential write performance
(stripe-queue) ,  Dan Williams <dan.j.williams@xxxxxxxxx>

Hello James,

Thanks for the report.

I tried to reproduce this on my system, no luck.
        Possibly because there are significant hardware differences ?
        See 'lspci -v' below .sig .

However it looks
like there is a potential race between 'handle_queue' and
'add_queue_bio'.  The attached patch moves these critical sections
under spin_lock(&sq->lock), and adds some debugging output if this BUG
triggers.  It also includes a fix for retry_aligned_read which is
unrelated to this debug.
--
Dan
        Applied your patch .  The same 'kernel BUG at drivers/md/raid5.c:3689!'
messages appear (see attached) .  The system is still responsive with your
patch ,  the kernel crashed last time .  Tho the bonnie++ run is stuck in 'D' .
And doing a '> /md3/asdf'  stays hung even after passing the parent process a
'kill -9' .
        Any further info You can think of that I can/should provide ,  I will try
to acquire .  But I'll have to repeat these steps to attempt to get the same results .
I'll be shutting the system down after sending this off .
        Fyi ,  the previous 'BUG' without your patch was quite repeatable .
        I might have time over the next couple of weeks to be able to see if it
is as repeatable as the last one .
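
        To gather more detail on tasks stuck in 'D' state , like the bonnie++
run above ,  something along these lines might help .  A minimal sketch only :
the function names filter_dstate and list_dstate are made up for illustration ,
and it assumes a procps-style 'ps' that accepts '-eo pid,stat,wchan:32,comm' .

```shell
# Hypothetical helper: keep the header row plus any row whose STAT column
# (2nd field) starts with "D", i.e. uninterruptible sleep.
filter_dstate() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# Hypothetical helper: list PID, state, kernel wait channel and command name
# for every process, then keep only the D-state ones.
# Assumes procps 'ps'; the wchan column shows where in the kernel a task sleeps.
list_dstate() {
    ps -eo pid,stat,wchan:32,comm | filter_dstate
}
```

Running 'list_dstate' while the '> /md3/asdf' is hung should show which kernel
function each stuck task is waiting in ,  which could be sent along with the
'BUG' output .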

        Contents of /proc/mdstat for md3 .

md3 : active raid6 sdx1[3] sdw1[2] sdv1[1] sdu1[0] sdt1[7](S) sds1[6] sdr1[5] sdq1[4]
       717378560 blocks level 6, 1024k chunk, algorithm 2 [7/7] [UUUUUUU]
       bitmap: 2/137 pages [8KB], 512KB chunk

        Commands I ran that led to the 'BUG' .

bonniemd3() { /root/bonnie++-1.03a/bonnie++  -u0:0  -d /md3  -s 131072  -f; }
bonniemd3 > 131072MB-bonnie++-run-md3-xfs.log-20070825 2>&1 &

Ok, the 'bitmap' and 'raid6' details were the missing pieces of my
testing.  I can now reproduce this bug in handle_queue.  I'll keep you
posted on what I find.

Thank you for tracking this.
Regards,

	You said to watch here & I have .
	Is there any hope of digging this out ?
	Anything further I can provide ?  Please just say so .
		Tia ,  JimL