On Fri, 06 Jul 2012 11:59:13 +0200 Jes Sorensen <Jes.Sorensen@xxxxxxxxxx> wrote:

> NeilBrown <neilb@xxxxxxx> writes:
> > On Tue, 03 Jul 2012 18:07:02 +0200 Jes Sorensen <Jes.Sorensen@xxxxxxxxxx>
> > wrote:
> >
> >> NeilBrown <neilb@xxxxxxx> writes:
> >> > On Mon, 02 Jul 2012 15:24:43 +0200 Jes Sorensen <Jes.Sorensen@xxxxxxxxxx>
> >> > wrote:
> >> >
> >> >> Hi Neil,
> >> >>
> >> >> I am trying to get the test suite stable on RHEL, but I see a lot of
> >> >> failures in 03r5assemV1, in particular between these two cases:
> >> >>
> >> >> mdadm -A $md1 -u $uuid $devlist
> >> >> check state U_U
> >> >> eval $tst
> >> >>
> >> >> mdadm -A $md1 --name=one $devlist
> >> >> check state U_U
> >> >> check spares 1
> >> >> eval $tst
> >> >>
> >> >> I have tested it with the latest upstream kernel as well and see the
> >> >> same problems. I suspect it is simply the box that is too fast, ending
> >> >> up with the raid check completing in between the two test cases?
> >> >>
> >> >> Are you seeing the same thing there? I tried playing with the max speed
> >> >> variable but it doesn't really seem to make any difference.
> >> >>
> >> >> Any ideas for what can be done to make this case more resilient to
> >> >> false positives? I guess one option would be to re-create the array
> >> >> in between each test?
> >> >
> >> > Maybe it really is a bug?
> >> > The test harness sets the resync speed to be very slow. A fast box will get
> >> > through the test more quickly and be more likely to see the array still
> >> > syncing.
> >> >
> >> > I'll try to make time to look more closely.
> >> > But I wouldn't discount the possibility that the second "mdadm -A" is
> >> > short-circuiting the recovery somehow.
> >>
> >> That could certainly explain what I am seeing. I noticed it doesn't
> >> happen every single time in the same place (from memory), but it is
> >> mostly in that spot in my case.
> >>
> >> Even if I trimmed the max speed down to 50 it still happens.
> >
> > I cannot easily reproduce this.
> > Exactly which kernel and which mdadm do you find it with - just to make sure
> > I'm testing the same thing as you?
>
> Hi Neil,
>
> Odd - I see it with
> mdadm: 721b662b5b33830090c220bbb04bf1904d4b7eed
> kernel: ca24a145573124732152daff105ba68cc9a2b545
>
> I've seen this happen for a while fwiw.
>
> Note the box has a number of external drives with a number of my scratch
> raid arrays on it. It shouldn't affect this, but just in case.
>
> The system-installed mdadm is a 3.2.3 derivative, but I checked running
> with PATH=. as well.

Thanks.  I think I figured out what is happening.

It seems that setting the max speed down to 1000 is often enough, but not
always, so we need to set it lower.  But setting max_speed lower is not
effective unless you also set min_speed lower.  This is the tricky bit
that took me way too long to realise.

So with this patch, it is quite reliable.

NeilBrown

diff --git a/tests/03r5assemV1 b/tests/03r5assemV1
index 52b1107..bca0c58 100644
--- a/tests/03r5assemV1
+++ b/tests/03r5assemV1
@@ -60,7 +60,8 @@ eval $tst
 ### Now with a missing device
 # We don't want the recovery to complete while we are
 # messing about here.
-echo 1000 > /proc/sys/dev/raid/speed_limit_max
+echo 100 > /proc/sys/dev/raid/speed_limit_max
+echo 100 > /proc/sys/dev/raid/speed_limit_min
 mdadm -AR $md1 $dev0 $dev2 $dev3 $dev4 # check state U_U
@@ -124,3 +125,4 @@ mdadm -I -c $conf $dev1
 mdadm -I -c $conf $dev2
 eval $tst
 echo 2000 > /proc/sys/dev/raid/speed_limit_max
+echo 1000 > /proc/sys/dev/raid/speed_limit_min
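For reference, the same throttle-then-restore pattern could be factored into a
small pair of helpers if other tests need it.  This is only a sketch using the
values from the patch above; the function names are made up here and are not
something the test harness provides:

# Hypothetical helpers (not part of tests/03r5assemV1); values match the patch.
slow_resync() {
	# md will not resync faster than speed_limit_max, but it also will not
	# throttle below speed_limit_min, so both have to come down together.
	echo 100 > /proc/sys/dev/raid/speed_limit_max
	echo 100 > /proc/sys/dev/raid/speed_limit_min
}

restore_resync() {
	# Restore the values the test suite normally runs with.
	echo 2000 > /proc/sys/dev/raid/speed_limit_max
	echo 1000 > /proc/sys/dev/raid/speed_limit_min
}

# Usage (sketch):
#   slow_resync
#   ... assemble and poke at the degraded array ...
#   restore_resync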