On Fri, 06 Jul 2012 11:59:13 +0200 Jes Sorensen <Jes.Sorensen@xxxxxxxxxx> wrote:

> NeilBrown <neilb@xxxxxxx> writes:
> > On Tue, 03 Jul 2012 18:07:02 +0200 Jes Sorensen <Jes.Sorensen@xxxxxxxxxx>
> > wrote:
> >
> >> NeilBrown <neilb@xxxxxxx> writes:
> >> > On Mon, 02 Jul 2012 15:24:43 +0200 Jes Sorensen <Jes.Sorensen@xxxxxxxxxx>
> >> > wrote:
> >> >
> >> >> Hi Neil,
> >> >>
> >> >> I am trying to get the test suite stable on RHEL, but I see a lot of
> >> >> failures in 03r5assemV1, in particular between these two cases:
> >> >>
> >> >> mdadm -A $md1 -u $uuid $devlist
> >> >> check state U_U
> >> >> eval $tst
> >> >>
> >> >> mdadm -A $md1 --name=one $devlist
> >> >> check state U_U
> >> >> check spares 1
> >> >> eval $tst
> >> >>
> >> >> I have tested it with the latest upstream kernel as well and see the
> >> >> same problems. I suspect it is simply the box that is too fast, ending
> >> >> up with the raid check completing in between the two test cases?
> >> >>
> >> >> Are you seeing the same thing there? I tried playing with the max speed
> >> >> variable but it doesn't really seem to make any difference.
> >> >>
> >> >> Any ideas for what can be done to make this case more resilient to
> >> >> false positives? I guess one option would be to re-create the array
> >> >> in between each test?
> >> >
> >> > Maybe it really is a bug?
> >> > The test harness sets the resync speed to be very slow. A fast box will get
> >> > through the test more quickly and be more likely to see the array still
> >> > syncing.
> >> >
> >> > I'll try to make time to look more closely.
> >> > But I wouldn't discount the possibility that the second "mdadm -A" is
> >> > short-circuiting the recovery somehow.
> >>
> >> That could certainly explain what I am seeing. I noticed it doesn't
> >> happen every single time in the same place (from memory), but it is
> >> mostly in that spot in my case.
> >>
> >> Even if I trimmed the max speed down to 50 it still happens.
> >
> > I cannot easily reproduce this.
> > Exactly which kernel and which mdadm do you find it with - just to make sure
> > I'm testing the same thing as you?
>
> Hi Neil,
>
> Odd - I see it with
> mdadm: 721b662b5b33830090c220bbb04bf1904d4b7eed
> kernel: ca24a145573124732152daff105ba68cc9a2b545
>
> I've seen this happen for a while fwiw.
>
> Note the box has a number of external drives with a number of my scratch
> raid arrays on it. It shouldn't affect this, but just in case.
>
> The system-installed mdadm is a 3.2.3 derivative, but I checked running
> with PATH=. as well.

Thanks.  I think I figured out what is happening.

It seems that setting the max speed down to 1000 is often enough, but not
always, so we need to set it lower.  But setting max_speed lower is not
effective unless you also set min_speed lower.  This is the tricky bit
that took me way too long to realise.

So with this patch, it is quite reliable.

NeilBrown

diff --git a/tests/03r5assemV1 b/tests/03r5assemV1
index 52b1107..bca0c58 100644
--- a/tests/03r5assemV1
+++ b/tests/03r5assemV1
@@ -60,7 +60,8 @@ eval $tst
 ### Now with a missing device
 # We don't want the recovery to complete while we are
 # messing about here.
-echo 1000 > /proc/sys/dev/raid/speed_limit_max
+echo 100 > /proc/sys/dev/raid/speed_limit_max
+echo 100 > /proc/sys/dev/raid/speed_limit_min
 mdadm -AR $md1 $dev0 $dev2 $dev3 $dev4 # check state U_U
@@ -124,3 +125,4 @@ mdadm -I -c $conf $dev1
 mdadm -I -c $conf $dev2
 eval $tst
 echo 2000 > /proc/sys/dev/raid/speed_limit_max
+echo 1000 > /proc/sys/dev/raid/speed_limit_min
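For reference, the same throttle-then-restore pattern could be factored into a
small pair of helpers if other tests need it.  This is only a sketch using the
values from the patch above; the function names are made up here and are not
something the test harness provides:

# Hypothetical helpers (not part of tests/03r5assemV1); values match the patch.
slow_resync() {
	# md will not resync faster than speed_limit_max, but it also will not
	# throttle below speed_limit_min, so both have to come down together.
	echo 100 > /proc/sys/dev/raid/speed_limit_max
	echo 100 > /proc/sys/dev/raid/speed_limit_min
}

restore_resync() {
	# Restore the values the test suite normally runs with.
	echo 2000 > /proc/sys/dev/raid/speed_limit_max
	echo 1000 > /proc/sys/dev/raid/speed_limit_min
}

# Usage (sketch):
#   slow_resync
#   ... assemble and poke at the degraded array ...
#   restore_resync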