Re: 4.1-rc6 raid5 OOPS

Neil Brown <neilb@xxxxxxx> writes:
> On Wed, 10 Jun 2015 12:27:35 -0400
> Jes Sorensen <Jes.Sorensen@xxxxxxxxxx> wrote:
>
>> Neil Brown <neilb@xxxxxxx> writes:
>> > On Wed, 10 Jun 2015 10:19:42 +1000 Neil Brown <neilb@xxxxxxx> wrote:
>> >
>> >> So it looks like some sort of race.  I have other evidence of a race
>> >> with the resync/reshape thread starting/stopping.  If I track that
>> >> down it'll probably fix this issue too.
>> >
>> > I think I have found just such a race.  If you request a reshape just
>> > as a recovery completes, you can end up with two reshapes running.
>> > This causes confusion :-)
>> >
>> > Can you try this patch?  If I can remember how to reproduce my race
>> > I'll test it on that too.
>> >
>> > Thanks,
>> > NeilBrown
>> 
>> Hi Neil,
>> 
>> Thanks for the patch - I tried with this applied, but it still crashed
>> for me :( I had to apply it manually; somehow it got mangled in the
>> email.
>
> Very :-(
>
> I had high hopes for that patch.  I cannot find anything else that could lead
> to what you are seeing.  I wish I could reproduce it but it is probably highly
> sensitive to timing so some hardware shows it and others don't.
>
> It looks very much like two 'resync' threads are running at the same time.
> When one finishes, it sets ->reshape_progress to -1 (MaxSector), which trips up
> the other one.
>
> In the hang that I very rarely see, one thread (presumably) finishes and sets
> MD_RECOVERY_DONE, so the raid5d thread waits for the resync thread to
> complete, and that thread is waiting for the raid5d to retire some stripe_heads.
>
> ... though the 'resync' thread is probably actually doing a 'reshape'...

Neil

Good news - albeit not guaranteed yet. I tried with the full patch that
you sent to Linus, and with that I haven't been able to reproduce the
problem so far. I'll try and do some more testing over the weekend.

The patch I applied manually only had two hunks in it; the one you
pushed to Linus looks a lot more complete :)
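
Just to make sure I'm reading the race the same way you are, here is how I
picture the interaction: two sync threads sharing one progress value, with
whichever finishes first publishing the MaxSector sentinel under the other's
feet. The toy below is only a userspace sketch of that idea (pthreads,
made-up names, a plain shared word standing in for ->reshape_progress), not
md code, and like the real thing it only trips when the timing cooperates:

/*
 * Userspace toy of the suspected race (NOT md code): two "reshape"
 * workers share one progress word.  Whichever finishes first publishes
 * the MaxSector sentinel, which the still-running worker must never
 * observe.  Whether the check actually fires is timing dependent,
 * much like the real report.
 *
 * Build: cc -std=c11 -pthread reshape-race.c -o reshape-race
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

#define MAX_SECTOR (~0ULL)              /* sentinel: "no reshape active" */
#define TOTAL      (1ULL << 20)

static _Atomic unsigned long long reshape_progress;    /* 0 = reshape at start */

static void *reshape_worker(void *arg)
{
        long id = (long)arg;

        for (unsigned long long s = 0; s < TOTAL; s += 256) {
                /* Mid-reshape, nobody should ever see the sentinel... */
                if (reshape_progress == MAX_SECTOR) {
                        fprintf(stderr, "worker %ld: MaxSector seen mid-reshape "
                                "at %llu - the other sync thread finished "
                                "under us\n", id, s);
                        abort();
                }
                reshape_progress = s;   /* advance "our" reshape */
        }

        reshape_progress = MAX_SECTOR;  /* reshape done: publish the sentinel */
        printf("worker %ld finished\n", id);
        return NULL;
}

int main(void)
{
        pthread_t a, b;

        /* Two concurrent sync threads - the situation that should never exist. */
        pthread_create(&a, NULL, reshape_worker, (void *)1L);
        pthread_create(&b, NULL, reshape_worker, (void *)2L);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        printf("no trip this run (it is timing dependent)\n");
        return 0;
}

If that matches what you have in mind, then whatever lets the second thread
come into existence is the real bug; the MaxSector write is just where it
becomes visible.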

> Did you get a chance to bisect it?  I must admit that I doubt that would be
> useful.  It probably started when "md_start_sync" was introduced, and may have
> been made worse when some locking with mddev_lock was relaxed.
>
> The only way I can see a race is if MD_RECOVERY_DONE gets left set when a new
> thread is started.  md_check_recovery always clears it before starting a thread,
> but raid5_start_reshape doesn't - or didn't before the patch I gave you.
>
> It might make more sense to clear the bit in md_reap_sync_thread as below,
> but if the first patch didn't work, this one is unlikely to.
>
> Would you be able to test with the following patch?  There is a chance it might
> confirm whether two sync threads are running at the same time.

I can try with this patch too, but I won't get to it before next
week. It's been a week of unrelated MD issues.
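
In the meantime, to check that I follow the MD_RECOVERY_DONE part: my reading
is that the hazard is a stale "done" bit surviving into the next sync thread,
roughly as in the toy below. It is just a sketch of the idea - userspace, with
made-up flag and function names standing in for mddev->recovery,
md_check_recovery and md_reap_sync_thread - not your actual patch:

/*
 * Userspace toy of the MD_RECOVERY_DONE concern (NOT md code; the flag
 * and function names are made-up stand-ins for mddev->recovery,
 * md_check_recovery, md_reap_sync_thread, raid5_start_reshape).  It
 * only shows the hazard: if the "done" bit left by the previous sync
 * thread is still set when a new one is started, the daemon mistakes
 * the brand-new thread for a finished one and reaps it.
 *
 * Build: cc -std=c11 done-bit.c -o done-bit
 */
#include <stdio.h>

enum {
        RECOVERY_RUNNING = 1 << 0,      /* stand-in for MD_RECOVERY_RUNNING */
        RECOVERY_DONE    = 1 << 1,      /* stand-in for MD_RECOVERY_DONE    */
};

static unsigned long recovery;          /* stand-in for mddev->recovery */

static void reap_sync_thread(void)
{
        printf("daemon: reaping the current sync thread\n");
        recovery &= ~RECOVERY_RUNNING;
        recovery &= ~RECOVERY_DONE;     /* the clear being discussed */
}

static void daemon_check(void)
{
        /* the daemon treats DONE as "the running sync thread has finished" */
        if (recovery & RECOVERY_DONE)
                reap_sync_thread();
}

static void start_reshape(int clear_stale_done)
{
        if (clear_stale_done)
                recovery &= ~RECOVERY_DONE;     /* what md_check_recovery does */
        else if (recovery & RECOVERY_DONE)
                printf("reshape: starting with a stale DONE bit set\n");
        recovery |= RECOVERY_RUNNING;           /* new sync thread "running" */
}

int main(void)
{
        /* Previous sync thread finished and set DONE, but was never reaped. */
        recovery = RECOVERY_DONE;

        start_reshape(0);       /* reshape requested, stale bit not cleared     */
        daemon_check();         /* daemon sees DONE, reaps the brand-new thread */

        printf("flags afterwards: %#lx (new thread gone)\n", recovery);
        return 0;
}

At least in this toy, clearing the stale bit either when the old thread is
reaped or before the new one is started stops the daemon from mistaking the
brand-new thread for a finished one.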

Thanks a lot!

Cheers,
Jes