On Fri, 12 Jun 2015 17:52:58 -0400 Jes Sorensen <Jes.Sorensen@xxxxxxxxxx> wrote:

> Neil Brown <neilb@xxxxxxx> writes:
> > On Wed, 10 Jun 2015 12:27:35 -0400
> > Jes Sorensen <Jes.Sorensen@xxxxxxxxxx> wrote:
> >
> >> Neil Brown <neilb@xxxxxxx> writes:
> >> > On Wed, 10 Jun 2015 10:19:42 +1000 Neil Brown <neilb@xxxxxxx> wrote:
> >> >
> >> >> So it looks like some sort of race.  I have other evidence of a race
> >> >> with the resync/reshape thread starting/stopping.  If I track that
> >> >> down it'll probably fix this issue too.
> >> >
> >> > I think I have found just such a race.  If you request a reshape just
> >> > as a recovery completes, you can end up with two reshapes running.
> >> > This causes confusion :-)
> >> >
> >> > Can you try this patch?  If I can remember how to reproduce my race
> >> > I'll test it on that too.
> >> >
> >> > Thanks,
> >> > NeilBrown
> >>
> >> Hi Neil,
> >>
> >> Thanks for the patch - I tried with this applied, but it still crashed
> >> for me :(  I had to mangle it manually; somehow it got modified in the
> >> email.
> >
> > Very :-(
> >
> > I had high hopes for that patch.  I cannot find anything else that could
> > lead to what you are seeing.  I wish I could reproduce it, but it is
> > probably highly sensitive to timing, so some hardware shows it and
> > others don't.
> >
> > It looks very much like two 'resync' threads are running at the same
> > time.  When one finishes, it sets ->reshape_progress to -1 (MaxSector),
> > which trips up the other one.
> >
> > In the hang that I very rarely see, one thread (presumably) finishes and
> > sets MD_RECOVERY_DONE, so the raid5d thread waits for the resync thread
> > to complete, and that thread is waiting for raid5d to retire some
> > stripe_heads.
> >
> > ... though the 'resync' thread is probably actually doing a 'reshape'...
> >
> > Neil
>
> Good news - albeit not guaranteed yet.  I tried with the full patch that
> you sent to Linus, and with that I haven't been able to reproduce the
> problem so far.  I'll try and do some more testing over the weekend.
>
> The patch I manually applied only had two hunks in it; the one you
> pushed to Linus looks a lot more complete :)

Thanks for testing.  I'm fairly sure your issue is fixed now, but it is
very nice to have it confirmed.

> > Did you get a chance to bisect it?  I must admit that I doubt that would
> > be useful.  It probably started when "md_start_sync" was introduced, and
> > maybe got worse when some locking with mddev_lock was relaxed.
> >
> > The only way I can see a race is if MD_RECOVERY_DONE gets left set.
> > md_check_recovery always clears it before starting a new thread, but
> > raid5_start_reshape doesn't - or didn't before the patch I gave you.
> >
> > It might make more sense to clear the bit in md_reap_sync_thread as
> > below, but if the first patch didn't work, this one is unlikely to.
> >
> > Would you be able to test with the following patch?  There is a chance
> > it might confirm whether two sync threads are running at the same time.
>
> I can try with this patch too, but I won't get to it before next week.
> It's been a week of unrelated MD issues.

Don't bother - that one is just an early version of one that went to
Linus, so you have tested the important bit.

Thanks,
NeilBrown

> Thanks a lot!
>
> Cheers,
> Jes
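
For anyone following the MD_RECOVERY_DONE discussion above, here is a small
userspace sketch of the flag discipline being talked about.  It is not the
kernel code and not the patch that went to Linus; the names (sync_done,
sync_running, start_sync_thread, reap_sync_thread) are made-up stand-ins for
MD_RECOVERY_DONE, mddev->sync_thread, md_check_recovery and
md_reap_sync_thread.  The point it illustrates is simply that a worker sets a
"done" flag when it finishes, the reaper joins it and resets the state, and a
starter refuses to launch a new worker while stale state is still set - the
analogue of clearing MD_RECOVERY_DONE before starting a sync thread, which
raid5_start_reshape reportedly did not do before the fix.

/*
 * Minimal userspace analogy only - NOT drivers/md code and NOT the
 * patch discussed in this thread.  All identifiers are hypothetical.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static bool sync_done;      /* rough stand-in for MD_RECOVERY_DONE       */
static bool sync_running;   /* rough stand-in for "a sync thread exists" */
static pthread_t sync_tid;

static void *sync_thread(void *arg)
{
	(void)arg;
	usleep(100 * 1000);             /* pretend to resync/reshape */
	pthread_mutex_lock(&lock);
	sync_done = true;               /* "I have finished"         */
	pthread_mutex_unlock(&lock);
	return NULL;
}

/* Analogy for reaping the finished thread: join it and reset ALL state,
 * including the done flag, so nothing stale is left behind. */
static void reap_sync_thread(void)
{
	pthread_join(sync_tid, NULL);   /* blocks until the worker returns */
	pthread_mutex_lock(&lock);
	sync_running = false;
	sync_done = false;
	pthread_mutex_unlock(&lock);
}

/* Analogy for a careful starter: refuse to launch a new sync thread while
 * stale state (running or a leftover done flag) is still set. */
static int start_sync_thread(void)
{
	pthread_mutex_lock(&lock);
	if (sync_running || sync_done) {
		pthread_mutex_unlock(&lock);
		return -1;
	}
	sync_running = true;
	pthread_mutex_unlock(&lock);
	if (pthread_create(&sync_tid, NULL, sync_thread, NULL) != 0) {
		pthread_mutex_lock(&lock);
		sync_running = false;   /* undo bookkeeping on failure */
		pthread_mutex_unlock(&lock);
		return -1;
	}
	return 0;
}

int main(void)
{
	if (start_sync_thread() != 0)
		return 1;
	reap_sync_thread();             /* first pass finishes and is reaped */

	/* A second start succeeds cleanly only because no stale done flag
	 * was left behind; skipping that clear is the kind of gap the
	 * thread above is describing. */
	if (start_sync_thread() != 0) {
		fprintf(stderr, "stale state blocked a new sync thread\n");
		return 1;
	}
	reap_sync_thread();
	puts("two sync passes ran, one at a time");
	return 0;
}

Build with something like cc -pthread sketch.c.  In the real driver the same
idea can live in either place: clear the bit before starting a thread (what
md_check_recovery does) or clear it when the old thread is reaped (the
md_reap_sync_thread variant Neil floats above).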