On Wed, 06 Mar 2013 10:31:55 +0100 Jes Sorensen <Jes.Sorensen@xxxxxxxxxx> wrote: > NeilBrown <neilb@xxxxxxx> writes: > > On Tue, 05 Mar 2013 09:44:54 +0100 Jes Sorensen <Jes.Sorensen@xxxxxxxxxx> > > wrote: > >> > Does this fix it? > >> > > >> > NeilBrown > >> > >> Unfortunately no, I still see these crashes with this one applied :( > >> > > > > Thanks - the symptom looked similar, but now that I look more closely I can > > see it is quite different. > > > > How about this then? I can't really see what is happening, but based on the > > patch that you identified it must be related to these flags. > > It seems that handle_stripe_clean_event() is being called to early, and it > > doesn't clear out the ->written bios because they are still locked or > > something. But it does clear R5_Discard on the parity block, so > > handle_stripe_clean_event doesn't get called again. > > > > This makes the handling of the various flags somewhat more uniform, which is > > probably a good thing. > > Hi Neil, > > With this one applied I end up with an OOPS instead. Note I had to > modify the last test/clear bit sequence to use &sh->dev[i].flags instead > of &dev->flags to avoid a compiler warning. Oops. > > I am attaching the test script I am running too. It was written by Eryu > Guan. Thanks for that. I've tried using it but haven't managed to trigger a BUG yet. What size are the loop files? I mostly use fairly small ones, but maybe it needs to be bigger to trigger the problem. My current guess is that recovery and discard are both called on the stripe at the same time and they race and leave the stripe in a slightly confused state. But I haven't found the exact state yet. The discard code always attaches a 'discard' request to every device in a stripe_head all at once, under stripe_lock. However when ops_bio_drain picks those discard requests of ->towrite and puts them on ->written, it takes stripe_lock once per device. Maybe we just need to change the coverage to stripe_lock there to be held for the entire loop. We would still want to drop it before calling async_copy_data() and reclaim afterwards, but that wouldn't affect the 'discard' case. > > > [ 2623.554780] kernel BUG at drivers/md/raid5.c:2954! Could you confirm exactly which line this was - there are a few BUG_ON()s around there. They are all related to R5_UPTODATE not being set I think, but it might help to know exactly when it isn't set. Thanks, NeilBrown
Attachment:
signature.asc
Description: PGP signature