I forgot to mention that there is a flaw in my patch: if the user does
echo "frozen" > /sys/block/mdX/md/sync_action
then the resync is aborted and the NEEDED flag is also cleared, so I cannot
distinguish between "needs resync" and "has ongoing writes". In that case
RMW is performed.

If there is a drive failure during resync, then eventually recovery_cp
becomes MaxSector (after the additional "empty resync", as we discussed long
ago). So then it's ok to do RMW, because when we re-add the drive, we will
reconstruct its missing blocks.
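
To make that concrete: inside handle_stripe_dirtying() the state I am
describing could be spotted roughly like this. This is only an illustration
based on my reading of the 3.2 md code (in particular, that writing "frozen"
to sync_action sets MD_RECOVERY_FROZEN); it is not code from the patch below:

        /* After echo "frozen" > /sys/block/mdX/md/sync_action has stopped
         * the resync:
         */
        sector_t recovery_cp = conf->mddev->recovery_cp;
        unsigned long recovery = conf->mddev->recovery;

        if (recovery_cp < MaxSector &&                   /* resync has not finished */
            test_bit(MD_RECOVERY_FROZEN, &recovery) &&   /* set by the "frozen" write */
            !test_bit(MD_RECOVERY_NEEDED, &recovery) &&  /* cleared, as noted above */
            !test_bit(MD_RECOVERY_SYNC, &recovery))      /* cleared once the sync thread exits */
                pr_debug("frozen with recovery_cp=%llu: un-resynced stripes may get RMW\n",
                         (unsigned long long)recovery_cp);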

Thanks,
Alex.

On Thu, Sep 13, 2012 at 7:05 PM, Alexander Lyakas <alex.bolshoy@xxxxxxxxx> wrote:
> Hi Neil,
> below please find the patch. It got pretty ugly, so even if you don't
> apply it, please comment. Did I miss anything?
> As we discussed some time ago, the usage of mddev->recovery_cp is
> overloaded in MD - it indicates both the resync progress and also
> whether the array is dirty due to ongoing writes. So I tried to somehow
> differentiate between these two. I have not yet managed to reproduce the
> "needed" and "transitional" parts. I also don't force reconstruct on
> "check" and "repair". I tested the patch on kernel 3.2.
>
> Finally, I asked you some time ago about how MD treats stripes when it
> is rebuilding a drive. So now I dug into the code, and found out that
> absolutely everything you told me was correct. Amazing!
>
> Thanks,
> Alex.
>
> ---------------------------
> From e876cddae5768bc7f6bdcbd617f36ea3a12445aa Mon Sep 17 00:00:00 2001
> From: Alex Lyakas <alex@xxxxxxxxxxxxxxxxx>
> Date: Thu, 13 Sep 2012 18:55:00 +0300
> Subject: [PATCH] When RAID5 is dirty, force reconstruct-write instead of
>  read-modify-write.
>
> Signed-off-by: Alex Lyakas <alex@xxxxxxxxxxxxxxxxx>
> Signed-off-by: Yair Hershko <yair@xxxxxxxxxxxxxxxxx>
>
> diff --git a/ubuntu_kmodules/Ubuntu-3.2.0-25.40/drivers/md/raid5.c b/ubuntu_kmodules/Ubuntu-3.2.0-25.40/drivers/md/raid5.c
> index 5332202..fa03410 100644
> --- a/ubuntu_kmodules/Ubuntu-3.2.0-25.40/drivers/md/raid5.c
> +++ b/ubuntu_kmodules/Ubuntu-3.2.0-25.40/drivers/md/raid5.c
> @@ -2561,26 +2561,54 @@ static void handle_stripe_dirtying(struct r5conf *conf,
>                   * look like rcw is cheaper
>                   */
>                  rcw = 1; rmw = 2;
> -        } else for (i = disks; i--; ) {
> -                /* would I have to read this buffer for read_modify_write */
> -                struct r5dev *dev = &sh->dev[i];
> -                if ((dev->towrite || i == sh->pd_idx) &&
> -                    !test_bit(R5_LOCKED, &dev->flags) &&
> -                    !(test_bit(R5_UPTODATE, &dev->flags) ||
> -                      test_bit(R5_Wantcompute, &dev->flags))) {
> -                        if (test_bit(R5_Insync, &dev->flags))
> -                                rmw++;
> -                        else
> -                                rmw += 2*disks; /* cannot read it */
> -                }
> -                /* Would I have to read this buffer for reconstruct_write */
> -                if (!test_bit(R5_OVERWRITE, &dev->flags) && i != sh->pd_idx &&
> -                    !test_bit(R5_LOCKED, &dev->flags) &&
> -                    !(test_bit(R5_UPTODATE, &dev->flags) ||
> -                      test_bit(R5_Wantcompute, &dev->flags))) {
> -                        if (test_bit(R5_Insync, &dev->flags)) rcw++;
> -                        else
> -                                rcw += 2*disks;
> +        } else {
> +                /* Attempt to check whether resync is now happening.
> +                 * If yes, the array is dirty (after unclean shutdown or
> +                 * initial creation), so parity in some stripes might be inconsistent.
> +                 * In this case, we need to always do reconstruct-write, to ensure
> +                 * that in case of drive failure or read-error correction, we
> +                 * generate correct data from the parity.
> +                 */
> +                sector_t recovery_cp = conf->mddev->recovery_cp;
> +                unsigned long recovery = conf->mddev->recovery;
> +                int needed = test_bit(MD_RECOVERY_NEEDED, &recovery);
> +                int resyncing = test_bit(MD_RECOVERY_SYNC, &recovery) &&
> +                                !test_bit(MD_RECOVERY_REQUESTED, &recovery) &&
> +                                !test_bit(MD_RECOVERY_CHECK, &recovery);
> +                int transitional = test_bit(MD_RECOVERY_RUNNING, &recovery) &&
> +                                   !test_bit(MD_RECOVERY_SYNC, &recovery) &&
> +                                   !test_bit(MD_RECOVERY_RECOVER, &recovery) &&
> +                                   !test_bit(MD_RECOVERY_DONE, &recovery) &&
> +                                   !test_bit(MD_RECOVERY_RESHAPE, &recovery);
> +
> +                /* If the array is dirty and should start resync or is already resyncing */
> +                if (recovery_cp < MaxSector && sh->sector >= recovery_cp &&
> +                    (needed || resyncing || transitional)) {
> +                        /* Force reconstruct-write */
> +                        rcw = 1; rmw = 2;
> +                        pr_debug("dirty, force RCW recovery_cp=%llu sh->sector=%llu recovery=0x%lx\n",
> +                                 (unsigned long long)recovery_cp, (unsigned long long)sh->sector, recovery);
> +                } else for (i = disks; i--; ) {
> +                        /* would I have to read this buffer for read_modify_write */
> +                        struct r5dev *dev = &sh->dev[i];
> +                        if ((dev->towrite || i == sh->pd_idx) &&
> +                            !test_bit(R5_LOCKED, &dev->flags) &&
> +                            !(test_bit(R5_UPTODATE, &dev->flags) ||
> +                              test_bit(R5_Wantcompute, &dev->flags))) {
> +                                if (test_bit(R5_Insync, &dev->flags))
> +                                        rmw++;
> +                                else
> +                                        rmw += 2*disks; /* cannot read it */
> +                        }
> +                        /* Would I have to read this buffer for reconstruct_write */
> +                        if (!test_bit(R5_OVERWRITE, &dev->flags) && i != sh->pd_idx &&
> +                            !test_bit(R5_LOCKED, &dev->flags) &&
> +                            !(test_bit(R5_UPTODATE, &dev->flags) ||
> +                              test_bit(R5_Wantcompute, &dev->flags))) {
> +                                if (test_bit(R5_Insync, &dev->flags)) rcw++;
> +                                else
> +                                        rcw += 2*disks;
> +                        }
>                  }
>          }
>          pr_debug("for sector %llu, rmw=%d rcw=%d\n",
>
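
As a side note on the condition itself - this is a rough, untested sketch,
not part of the patch above: if the decision relied on recovery_cp alone,
i.e. something like

        sector_t recovery_cp = conf->mddev->recovery_cp;

        if (recovery_cp < MaxSector && sh->sector >= recovery_cp) {
                /* Parity of stripes at or beyond recovery_cp may be stale,
                 * so make rcw look cheaper and force reconstruct-write.
                 */
                rcw = 1; rmw = 2;
        }

then the "frozen" case mentioned at the top of this mail would presumably be
covered as well, because freezing stops the resync but, as far as I can tell,
does not move recovery_cp past the un-resynced region. Whether dropping the
recovery-flag tests loses anything important is something I would appreciate
feedback on.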

> On Thu, Sep 13, 2012 at 3:19 AM, NeilBrown <neilb@xxxxxxx> wrote:
>> On Wed, 12 Sep 2012 19:49:52 +0300 Alexander Lyakas <alex.bolshoy@xxxxxxxxx>
>> wrote:
>>
>>> Hi Neil,
>>> I have done some more investigation on that.
>>>
>>> I see that according to handle_stripe_dirtying(), raid6 always does
>>> reconstruct-write, while raid5 checks which is cheaper in terms of
>>> IOs - read-modify-write or reconstruct-write. For example, for a
>>> 3-drive raid5 both come out the same, so because of:
>>>
>>>     if (rmw < rcw && rmw > 0)
>>>         ... /* this is not picked, because in this case rmw==rcw==1 */
>>>
>>> reconstruct-write is always performed for such a 3-drive raid5.
>>> Is this correct?
>>
>> Yes.
>>
>>> The issue with doing read-modify-writes is that later we have no
>>> reliable way to know whether the parity block is correct - when we
>>> later do reconstruct-write because of a read error, for example. For
>>> read requests we could perhaps have checked the bitmap, and done
>>> reconstruct-write if the relevant bit is not set, but for write
>>> requests the relevant bit will always be set, because it is set when
>>> the write is started.
>>>
>>> I tried the following scenario, which showed data corruption:
>>> # Create a 4-drive raid5 in "--force" mode, so resync starts
>>> # Write one sector on a stripe that resync has not handled yet. RMW is
>>>   performed, but the parity is incorrect because the two other data
>>>   blocks were not taken into account (they contain garbage).
>>> # Induce a read error on the sector that I just wrote to
>>> # Let resync handle this stripe
>>>
>>> As a result, resync corrects my sector using the two other data blocks +
>>> the parity block, which is out of sync. When I read back the sector, the
>>> data is incorrect.
>>>
>>> I see that I can easily force raid5 to always do reconstruct-write,
>>> the same way as you do for raid6. However, I realize that for
>>> performance reasons it is better to do RMW when possible.
>>>
>>> What do you think about the following rough suggestion: in
>>> handle_stripe_dirtying() check whether resync is ongoing or should be
>>> started - using MD_RECOVERY_SYNC, for example. If there is an ongoing
>>> resync, there is a good reason for that: parity on some stripes is
>>> probably out of date. So in that case, always force reconstruct-write.
>>> Otherwise, count which is cheaper, as you do now.
>>> (Can RCW really be cheaper than RMW?)
>>>
>>> So during resync, array performance will be lower, but we will ensure
>>> that all stripe blocks are consistent. What do you think?
>>
>> I'm fairly sure we used to do that - long long ago. (hunts through git
>> history...) No. The code-fragment was there but it was commented out.
>>
>> I think it would be good to avoid 'rmw' if the sector offset is less than
>> recovery_cp.
>>
>> Care to write a patch?
>>
>> Thanks,
>> NeilBrown