Hi Neil, can you please comment on what you think of this change? Thanks, Alex.

On Wed, Feb 19, 2014 at 11:51 AM, Alexander Lyakas <alex.bolshoy@xxxxxxxxx> wrote:
> Hello Neil, Andrei,
> I don't know how much you recall of this old discussion.
>
> Basically, the change that you made means that md doesn't update the
> superblock on the recovering device until recovery completes. As a
> result, when assembling such an array, the recovering device has an old
> event count in its superblock and is not picked during assembly. So
> later, the user has to re-add it manually.
> This is true for a device that failed and was re-added. For a fresh
> device, saved_raid_disk == -1, so its superblock will still be updated
> (and such a device will be picked on assembly).
>
> On the other hand, MD updates the superblock of the In_sync devices
> and marks the recovering device as having a valid array slot (not
> 0xFFFF or 0xFFFE, which are treated similarly today).
>
> What do you think of the following change:
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 561a65f..4bbc7e3 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -1854,8 +1854,6 @@ retry:
>  			sb->dev_roles[i] = cpu_to_le16(0xfffe);
>  		else if (test_bit(In_sync, &rdev2->flags))
>  			sb->dev_roles[i] = cpu_to_le16(rdev2->raid_disk);
> -		else if (rdev2->raid_disk >= 0)
> -			sb->dev_roles[i] = cpu_to_le16(rdev2->raid_disk);
>  		else
>  			sb->dev_roles[i] = cpu_to_le16(0xffff);
>  	}
>
> Basically, we do not mark the slot of the recovering device until its
> recovery completes, so during assembly we will not pick it, as before.
> For a fresh drive there will be a regression: we will not pick it on
> assembly either.
>
> The reason I am considering this change is another old discussion that
> we had about split-brain resolution, and the proposal I made in:
> https://docs.google.com/document/d/1sgO7NgvIFBDccoI3oXp9FNzB6RA5yMwqVN3_-LMSDNE
>
> Basically, when MD marks recovering devices as having a valid raid
> slot in the superblock of the In_sync devices, then on array assembly
> we don't know whether such a device is recovering or In_sync (if it is
> inaccessible during assembly). So we have to assume it is In_sync, and
> thus it can potentially cause split-brain.
>
> What do you think of this change?
>
> Thanks,
> Alex.
>
>
> On Wed, Jun 27, 2012 at 9:22 AM, NeilBrown <neilb@xxxxxxx> wrote:
>> On Thu, 7 Jun 2012 15:47:46 +0300 Alexander Lyakas <alex.bolshoy@xxxxxxxxx>
>> wrote:
>>
>>> Thanks for commenting, Neil.
>>>
>>> On Thu, Jun 7, 2012 at 2:24 PM, NeilBrown <neilb@xxxxxxx> wrote:
>>> > On Thu, 7 Jun 2012 10:22:24 +0300 Alexander Lyakas <alex.bolshoy@xxxxxxxxx>
>>> > wrote:
>>> >
>>> >> Hi again Neil, and Andrey,
>>> >>
>>> >> Looking at this email thread:
>>> >> http://www.spinics.net/lists/raid/msg36236.html
>>> >> between you and Andrey, the conclusion was:
>>> >> "So the correct thing to do is to *not* update the metadata on the
>>> >> recovering device until recovery completes. Then if it fails and is
>>> >> re-added, it will look just the same as when it was re-added the first
>>> >> time, and will do a bitmap-based recovery."
>>> >>
>>> >> I have two doubts about this decision:
>>> >> 1) Since the event count on the recovering drive is not updated, this
>>> >> means that after a reboot, when the array is re-assembled, this drive
>>> >> will not be added to the array, and the user will have to manually
>>> >> re-add it. I agree this is a minor thing.
>>> >
>>> > Still, if it can be fixed, it should be.
>>> >
>>> >> 2) There are places in mdadm which check for a recovery_offset on the
>>> >> drive and take decisions based upon it. Specifically, if there is
>>> >> *no* recovery offset, the data on the drive is considered to be
>>> >> consistent WRT the failure time of the drive. So, for example, the
>>> >> drive can be a candidate for bumping up events during "force
>>> >> assembly". Now, when the superblock on such a drive is not updated
>>> >> during recovery (so there is *no* recovery offset), mdadm will think
>>> >> that the drive is consistent, while in fact its data is totally
>>> >> unusable until recovery completes. That's because we have updated
>>> >> parts of the drive, but did not finish bringing the whole drive
>>> >> in sync.
>>> >
>>> > If mdadm would consider updating the event count if no recovery had
>>> > started, then surely it is just as valid to do so once some recovery
>>> > has started, even if it hasn't completed.
>>>
>>> The patch you accepted from me ("Don't consider disks with a valid
>>> recovery offset as candidates for bumping up event count") actually
>>> attempts to protect from that :)
>>>
>>> I don't understand why "it is just as valid to do so once some
>>> recovery has started". My understanding is that once recovery of a
>>> drive has started, its data is not consistent between different parts
>>> of the drive until the recovery completes. This is because md does
>>> bitmap-based recovery, not any kind of journal/transaction-log-based
>>> recovery.
>>>
>>> However, one could argue that for the force-assembly case, when data
>>> can come up partially corrupted anyway, this is less important.
>>
>> Exactly. And mdadm only updates event counts in the force-assembly case,
>> so while it might not be ideal, it is the best we can do.
>>
>>> I would still think that there is value in recording in the superblock
>>> that a drive is recovering.
>>
>> Probably. It is a bit unfortunate that if you stop an array that is
>> recovering after a --re-add, you cannot simply 'assemble' it again and
>> get it back to the same state.
>> I'll think more on that.
>>
>> Meanwhile, this patch might address your other problem. It allows
>> --re-add to work if a non-bitmap rebuild fails and the device is then
>> re-added.
>>
>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>> index c601c4b..d31852e 100644
>> --- a/drivers/md/md.c
>> +++ b/drivers/md/md.c
>> @@ -5784,7 +5784,7 @@ static int add_new_disk(struct mddev * mddev, mdu_disk_info_t *info)
>>  		super_types[mddev->major_version].
>>  			validate_super(mddev, rdev);
>>  		if ((info->state & (1<<MD_DISK_SYNC)) &&
>> -		    (!test_bit(In_sync, &rdev->flags) ||
>> +		    (test_bit(Faulty, &rdev->flags) ||
>>  		     rdev->raid_disk != info->raid_disk)) {
>>  			/* This was a hot-add request, but events doesn't
>>  			 * match, so reject it.
>>
>>> >
>>> >> Can you please comment on my doubts?
>>> >
>>> > I think there is probably room for improvement here, but I don't think
>>> > there are any serious problems.
>>> >
>>> > However, I'm about to go on leave for a couple of weeks, so I'm unlikely
>>> > to think about it for a while. I've made a note to look at it properly
>>> > when I get back.
>>> >
>>>
>>> Indeed, don't you think about this while you are resting!
>>
>> :-)
>>
>> Thanks. I did have a very enjoyable break.
>>
>> NeilBrown
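
For reference, here is a minimal sketch of the role-recording loop in
super_1_sync() with Alex's proposed change applied. It assumes the names
visible in the quoted diff (rdev_for_each(), desc_nr, dev_roles[]) and omits
the surrounding retry logic, so it is an illustration rather than the
verbatim kernel source:

    /* Record each member device's role in the v1.x superblock.
     * With the proposed change, only fully In_sync devices get a
     * valid slot; a device whose recovery is still in progress
     * falls through to 0xffff (spare), so assembly will not mistake
     * it for a device holding consistent data. */
    rdev_for_each(rdev2, mddev) {
            int i = rdev2->desc_nr;

            if (test_bit(Faulty, &rdev2->flags))
                    sb->dev_roles[i] = cpu_to_le16(0xfffe); /* failed */
            else if (test_bit(In_sync, &rdev2->flags))
                    sb->dev_roles[i] = cpu_to_le16(rdev2->raid_disk);
            else
                    sb->dev_roles[i] = cpu_to_le16(0xffff); /* spare or recovering */
    }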
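
Similarly, a sketch of the hot-add check in add_new_disk() with Neil's patch
applied, simplified from the quoted diff (the export_rdev()/-EINVAL error
path is an assumption about how the surrounding function rejects the
request). The effect of the one-line change is that a device which is merely
not yet In_sync, e.g. one whose rebuild was interrupted, is no longer
rejected on --re-add; only a genuinely Faulty device, or one that disagrees
about its raid slot, is:

    /* After validate_super() has filled in rdev from the on-disk
     * superblock, reject the re-add only if the device is actually
     * Faulty or claims the wrong slot; a device that is simply
     * mid-recovery (not In_sync) may now be re-added. */
    if ((info->state & (1 << MD_DISK_SYNC)) &&
        (test_bit(Faulty, &rdev->flags) ||
         rdev->raid_disk != info->raid_disk)) {
            /* hot-add request, but events don't match: reject it */
            export_rdev(rdev);
            return -EINVAL;
    }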