Thanks for commenting, Neil,

On Thu, Jun 7, 2012 at 2:24 PM, NeilBrown <neilb@xxxxxxx> wrote:
> On Thu, 7 Jun 2012 10:22:24 +0300 Alexander Lyakas <alex.bolshoy@xxxxxxxxx>
> wrote:
>
>> Hi again Neil, and Andrey,
>>
>> looking at this email thread:
>> http://www.spinics.net/lists/raid/msg36236.html
>> between you and Andrey, the conclusion was:
>> "So the correct thing to do is to *not* update the metadata on the
>> recovering device until recovery completes. Then if it fails and is
>> re-added, it will look just the same as when it was re-added the first
>> time, and will do a bitmap-based recovery."
>>
>> I have two doubts about this decision:
>> 1) Since the event count on the recovering drive is not updated, this
>> means that after a reboot, when the array is re-assembled, this drive
>> will not be added to the array, and the user will have to manually
>> re-add it. I agree this is a minor thing.
>
> Still, if it can be fixed it should be.
>
>> 2) There are places in mdadm which check for a recovery_offset on the
>> drive and take decisions based upon that. Specifically, if there is
>> *no* recovery offset, the data on this drive is considered to be
>> consistent WRT the failure time of the drive. So, for example, the
>> drive can be a candidate for bumping up events during "force
>> assembly". Now, when the superblock on such a drive is not updated
>> during recovery (so there is *no* recovery offset), mdadm will think
>> that the drive is consistent, while in fact its data is totally
>> unusable until recovery completes. That's because we have updated
>> parts of the drive, but did not complete bringing the whole drive
>> in-sync.
>
> If mdadm would consider updating the event count if no recovery had
> started, then surely it is just as valid to do so once some recovery has
> started, even if it hasn't completed.

The patch you accepted from me ("Don't consider disks with a valid
recovery offset as candidates for bumping up event count") actually
attempts to protect against that :)
I don't understand why "it is just as valid to do so once some recovery
has started". My understanding is that once recovery of a drive has
started, its data is not consistent across different parts of the drive
until the recovery completes. This is because md does a bitmap-based
recovery, not a journal/transaction-log based recovery.
However, one could argue that for the force-assembly case, when data can
come up partially corrupted anyway, this is less important. I would
still think that there is value in recording in the superblock that a
drive is recovering. (I have appended two small sketches at the end of
this mail to illustrate what I mean.)

>
>>
>> Can you pls comment on my doubts?
>
> I think there is probably room for improvement here but I don't think
> there are any serious problems.
>
> However I'm about to go on leave for a couple of weeks so I'm unlikely to
> think about it for a while. I've made a note to look at it properly when
> I get back.
>

Indeed, don't you think about this while you are resting!

Alex.

>
>>
>> Thanks,
>> Alex.
>>
>>
>> On Wed, Jun 6, 2012 at 7:09 PM, Alexander Lyakas <alex.bolshoy@xxxxxxxxx> wrote:
>> > Hi Neil,
>> > it looks like we are hitting an issue with this patch.
>> > Kernel is stock ubuntu-precise (3.2.0-23 36), mdadm is v3.2.3. (We are
>> > aware that this version of mdadm has an issue of unaligned writes, so
>> > we backported the fix into this version.)
>> >
>> > We are testing the following simple scenario:
>> > # 2-way raid1 with a missing drive
>> > # a fresh drive (no superblock) is added (not re-added) to the raid,
>> > and the raid happily starts rebuilding it
>> > # at some point in time, this drive fails, so the rebuild aborts and
>> > the drive is kicked out
>> > # later the drive recovers and "mdadm remove & re-add" is attempted
>> > # re-add fails
>> >
>> > Some debugging suggests the following:
>> >
>> > - The superblock of the drive to be re-added looks like this:
>> >
>> >   Avail Dev Size : 10483712 (5.00 GiB 5.37 GB)
>> >       Array Size : 10483688 (5.00 GiB 5.37 GB)
>> >    Used Dev Size : 10483688 (5.00 GiB 5.37 GB)
>> >      Data Offset : 2048 sectors
>> >     Super Offset : 8 sectors
>> >  Recovery Offset : 100096 sectors
>> >            State : clean
>> >    Resync Offset : N/A sectors
>> >      Device UUID : d9114d02:76153894:87aaf09d:c55da9be
>> >  Internal Bitmap : 8 sectors from superblock
>> >      Update Time : Wed Jun 6 18:36:55 2012
>> >         Checksum : 94468c9d - correct
>> >           Events : 192
>> >      Device Role : Active device 1
>> >      Array State : AA ('A' == active, '.' == missing)
>> >
>> > So the drive has a valid recovery offset (MD_FEATURE_RECOVERY_OFFSET
>> > set) and a valid role.
>> >
>> > - The "re-add" path of mdadm calls getinfo_super1(), which performs:
>> >
>> >     role = __le16_to_cpu(sb->dev_roles[__le32_to_cpu(sb->dev_number)]);
>> >     ...
>> >     info->disk.state = 6; /* active and in sync */
>> >     info->disk.raid_disk = role;
>> >
>> > So at this point, MD_DISK_ACTIVE and MD_DISK_SYNC are set.
>> >
>> > - mdadm issues ADD_NEW_DISK, which arrives at add_new_disk() in the kernel.
>> > - The kernel performs:
>> >
>> >     if (!mddev->persistent) {
>> >             ... // not relevant to our case
>> >     } else
>> >             super_types[mddev->major_version].
>> >                     validate_super(mddev, rdev);
>> >
>> > - validate_super() performs:
>> >
>> >     if ((le32_to_cpu(sb->feature_map) &
>> >          MD_FEATURE_RECOVERY_OFFSET))
>> >             rdev->recovery_offset = le64_to_cpu(sb->recovery_offset);
>> >     else
>> >             set_bit(In_sync, &rdev->flags);
>> >     rdev->raid_disk = role;
>> >
>> > So the In_sync flag is not set, because we have a recovery offset.
>> >
>> > - Then the code introduced by the patch does:
>> >
>> >     if ((info->state & (1<<MD_DISK_SYNC)) &&
>> >         (!test_bit(In_sync, &rdev->flags) ||
>> >          rdev->raid_disk != info->raid_disk)) {
>> >             /* This was a hot-add request, but events doesn't
>> >              * match, so reject it.
>> >              */
>> >             export_rdev(rdev);
>> >             return -EINVAL;
>> >     }
>> >
>> > So the first part of the "OR" succeeds and we abort.
>> >
>> > An observation: this problem apparently only happens with a fresh
>> > drive. If a drive that was already In_sync fails and is then re-added,
>> > the kernel does not update its superblock until recovery completes. So
>> > its superblock has no valid recovery_offset, and this problem doesn't
>> > happen. The code that skips updating the superblock was added as part
>> > of this email thread: http://www.spinics.net/lists/raid/msg36275.html.
>> > However, I did not dig deeper into why the superblock gets updated in
>> > my case. But the scenario above reproduces easily.
>> >
>> > Q1: Can you confirm that my analysis is correct?
>> > Q2: Is this expected behavior? I would assume not, because the same
>> > case for a failed drive (not a fresh drive) works fine.
>> > Q3: How would you recommend handling this?
>> >
>> > Thanks!
>> > Alex.
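
P.S. To make the failure path above easier to follow, here is a small
stand-alone model of the logic as I read it. It is only a sketch: the
struct and function names are mine, and the function bodies are
simplified paraphrases of the mdadm/kernel snippets quoted above, not
the real code. It compiles as an ordinary userspace program:

#include <stdio.h>
#include <stdbool.h>
#include <errno.h>

struct sb_state {               /* what the on-disk superblock tells us */
    bool has_recovery_offset;   /* MD_FEATURE_RECOVERY_OFFSET is set */
    int  role;                  /* dev_roles[dev_number] */
};

/* mdadm side: getinfo_super1() reports the device as active and in sync
 * whenever it has a valid role, regardless of any recovery offset. */
static bool mdadm_sets_disk_sync(const struct sb_state *sb)
{
    return sb->role >= 0;
}

/* kernel side: validate_super() sets In_sync only when the superblock
 * records no recovery offset. */
static bool kernel_sets_in_sync(const struct sb_state *sb)
{
    return !sb->has_recovery_offset;
}

/* kernel side: the check added by the patch in add_new_disk() rejects the
 * request when userspace claims MD_DISK_SYNC but the rdev is not In_sync
 * (or the roles disagree). */
static int add_new_disk_model(const struct sb_state *sb, int info_raid_disk)
{
    bool disk_sync = mdadm_sets_disk_sync(sb);
    bool in_sync   = kernel_sets_in_sync(sb);

    if (disk_sync && (!in_sync || sb->role != info_raid_disk))
        return -EINVAL;    /* the re-add is rejected */
    return 0;
}

int main(void)
{
    /* the failing case from above: valid role plus a recovery offset */
    struct sb_state fresh_half_recovered = { .has_recovery_offset = true,  .role = 1 };
    /* the working case: previously In_sync drive, no recovery offset */
    struct sb_state previously_in_sync   = { .has_recovery_offset = false, .role = 1 };

    printf("fresh, half-recovered drive: %d\n",
           add_new_disk_model(&fresh_half_recovered, 1));   /* prints -22 */
    printf("previously in-sync drive:    %d\n",
           add_new_disk_model(&previously_in_sync, 1));     /* prints 0 */
    return 0;
}

Running it gives -EINVAL (-22) for the fresh, half-recovered drive and 0
for a drive that was In_sync before it failed, which matches the
behavior we see.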
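
And here is a second sketch, for the point about force assembly that I
made above: a drive should only be considered for an event-count bump if
its superblock does not record a recovery offset, because a
half-recovered drive is not internally consistent. Again, this is only
my simplified model of the intent of that mdadm patch; the names
(dev_info, RECOVERY_NONE, force_candidate) are mine, not mdadm's:

#include <stdio.h>
#include <stdbool.h>
#include <stdint.h>

#define RECOVERY_NONE UINT64_MAX    /* no recovery offset recorded */

struct dev_info {
    uint64_t events;            /* event count from the superblock */
    uint64_t recovery_offset;   /* RECOVERY_NONE if the feature bit is clear */
};

/* A device is a candidate for having its event count bumped during
 * force assembly only if it was never left in the middle of a
 * bitmap-based recovery, i.e. its data is self-consistent. */
static bool force_candidate(const struct dev_info *d)
{
    return d->recovery_offset == RECOVERY_NONE;
}

int main(void)
{
    struct dev_info cleanly_failed = { .events = 190, .recovery_offset = RECOVERY_NONE };
    struct dev_info half_recovered = { .events = 190, .recovery_offset = 100096 };

    printf("cleanly failed drive: candidate=%d\n", force_candidate(&cleanly_failed));
    printf("half-recovered drive: candidate=%d\n", force_candidate(&half_recovered));
    return 0;
}

Note that if the superblock of a recovering drive is never updated, the
recovery offset is never recorded, and a check like this cannot tell a
half-recovered drive apart from a cleanly failed one, which is exactly
my doubt (2) above.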