Hi, On 2023/11/22 17:42, Alexander Lyakas wrote:
Hello Song Liu, We had a raid6 with 6 drives, all marked as In_sync. At some point the drive in slot 5 (the last drive) was marked as Faulty due to an IO timeout error. The superblocks of all other drives were updated with an event count higher by 1. However, the Faulty drive was still not ejected from the array by remove_and_add_spares(), probably because it still had nr_pending. This situation went on for 20 minutes, and
I think this is important: what kind of driver are you using for the array? 20 minutes should be enough for the block layer timeout handling to finish all the IO. Did you try to remove this disk manually?
the Faulty drive was still not being removed from the array. But the array continued serving writes, skipping the Faulty drive as designed. After about 20 minutes, the machine was rebooted for an unrelated reason.
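A rough, self-contained user-space illustration of the gate the report points at (this is not md code; struct member and can_remove() are hypothetical names, and nr_pending here only mirrors the rdev field mentioned above): a Faulty member can only be ejected once no in-flight IO references it any more, so a counter that never drops to zero keeps the drive in the array indefinitely.

#include <stdbool.h>
#include <stdio.h>

struct member {
	bool faulty;
	int nr_pending;		/* in-flight IOs still referencing this member */
};

/* Returns true only when a Faulty member can actually be ejected. */
static bool can_remove(const struct member *m)
{
	return m->faulty && m->nr_pending == 0;
}

int main(void)
{
	struct member stuck = { .faulty = true, .nr_pending = 3 };

	/* As long as nr_pending never drops to 0, the Faulty drive stays
	 * in the array, which is what the report describes for ~20 min. */
	printf("can remove: %d\n", can_remove(&stuck));	/* prints 0 */
	return 0;
}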
md_new_event() is called from do_md_stop(), which means that if this array had been stopped, the event counter on the other drives would be higher by 2, and this problem would not exist anymore. So I guess your array was not stopped normally during the reboot, and your case is really the same as a crash while updating the superblock. There is a simple way to avoid your problem: bump the event counter by 2 in md_error().

Thanks,
Kuai
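To make the arithmetic of that suggestion concrete, here is a minimal user-space sketch (not a kernel patch; accepted_at_assembly() is a hypothetical name) of the tolerance in super_1_validate(), quoted below as [2]: the candidate's event count is bumped by one before the comparison, so a drive exactly one event behind is still accepted, while a drive two or more behind is rejected. That is why advancing the counter by 2 in md_error() would keep a stale Faulty member out of the assembled array.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Mirrors the shape of the check in [2]: "++ev1; if (ev1 < mddev->events)
 * return -EINVAL;" -- i.e. accept the drive unless it is 2+ events behind. */
static bool accepted_at_assembly(uint64_t drive_events, uint64_t array_events)
{
	uint64_t ev1 = drive_events + 1;

	return !(ev1 < array_events);
}

int main(void)
{
	uint64_t array_events = 100;

	/* Today md_error() leaves the failed drive exactly 1 event behind. */
	printf("1 behind accepted: %d\n",
	       accepted_at_assembly(array_events - 1, array_events)); /* 1 */

	/* If md_error() advanced the counter by 2, the stale drive would be
	 * at least 2 behind and would be refused at assembly. */
	printf("2 behind accepted: %d\n",
	       accepted_at_assembly(array_events - 2, array_events)); /* 0 */
	return 0;
}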
After reboot, the array got assembled, and the event counter difference was 1 between the problematic drive and all other drives. The event count on all drives was 2834681, but on the problematic drive it was 2834680. As a result, mdadm considered the problematic drive up-to-date, due to this code in mdadm [1]. The kernel also accepted such a difference of 1, as can be seen in super_1_validate() [2].

In addition, the array was marked as dirty, so a RESYNC of the array started. For raid6, to my understanding, resync re-calculates parity blocks based on data blocks. But many data blocks on the problematic drive were not up to date, because this drive had been marked as Faulty for 20 minutes and writes to it were skipped. As a result, RESYNC made the parity blocks match the not-up-to-date data blocks from the problematic drive. Data on the array became unusable.

Many years ago, I asked Neil why an event count difference of 1 was allowed. He responded that this was to address the case where the machine crashes in the middle of superblock writes, so some superblock writes succeeded and some failed. In such a case, allowing an event count difference of 1 is legitimate.

Can you please comment on whether this behavior seems correct, in light of the scenario above?

Thanks,
Alex.

[1]
int event_margin = 1; /* always allow a difference of '1'
                       * like the kernel does */
...
/* Require event counter to be same as, or just less than,
 * most recent. If it is bigger, it must be a stray spare and
 * should be ignored. */
if (devices[j].i.events+event_margin >= devices[most_recent].i.events &&
    devices[j].i.events <= devices[most_recent].i.events) {
        devices[j].uptodate = 1;

[2]
} else if (mddev->pers == NULL) {
        /* Insist of good event counter while assembling, except for
         * spares (which don't need an event count) */
        ++ev1;
        if (rdev->desc_nr >= 0 &&
            rdev->desc_nr < le32_to_cpu(sb->max_dev) &&
            (le16_to_cpu(sb->dev_roles[rdev->desc_nr]) < MD_DISK_ROLE_MAX ||
             le16_to_cpu(sb->dev_roles[rdev->desc_nr]) == MD_DISK_ROLE_JOURNAL))
                if (ev1 < mddev->events)
                        return -EINVAL;
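With the event counts from this report plugged into the mdadm condition quoted in [1], the stale drive is indeed classified as up to date. A tiny self-contained sketch (not mdadm code; uptodate() is a hypothetical stand-in for the quoted logic):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-in for the mdadm check in [1]: a drive counts as up to date if it
 * is within event_margin (1) below the most recent count and not ahead. */
static bool uptodate(uint64_t drive_events, uint64_t most_recent_events)
{
	const uint64_t event_margin = 1;	/* "like the kernel does" */

	return drive_events + event_margin >= most_recent_events &&
	       drive_events <= most_recent_events;
}

int main(void)
{
	/* 2834681 on the healthy drives, 2834680 on the drive that had been
	 * Faulty (and skipping writes) for ~20 minutes before the reboot. */
	printf("stale drive uptodate: %d\n", uptodate(2834680, 2834681)); /* 1 */
	return 0;
}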