raid6 corruption after assembling with event counter difference of 1

Alexander Lyakas <alex.bolshoy@xxxxxxxxx> · Wed, 22 Nov 2023 11:42:09 +0200

Hello Song Liu,

We had a raid6 with 6 drives, all drives marked as In_sync. At some
point drive in slot 5 (last drive) was marked as Faulty, due to
timeout IO error. Superblocks of all other drives got updated with
event count higher by 1. However, the Faulty drive was still not
ejected from the array by remove_and_add_spares(), probably because it
still had nr_pending. This situation was going on for 20 minutes, and
the Faulty drive was still not being removed from the array. But array
continued serving writes, skipping the Faulty drive as designed.

After about 20 minutes, the machine got rebooted due to some other reason.

After reboot, the array got assembled, and the event counter
difference was 1 between the problematic drive and all other drives.
Even count on all drives was 2834681, but on the problematic drive it
was 2834680. As a result, mdadm considered the problematic drive as
up-to-date, due to this code in mdadm[1]. Kernel also accepted such
difference of 1, as can be seen in super_1_validate() [2].

In addition, the array was marked as dirty, so RESYNC of the array
started. For raid6, to my understanding, resync re-calculates parity
blocks based on data blocks. But many data blocks on the problematic
drive were not up to date, because this drive was marked as Faulty for
20 minutes and writes to it were skipped. As a result, REYNC made the
parity blocks to match the not-up-to-date data blocks from the
problematic drive. Data on the array became unusable.

Many years ago, I asked Neil why event count difference of 1 was
allowed. He responded that this was to address the case when the
machine crashes in the middle of superblock writes, so some superblock
writes succeeded and some failed. In such case, allowing event count
difference of 1 is legitimate.

Can you please comment of whether this behavior seems correct, in
light of the scenario above?

Thanks,
Alex.

[1]
int event_margin = 1; /* always allow a difference of '1'
       * like the kernel does
       */
...
/* Require event counter to be same as, or just less than,
* most recent.  If it is bigger, it must be a stray spare and
* should be ignored.
*/
if (devices[j].i.events+event_margin >=
    devices[most_recent].i.events &&
    devices[j].i.events <=
    devices[most_recent].i.events
) {
devices[j].uptodate = 1;

[2]
} else if (mddev->pers == NULL) {
/* Insist of good event counter while assembling, except for
* spares (which don't need an event count) */
++ev1;
if (rdev->desc_nr >= 0 &&
    rdev->desc_nr < le32_to_cpu(sb->max_dev) &&
    (le16_to_cpu(sb->dev_roles[rdev->desc_nr]) < MD_DISK_ROLE_MAX ||
     le16_to_cpu(sb->dev_roles[rdev->desc_nr]) == MD_DISK_ROLE_JOURNAL))
if (ev1 < mddev->events)
return -EINVAL;