Hi Alexander,

Thanks for the report.

On Wed, Nov 22, 2023 at 1:42 AM Alexander Lyakas <alex.bolshoy@xxxxxxxxx> wrote:
>
> Hello Song Liu,
>
> We had a raid6 with 6 drives, all drives marked as In_sync. At some
> point the drive in slot 5 (the last drive) was marked as Faulty due to
> a timeout IO error. The superblocks of all other drives got updated
> with an event count higher by 1. However, the Faulty drive was still
> not ejected from the array by remove_and_add_spares(), probably
> because it still had nr_pending. This situation went on for 20
> minutes, and the Faulty drive was still not removed from the array.
> But the array continued serving writes, skipping the Faulty drive as
> designed.
>
> After about 20 minutes, the machine was rebooted for some other reason.
>
> After reboot, the array was assembled, and the event counter
> difference between the problematic drive and all other drives was 1.
> The event count on all drives was 2834681, but on the problematic
> drive it was 2834680. As a result, mdadm considered the problematic
> drive up to date, due to this code in mdadm [1]. The kernel also
> accepted such a difference of 1, as can be seen in
> super_1_validate() [2].

Did super_1_validate() set the bad drive as Faulty?

		switch(role) {
		case MD_DISK_ROLE_SPARE: /* spare */
			break;
		case MD_DISK_ROLE_FAULTY: /* faulty */
			set_bit(Faulty, &rdev->flags);
			break;

AFAICT, this should stop the bad drive from going back into the array.
If this doesn't work, there might be a bug somewhere.

Thanks,
Song

> In addition, the array was marked as dirty, so a RESYNC of the array
> started. For raid6, to my understanding, resync recalculates parity
> blocks based on the data blocks. But many data blocks on the
> problematic drive were not up to date, because the drive had been
> marked Faulty for 20 minutes and writes to it were skipped. As a
> result, RESYNC made the parity blocks match the stale data blocks
> from the problematic drive. Data on the array became unusable.
>
> Many years ago, I asked Neil why an event count difference of 1 was
> allowed. He responded that this was to handle the case where the
> machine crashes in the middle of superblock writes, so some
> superblock writes succeeded and some failed. In such a case, allowing
> an event count difference of 1 is legitimate.
>
> Can you please comment on whether this behavior is correct, in light
> of the scenario above?
>
> Thanks,
> Alex.
>
> [1]
>         int event_margin = 1; /* always allow a difference of '1'
>                                * like the kernel does
>                                */
> ...
>         /* Require event counter to be same as, or just less than,
>          * most recent.  If it is bigger, it must be a stray spare and
>          * should be ignored.
>          */
>         if (devices[j].i.events+event_margin >=
>             devices[most_recent].i.events
>             &&
>             devices[j].i.events <=
>             devices[most_recent].i.events
>            ) {
>                 devices[j].uptodate = 1;
>
> [2]
> 	} else if (mddev->pers == NULL) {
> 		/* Insist of good event counter while assembling, except for
> 		 * spares (which don't need an event count) */
> 		++ev1;
> 		if (rdev->desc_nr >= 0 &&
> 		    rdev->desc_nr < le32_to_cpu(sb->max_dev) &&
> 		    (le16_to_cpu(sb->dev_roles[rdev->desc_nr]) < MD_DISK_ROLE_MAX ||
> 		     le16_to_cpu(sb->dev_roles[rdev->desc_nr]) == MD_DISK_ROLE_JOURNAL))
> 			if (ev1 < mddev->events)
> 				return -EINVAL;
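
[Editor's note: to make the arithmetic in [1] and [2] concrete, here is a
small standalone C sketch, not the actual mdadm or kernel code, of the
acceptance rule both snippets implement. The function name and the test
values in main() are illustrative only; the counts are the ones from the
report. mdadm compares with event_margin = 1, and super_1_validate()
increments ev1 before comparing, which is the same rule: a device exactly
one event behind the most recent count is still accepted.]

	#include <stdio.h>

	/* Hypothetical model of the event-count acceptance rule shared by
	 * mdadm (event_margin = 1) and super_1_validate() (++ev1 before
	 * the comparison). Not the real code; just the arithmetic. */
	static int events_acceptable(unsigned long long dev_events,
				     unsigned long long most_recent_events)
	{
		const unsigned long long event_margin = 1;

		/* Accept if at most 'event_margin' behind, and not ahead
		 * (a device ahead of the most recent would be a stray). */
		return dev_events + event_margin >= most_recent_events &&
		       dev_events <= most_recent_events;
	}

	int main(void)
	{
		/* Counts from the report: the stale drive is 1 behind. */
		unsigned long long good = 2834681, stale = 2834680;

		/* Prints 1: the stale drive passes the check. */
		printf("one behind accepted: %d\n",
		       events_acceptable(stale, good));
		/* Prints 0: two behind would be rejected. */
		printf("two behind accepted: %d\n",
		       events_acceptable(good - 2, good));
		return 0;
	}

This is why the role check quoted above (MD_DISK_ROLE_FAULTY in
sb->dev_roles) matters: the event count alone cannot distinguish the
20-minutes-stale drive in this scenario from a drive that merely missed
the last superblock write during a crash.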