Re: raid6 corruption after assembling with event counter difference of 1

Hi Yu Kuai,

On Wed, Nov 22, 2023 at 1:36 PM Yu Kuai <yukuai1@xxxxxxxxxxxxxxx> wrote:
>
> Hi,
>
> On 2023/11/22 17:42, Alexander Lyakas wrote:
> > Hello Song Liu,
> >
> > We had a raid6 with 6 drives, all drives marked as In_sync. At some
> > point the drive in slot 5 (the last drive) was marked as Faulty due to
> > an IO timeout error. The superblocks of all other drives got updated
> > with an event count higher by 1. However, the Faulty drive was still
> > not ejected from the array by remove_and_add_spares(), probably because
> > it still had nr_pending. This situation went on for 20 minutes, and
>
> I think this is important: what kind of driver are you using for the
> array? 20 minutes should be enough for the block layer timeout handling
> to finish all the IO. Did you try to remove this disk manually?
I am also not sure why the disk was not ejected from the array for 20
minutes. Removing the disk manually is not possible as long as md
holds the disk as an rdev. Only when md ejects the disk from the array
(via remove_and_add_spares()) does it become possible to remove the
disk from md (mdadm --manage --remove). Anyway, all this was happening
on a production system, without human attendance at that time.
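
To make the condition concrete, here is a toy user-space model of why
the Faulty member lingers (illustrative names only, not the actual md
structures or code): a member marked Faulty can only be ejected once
its in-flight I/O counter (nr_pending in md terms) has dropped to zero.

/* Toy user-space model -- not actual md code. */
#include <stdbool.h>
#include <stdio.h>

struct toy_member {
        bool faulty;
        int nr_pending; /* in-flight requests still referencing the member */
};

static bool can_eject(const struct toy_member *m)
{
        return m->faulty && m->nr_pending == 0;
}

int main(void)
{
        struct toy_member slot5 = { .faulty = true, .nr_pending = 3 };

        printf("ejectable: %d\n", can_eject(&slot5)); /* 0: IO still pending */
        slot5.nr_pending = 0;
        printf("ejectable: %d\n", can_eject(&slot5)); /* 1: can be removed */
        return 0;
}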

>
> > the Faulty drive was still not being removed from the array. But the
> > array continued serving writes, skipping the Faulty drive as designed.
> >
> > After about 20 minutes, the machine got rebooted due to some other reason.
>
> md_new_event is called from do_md_stop(), which means that if this
> array had been stopped, the event counter on the other drives would be
> higher by 2, and this problem would not exist anymore. So I guess your
> array was not stopped normally during the reboot, and your case really
> is the same as a crash while updating the superblock.
Correct. The array was not stopped (via mdadm --stop). Some of the
higher-level scripts decided to hard-reboot the machine.

>
> There is a simple way to avoid your problem: just increase the event
> counter by 2 in md_error().
>
> Thanks,
> Kuai
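
As a rough illustration of the effect of that suggestion, here is a
user-space toy model (not actual md or mdadm code; the counter value is
taken from the report quoted below, and the acceptance rule mirrors the
mdadm snippet in [1]): bumping the event counter by 2 when a member
fails would push the stale member outside the tolerated difference of 1.

/* Toy model, user-space only. */
#include <stdbool.h>
#include <stdio.h>

/* Mirrors the event_margin == 1 rule quoted from mdadm in [1] below. */
static bool accepted_on_assembly(unsigned long long dev_events,
                                 unsigned long long most_recent)
{
        return dev_events + 1 >= most_recent && dev_events <= most_recent;
}

int main(void)
{
        unsigned long long faulty_dev = 2834680; /* counter from the report */

        /* Today: survivors move ahead by 1, stale member is still accepted. */
        printf("+1 bump: accepted=%d\n",
               accepted_on_assembly(faulty_dev, faulty_dev + 1)); /* 1 */

        /* Suggested: survivors move ahead by 2, stale member is rejected. */
        printf("+2 bump: accepted=%d\n",
               accepted_on_assembly(faulty_dev, faulty_dev + 2)); /* 0 */
        return 0;
}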
> >
> > After reboot, the array got assembled, and the event counter
> > difference was 1 between the problematic drive and all other drives.
> > The event count on all drives was 2834681, but on the problematic
> > drive it was 2834680. As a result, mdadm considered the problematic
> > drive as up-to-date, due to this code in mdadm [1]. The kernel also
> > accepted such a difference of 1, as can be seen in super_1_validate() [2].
> >
> > In addition, the array was marked as dirty, so a RESYNC of the array
> > started. For raid6, to my understanding, resync re-calculates parity
> > blocks based on data blocks. But many data blocks on the problematic
> > drive were not up to date, because this drive had been marked as Faulty
> > for 20 minutes and writes to it were skipped. As a result, RESYNC made
> > the parity blocks match the not-up-to-date data blocks from the
> > problematic drive (see the simplified sketch after the quoted message
> > below). Data on the array became unusable.
> >
> > Many years ago, I asked Neil why an event count difference of 1 was
> > allowed. He responded that this was to address the case when the
> > machine crashes in the middle of the superblock writes, so some
> > superblock writes succeeded and some failed. In such a case, allowing
> > an event count difference of 1 is legitimate.
> >
> > Can you please comment on whether this behavior seems correct, in
> > light of the scenario above?
> >
> > Thanks,
> > Alex.
> >
> > [1]
> > int event_margin = 1; /* always allow a difference of '1'
> >                        * like the kernel does
> >                        */
> > ...
> > /* Require event counter to be same as, or just less than,
> >  * most recent.  If it is bigger, it must be a stray spare and
> >  * should be ignored.
> >  */
> > if (devices[j].i.events+event_margin >=
> >     devices[most_recent].i.events &&
> >     devices[j].i.events <=
> >     devices[most_recent].i.events
> >     ) {
> >         devices[j].uptodate = 1;
> >
> > [2]
> > } else if (mddev->pers == NULL) {
> >         /* Insist of good event counter while assembling, except for
> >          * spares (which don't need an event count) */
> >         ++ev1;
> >         if (rdev->desc_nr >= 0 &&
> >             rdev->desc_nr < le32_to_cpu(sb->max_dev) &&
> >             (le16_to_cpu(sb->dev_roles[rdev->desc_nr]) < MD_DISK_ROLE_MAX ||
> >              le16_to_cpu(sb->dev_roles[rdev->desc_nr]) == MD_DISK_ROLE_JOURNAL))
> >                 if (ev1 < mddev->events)
> >                         return -EINVAL;
> >
> >
> > .
> >
>
>
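
For completeness, a very simplified single-parity sketch of the resync
problem described in the quoted message above (real raid6 uses P and Q
parity, and all values below are made up): resync recomputes parity from
whatever the data blocks currently contain, so it makes the parity agree
with the stale blocks instead of detecting or repairing them.

/* Simplified XOR-parity illustration -- not raid6, values made up. */
#include <stdio.h>

int main(void)
{
        /* What the filesystem actually wrote while slot 5 was Faulty. */
        unsigned char written[3] = { 0x11, 0x22, 0x33 };

        /* On-disk contents: the last member missed the write and still
         * holds stale data. */
        unsigned char on_disk[3] = { 0x11, 0x22, 0x99 };

        /* Resync recomputes parity from the data blocks as they are,
         * including the stale one. */
        unsigned char parity = on_disk[0] ^ on_disk[1] ^ on_disk[2];

        /* The stripe is now internally consistent... */
        printf("parity check: %s\n",
               (on_disk[0] ^ on_disk[1] ^ on_disk[2] ^ parity) == 0 ? "OK" : "BAD");

        /* ...but the stale block no longer matches what was written. */
        printf("last member: wrote 0x%02x, array now returns 0x%02x\n",
               written[2], on_disk[2]);
        return 0;
}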




