Re: raid6 corruption after assembling with event counter difference of 1

Alexander Lyakas <alex.bolshoy@xxxxxxxxx> · Sun, 26 Nov 2023 11:17:59 +0200

Hi Song,

Thank you for your response.

On Fri, Nov 24, 2023 at 1:58 AM Song Liu <song@xxxxxxxxxx> wrote:
>
> Hi Alexander,
>
> Thanks for the report.
>
> On Wed, Nov 22, 2023 at 1:42 AM Alexander Lyakas <alex.bolshoy@xxxxxxxxx> wrote:
> >
> > Hello Song Liu,
> >
> > We had a raid6 with 6 drives, all drives marked as In_sync. At some
> > point drive in slot 5 (last drive) was marked as Faulty, due to
> > timeout IO error. Superblocks of all other drives got updated with
> > event count higher by 1. However, the Faulty drive was still not
> > ejected from the array by remove_and_add_spares(), probably because it
> > still had nr_pending. This situation was going on for 20 minutes, and
> > the Faulty drive was still not being removed from the array. But array
> > continued serving writes, skipping the Faulty drive as designed.
> >
> > After about 20 minutes, the machine got rebooted due to some other reason.
> >
> > After reboot, the array got assembled, and the event counter
> > difference was 1 between the problematic drive and all other drives.
> > Even count on all drives was 2834681, but on the problematic drive it
> > was 2834680. As a result, mdadm considered the problematic drive as
> > up-to-date, due to this code in mdadm[1]. Kernel also accepted such
> > difference of 1, as can be seen in super_1_validate() [2].
>
> Did super_1_validate() set the bad drive as Faulty?
>
>                 switch(role) {
>                 case MD_DISK_ROLE_SPARE: /* spare */
>                         break;
>                 case MD_DISK_ROLE_FAULTY: /* faulty */
>                         set_bit(Faulty, &rdev->flags);
>                         break;
>
> AFAICT, this should stop the bad drive from going bad to the array.
> If this doesn't work, there might be some bug somewhere.
This can never happen. The reason is that a superblock of a particular
device will never record about *itself* that it is Faulty. When a
device fails, we mark it as Faulty in memory, and then md_update_sb()
will skip updating the superblock of this particular device. I have
discussed this also many years ago with Neil[1].

Can you please comment whether you believe that treating event counter
difference 1 as "still up to date" is safe?

Thanks,
Alex.

[1]
Quoting from https://www.spinics.net/lists/raid/msg37003.html
Question:
When a drive fails, the kernel skips updating its superblock, and
updates all other superblocks that this drive is Faulty. How can it
happen that a drive can mark itself as Faulty in its own superblock? I
saw code in mdadm checking for this.
Answer:
It cannot, as you say. I don't remember why mdadm checks for that.
Maybe a very old version of the kernel code could do that.