On Sun, Nov 26, 2023 at 1:18 AM Alexander Lyakas <alex.bolshoy@xxxxxxxxx> wrote:
>
> Hi Song,
>
> Thank you for your response.
>
> On Fri, Nov 24, 2023 at 1:58 AM Song Liu <song@xxxxxxxxxx> wrote:
> >
> > Hi Alexander,
> >
> > Thanks for the report.
> >
> > On Wed, Nov 22, 2023 at 1:42 AM Alexander Lyakas <alex.bolshoy@xxxxxxxxx> wrote:
> > >
> > > Hello Song Liu,
> > >
> > > We had a raid6 with 6 drives, all drives marked as In_sync. At some
> > > point the drive in slot 5 (the last drive) was marked as Faulty due to
> > > an IO timeout error. The superblocks of all other drives got updated with
> > > an event count higher by 1. However, the Faulty drive was still not
> > > ejected from the array by remove_and_add_spares(), probably because it
> > > still had nr_pending. This situation went on for 20 minutes, and the
> > > Faulty drive was still not removed from the array. But the array
> > > continued serving writes, skipping the Faulty drive as designed.
> > >
> > > After about 20 minutes, the machine got rebooted due to some other reason.
> > >
> > > After the reboot, the array got assembled, and the event counter
> > > difference was 1 between the problematic drive and all other drives.
> > > The event count on all drives was 2834681, but on the problematic drive
> > > it was 2834680. As a result, mdadm considered the problematic drive as
> > > up-to-date, due to this code in mdadm [1]. The kernel also accepted
> > > such a difference of 1, as can be seen in super_1_validate() [2].
> >
> > Did super_1_validate() set the bad drive as Faulty?
> >
> > switch(role) {
> > case MD_DISK_ROLE_SPARE: /* spare */
> >         break;
> > case MD_DISK_ROLE_FAULTY: /* faulty */
> >         set_bit(Faulty, &rdev->flags);
> >         break;
> >
> > AFAICT, this should stop the bad drive from going back into the array.
> > If this doesn't work, there might be some bug somewhere.
>
> This can never happen. The reason is that the superblock of a particular
> device will never record about *itself* that it is Faulty.
> When a device fails, we mark it as Faulty in memory, and then
> md_update_sb() will skip updating the superblock of this particular
> device. I have discussed this many years ago with Neil as well [1].

Thanks for pointing this out. I think this is the actual problem here.

In analyze_sbs() we first run validate_super() on the freshest device.
Then, we run validate_super() on the other devices. We should mark a
device as Faulty based on the sb of the freshest device (which is not
the case at the moment). I think we need to fix this.

> Can you please comment whether you believe that treating an event
> counter difference of 1 as "still up to date" is safe?

With the above issue fixed, allowing the event counter to be off by 1
is safe.

Does this make sense? Did I miss any cases?

Thanks,
Song

> Thanks,
> Alex.
>
> [1]
> Quoting from https://www.spinics.net/lists/raid/msg37003.html
> Question:
> When a drive fails, the kernel skips updating its superblock, and
> updates all other superblocks that this drive is Faulty. How can it
> happen that a drive can mark itself as Faulty in its own superblock? I
> saw code in mdadm checking for this.
> Answer:
> It cannot, as you say. I don't remember why mdadm checks for that.
> Maybe a very old version of the kernel code could do that.