Hi Alexander,

Thanks for the report.

On Wed, Nov 22, 2023 at 1:42 AM Alexander Lyakas <alex.bolshoy@xxxxxxxxx> wrote:
>
> Hello Song Liu,
>
> We had a raid6 with 6 drives, all drives marked as In_sync. At some
> point the drive in slot 5 (the last drive) was marked as Faulty due to
> a timeout IO error. The superblocks of all other drives got updated
> with an event count higher by 1. However, the Faulty drive was still
> not ejected from the array by remove_and_add_spares(), probably
> because it still had nr_pending. This situation went on for 20
> minutes, and the Faulty drive was still not removed from the array.
> But the array continued serving writes, skipping the Faulty drive as
> designed.
>
> After about 20 minutes, the machine was rebooted for some other reason.
>
> After reboot, the array was assembled, and the event counter
> difference between the problematic drive and all other drives was 1.
> The event count on all drives was 2834681, but on the problematic
> drive it was 2834680. As a result, mdadm considered the problematic
> drive up to date, due to this code in mdadm [1]. The kernel also
> accepted such a difference of 1, as can be seen in
> super_1_validate() [2].

Did super_1_validate() set the bad drive as Faulty?

		switch(role) {
		case MD_DISK_ROLE_SPARE: /* spare */
			break;
		case MD_DISK_ROLE_FAULTY: /* faulty */
			set_bit(Faulty, &rdev->flags);
			break;

AFAICT, this should stop the bad drive from going back into the array.
If this doesn't work, there might be a bug somewhere.

Thanks,
Song

> In addition, the array was marked as dirty, so a RESYNC of the array
> started. For raid6, to my understanding, resync recalculates parity
> blocks based on the data blocks. But many data blocks on the
> problematic drive were not up to date, because the drive had been
> marked Faulty for 20 minutes and writes to it were skipped. As a
> result, RESYNC made the parity blocks match the stale data blocks
> from the problematic drive. Data on the array became unusable.
>
> Many years ago, I asked Neil why an event count difference of 1 was
> allowed. He responded that this was to handle the case where the
> machine crashes in the middle of superblock writes, so some
> superblock writes succeeded and some failed. In such a case, allowing
> an event count difference of 1 is legitimate.
>
> Can you please comment on whether this behavior is correct, in light
> of the scenario above?
>
> Thanks,
> Alex.
>
> [1]
>         int event_margin = 1; /* always allow a difference of '1'
>                                * like the kernel does
>                                */
> ...
>         /* Require event counter to be same as, or just less than,
>          * most recent.  If it is bigger, it must be a stray spare and
>          * should be ignored.
>          */
>         if (devices[j].i.events+event_margin >=
>             devices[most_recent].i.events
>             &&
>             devices[j].i.events <=
>             devices[most_recent].i.events
>            ) {
>                 devices[j].uptodate = 1;
>
> [2]
> 	} else if (mddev->pers == NULL) {
> 		/* Insist of good event counter while assembling, except for
> 		 * spares (which don't need an event count) */
> 		++ev1;
> 		if (rdev->desc_nr >= 0 &&
> 		    rdev->desc_nr < le32_to_cpu(sb->max_dev) &&
> 		    (le16_to_cpu(sb->dev_roles[rdev->desc_nr]) < MD_DISK_ROLE_MAX ||
> 		     le16_to_cpu(sb->dev_roles[rdev->desc_nr]) == MD_DISK_ROLE_JOURNAL))
> 			if (ev1 < mddev->events)
> 				return -EINVAL;
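
[Editor's note: to make the arithmetic in [1] and [2] concrete, here is a
small standalone C sketch, not the actual mdadm or kernel code, of the
acceptance rule both snippets implement. The function name and the test
values in main() are illustrative only; the counts are the ones from the
report. mdadm compares with event_margin = 1, and super_1_validate()
increments ev1 before comparing, which is the same rule: a device exactly
one event behind the most recent count is still accepted.]

	#include <stdio.h>

	/* Hypothetical model of the event-count acceptance rule shared by
	 * mdadm (event_margin = 1) and super_1_validate() (++ev1 before
	 * the comparison). Not the real code; just the arithmetic. */
	static int events_acceptable(unsigned long long dev_events,
				     unsigned long long most_recent_events)
	{
		const unsigned long long event_margin = 1;

		/* Accept if at most 'event_margin' behind, and not ahead
		 * (a device ahead of the most recent would be a stray). */
		return dev_events + event_margin >= most_recent_events &&
		       dev_events <= most_recent_events;
	}

	int main(void)
	{
		/* Counts from the report: the stale drive is 1 behind. */
		unsigned long long good = 2834681, stale = 2834680;

		/* Prints 1: the stale drive passes the check. */
		printf("one behind accepted: %d\n",
		       events_acceptable(stale, good));
		/* Prints 0: two behind would be rejected. */
		printf("two behind accepted: %d\n",
		       events_acceptable(good - 2, good));
		return 0;
	}

This is why the role check quoted above (MD_DISK_ROLE_FAULTY in
sb->dev_roles) matters: the event count alone cannot distinguish the
20-minutes-stale drive in this scenario from a drive that merely missed
the last superblock write during a crash.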