Hi, On 2023/11/22 17:42, Alexander Lyakas wrote:
Hello Song Liu, We had a raid6 with 6 drives, all marked as In_sync. At some point the drive in slot 5 (the last drive) was marked as Faulty due to an IO timeout error. The superblocks of all other drives were updated with an event count higher by 1. However, the Faulty drive was still not ejected from the array by remove_and_add_spares(), probably because it still had nr_pending. This situation went on for 20 minutes, and
I think this is important: what kind of driver are you using for the array? 20 minutes should be enough for the block layer timeout handling to finish all the IO. Did you try to remove this disk manually?
the Faulty drive was still not being removed from the array. But the array continued serving writes, skipping the Faulty drive as designed. After about 20 minutes, the machine was rebooted for an unrelated reason.
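A rough, self-contained user-space illustration of the gate the report points at (this is not md code; struct member and can_remove() are hypothetical names, and nr_pending here only mirrors the rdev field mentioned above): a Faulty member can only be ejected once no in-flight IO references it any more, so a counter that never drops to zero keeps the drive in the array indefinitely.

#include <stdbool.h>
#include <stdio.h>

struct member {
	bool faulty;
	int nr_pending;		/* in-flight IOs still referencing this member */
};

/* Returns true only when a Faulty member can actually be ejected. */
static bool can_remove(const struct member *m)
{
	return m->faulty && m->nr_pending == 0;
}

int main(void)
{
	struct member stuck = { .faulty = true, .nr_pending = 3 };

	/* As long as nr_pending never drops to 0, the Faulty drive stays
	 * in the array, which is what the report describes for ~20 min. */
	printf("can remove: %d\n", can_remove(&stuck));	/* prints 0 */
	return 0;
}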
md_new_event() is called from do_md_stop(), which means that if this array had been stopped, the event counter on the other drives would be higher by 2, and this problem would not exist anymore. So I guess your array was not stopped normally during the reboot, and your case is really the same as a crash while updating the superblock. There is a simple way to avoid your problem: bump the event counter by 2 in md_error().

Thanks,
Kuai
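To make the arithmetic of that suggestion concrete, here is a minimal user-space sketch (not a kernel patch; accepted_at_assembly() is a hypothetical name) of the tolerance in super_1_validate(), quoted below as [2]: the candidate's event count is bumped by one before the comparison, so a drive exactly one event behind is still accepted, while a drive two or more behind is rejected. That is why advancing the counter by 2 in md_error() would keep a stale Faulty member out of the assembled array.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Mirrors the shape of the check in [2]: "++ev1; if (ev1 < mddev->events)
 * return -EINVAL;" -- i.e. accept the drive unless it is 2+ events behind. */
static bool accepted_at_assembly(uint64_t drive_events, uint64_t array_events)
{
	uint64_t ev1 = drive_events + 1;

	return !(ev1 < array_events);
}

int main(void)
{
	uint64_t array_events = 100;

	/* Today md_error() leaves the failed drive exactly 1 event behind. */
	printf("1 behind accepted: %d\n",
	       accepted_at_assembly(array_events - 1, array_events)); /* 1 */

	/* If md_error() advanced the counter by 2, the stale drive would be
	 * at least 2 behind and would be refused at assembly. */
	printf("2 behind accepted: %d\n",
	       accepted_at_assembly(array_events - 2, array_events)); /* 0 */
	return 0;
}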
After reboot, the array got assembled, and the event counter difference was 1 between the problematic drive and all other drives. The event count on all drives was 2834681, but on the problematic drive it was 2834680. As a result, mdadm considered the problematic drive up-to-date, due to this code in mdadm [1]. The kernel also accepted such a difference of 1, as can be seen in super_1_validate() [2].

In addition, the array was marked as dirty, so a RESYNC of the array started. For raid6, to my understanding, resync re-calculates parity blocks based on data blocks. But many data blocks on the problematic drive were not up to date, because this drive had been marked as Faulty for 20 minutes and writes to it were skipped. As a result, RESYNC made the parity blocks match the not-up-to-date data blocks from the problematic drive. Data on the array became unusable.

Many years ago, I asked Neil why an event count difference of 1 was allowed. He responded that this was to address the case where the machine crashes in the middle of superblock writes, so some superblock writes succeeded and some failed. In such a case, allowing an event count difference of 1 is legitimate.

Can you please comment on whether this behavior seems correct, in light of the scenario above?

Thanks,
Alex.

[1]
int event_margin = 1; /* always allow a difference of '1'
                       * like the kernel does */
...
/* Require event counter to be same as, or just less than,
 * most recent. If it is bigger, it must be a stray spare and
 * should be ignored. */
if (devices[j].i.events+event_margin >= devices[most_recent].i.events &&
    devices[j].i.events <= devices[most_recent].i.events) {
        devices[j].uptodate = 1;

[2]
} else if (mddev->pers == NULL) {
        /* Insist of good event counter while assembling, except for
         * spares (which don't need an event count) */
        ++ev1;
        if (rdev->desc_nr >= 0 &&
            rdev->desc_nr < le32_to_cpu(sb->max_dev) &&
            (le16_to_cpu(sb->dev_roles[rdev->desc_nr]) < MD_DISK_ROLE_MAX ||
             le16_to_cpu(sb->dev_roles[rdev->desc_nr]) == MD_DISK_ROLE_JOURNAL))
                if (ev1 < mddev->events)
                        return -EINVAL;
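With the event counts from this report plugged into the mdadm condition quoted in [1], the stale drive is indeed classified as up to date. A tiny self-contained sketch (not mdadm code; uptodate() is a hypothetical stand-in for the quoted logic):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-in for the mdadm check in [1]: a drive counts as up to date if it
 * is within event_margin (1) below the most recent count and not ahead. */
static bool uptodate(uint64_t drive_events, uint64_t most_recent_events)
{
	const uint64_t event_margin = 1;	/* "like the kernel does" */

	return drive_events + event_margin >= most_recent_events &&
	       drive_events <= most_recent_events;
}

int main(void)
{
	/* 2834681 on the healthy drives, 2834680 on the drive that had been
	 * Faulty (and skipping writes) for ~20 minutes before the reboot. */
	printf("stale drive uptodate: %d\n", uptodate(2834680, 2834681)); /* 1 */
	return 0;
}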