On Sun, Nov 26, 2023 at 1:18 AM Alexander Lyakas <alex.bolshoy@xxxxxxxxx> wrote:
>
> Hi Song,
>
> Thank you for your response.
>
> On Fri, Nov 24, 2023 at 1:58 AM Song Liu <song@xxxxxxxxxx> wrote:
> >
> > Hi Alexander,
> >
> > Thanks for the report.
> >
> > On Wed, Nov 22, 2023 at 1:42 AM Alexander Lyakas <alex.bolshoy@xxxxxxxxx> wrote:
> > >
> > > Hello Song Liu,
> > >
> > > We had a raid6 with 6 drives, all drives marked as In_sync. At some
> > > point the drive in slot 5 (the last drive) was marked as Faulty due to
> > > an IO timeout error. The superblocks of all other drives got updated with
> > > an event count higher by 1. However, the Faulty drive was still not
> > > ejected from the array by remove_and_add_spares(), probably because it
> > > still had nr_pending. This situation went on for 20 minutes, and the
> > > Faulty drive was still not removed from the array. But the array
> > > continued serving writes, skipping the Faulty drive as designed.
> > >
> > > After about 20 minutes, the machine got rebooted due to some other reason.
> > >
> > > After the reboot, the array got assembled, and the event counter
> > > difference was 1 between the problematic drive and all other drives.
> > > The event count on all drives was 2834681, but on the problematic drive
> > > it was 2834680. As a result, mdadm considered the problematic drive as
> > > up-to-date, due to this code in mdadm [1]. The kernel also accepted
> > > such a difference of 1, as can be seen in super_1_validate() [2].
> >
> > Did super_1_validate() set the bad drive as Faulty?
> >
> > switch(role) {
> > case MD_DISK_ROLE_SPARE: /* spare */
> >         break;
> > case MD_DISK_ROLE_FAULTY: /* faulty */
> >         set_bit(Faulty, &rdev->flags);
> >         break;
> >
> > AFAICT, this should stop the bad drive from going back into the array.
> > If this doesn't work, there might be some bug somewhere.
>
> This can never happen. The reason is that the superblock of a particular
> device will never record about *itself* that it is Faulty.
> When a device fails, we mark it as Faulty in memory, and then
> md_update_sb() will skip updating the superblock of this particular
> device. I have discussed this many years ago with Neil as well [1].

Thanks for pointing this out. I think this is the actual problem here.

In analyze_sbs() we first run validate_super() on the freshest device.
Then, we run validate_super() on the other devices. We should mark a
device as Faulty based on the sb of the freshest device (which is not
the case at the moment). I think we need to fix this.

> Can you please comment whether you believe that treating an event
> counter difference of 1 as "still up to date" is safe?

With the above issue fixed, allowing the event counter to be off by 1
is safe.

Does this make sense? Did I miss any cases?

Thanks,
Song

> Thanks,
> Alex.
>
> [1]
> Quoting from https://www.spinics.net/lists/raid/msg37003.html
> Question:
> When a drive fails, the kernel skips updating its superblock, and
> updates all other superblocks that this drive is Faulty. How can it
> happen that a drive can mark itself as Faulty in its own superblock? I
> saw code in mdadm checking for this.
> Answer:
> It cannot, as you say. I don't remember why mdadm checks for that.
> Maybe a very old version of the kernel code could do that.