Re: RAID6 12 device assemble force failure

On Tue, 2 Jul 2024 19:47:52 +0200
Adam Niescierowicz <adam.niescierowicz@xxxxxxxxxx> wrote:

> >>>> What can I do to start this array?  
> >>>    You may try to add them manually. I know that there is
> >>> --re-add functionality but I've never used it. Maybe something like
> >>> this would work:
> >>> # mdadm --remove /dev/md126 <failed_drive>
> >>> # mdadm --re-add /dev/md126 <failed_drive>
> >> I tried this, but it didn't help.
> > Please provide the logs then (possibly with -vvvvv); maybe I or someone
> > else can help.
> 
> Logs
> ---
> 
> # mdadm --run -vvvvv /dev/md126
> mdadm: failed to start array /dev/md/card1pport2chassis1: Input/output error
> 
> # mdadm --stop /dev/md126
> mdadm: stopped /dev/md126
> 
> # mdadm --assemble --force -vvvvv /dev/md126 /dev/sdq1 /dev/sdv1 
> /dev/sdr1 /dev/sdu1 /dev/sdz1 /dev/sdx1 /dev/sdk1 /dev/sds1 /dev/sdm1 
> /dev/sdn1 /dev/sdw1 /dev/sdt1
> mdadm: looking for devices for /dev/md126
> mdadm: /dev/sdq1 is identified as a member of /dev/md126, slot -1.
> mdadm: /dev/sdv1 is identified as a member of /dev/md126, slot 1.
> mdadm: /dev/sdr1 is identified as a member of /dev/md126, slot 6.
> mdadm: /dev/sdu1 is identified as a member of /dev/md126, slot -1.
> mdadm: /dev/sdz1 is identified as a member of /dev/md126, slot 11.
> mdadm: /dev/sdx1 is identified as a member of /dev/md126, slot 9.
> mdadm: /dev/sdk1 is identified as a member of /dev/md126, slot -1.
> mdadm: /dev/sds1 is identified as a member of /dev/md126, slot 7.
> mdadm: /dev/sdm1 is identified as a member of /dev/md126, slot 3.
> mdadm: /dev/sdn1 is identified as a member of /dev/md126, slot 2.
> mdadm: /dev/sdw1 is identified as a member of /dev/md126, slot 4.
> mdadm: /dev/sdt1 is identified as a member of /dev/md126, slot 0.
> mdadm: added /dev/sdv1 to /dev/md126 as 1
> mdadm: added /dev/sdn1 to /dev/md126 as 2
> mdadm: added /dev/sdm1 to /dev/md126 as 3
> mdadm: added /dev/sdw1 to /dev/md126 as 4
> mdadm: no uptodate device for slot 5 of /dev/md126
> mdadm: added /dev/sdr1 to /dev/md126 as 6
> mdadm: added /dev/sds1 to /dev/md126 as 7
> mdadm: no uptodate device for slot 8 of /dev/md126
> mdadm: added /dev/sdx1 to /dev/md126 as 9
> mdadm: no uptodate device for slot 10 of /dev/md126
> mdadm: added /dev/sdz1 to /dev/md126 as 11
> mdadm: added /dev/sdq1 to /dev/md126 as -1
> mdadm: added /dev/sdu1 to /dev/md126 as -1
> mdadm: added /dev/sdk1 to /dev/md126 as -1
> mdadm: added /dev/sdt1 to /dev/md126 as 0
> mdadm: /dev/md126 assembled from 9 drives and 3 spares - not enough to 
> start the array.
> ---

Could you please share the logs from the --re-add attempt? In the meantime I
will try to simulate this scenario.
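
For reference, comparing the event counters and device roles recorded in
each member's superblock usually shows why three members ended up reported
as spares (slot -1). A sketch, with the device names taken from your log:

# mdadm --examine /dev/sd[qvruzxksmnwt]1 | grep -E '^/dev|Events|Device Role'
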
> 
> Can somebody explain the behavior of the array to me? (theory)
> 
> This is RAID-6, so after two disks are disconnected it still works fine.
> Then, when a third disk disconnects, the array should stop as faulty, yes?
> If the array stops as faulty, the data on the array and on the third
> disconnected disk should be the same, yes?

If you recover only one drive (and start the array doubly degraded), it may
lead to the RAID write hole (RWH) problem.

If there were writes during the disk failures, we don't know which in-flight
requests completed. The XOR-based calculations may then lead to incorrect
results for some sectors (we need to read all the remaining disks and XOR
the data to reconstruct the data for the two missing drives).
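
As a toy illustration of the XOR part (single-parity case, made-up byte
values): with P = D0 ^ D1 ^ D2, a missing D1 is rebuilt as P ^ D0 ^ D2.
RAID-6 additionally keeps a second, Reed-Solomon based syndrome so that two
missing drives can be reconstructed:

# printf '%x\n' $(( (0x12 ^ 0x34 ^ 0x56) ^ 0x12 ^ 0x56 ))
34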

But.. if you add all the disks back, in the worst case we will read outdated
data, and your filesystem should be able to recover from that.

So yes, it should be fine if you start the array with all drives.
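
Once it assembles, it is worth verifying the member states before mounting,
e.g.:

# cat /proc/mdstat
# mdadm --detail /dev/md126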

Mariusz



