Re: RAID6 12 device assemble force failure

Hello Adam,
I hope you have a backup! To quote the RAID wiki linked below:

"Remember, RAID is not a backup! If you lose redundancy, you need to take a
backup!"

I'm not a native RAID expert, but I will try to give you some clues.

On Sat, 29 Jun 2024 17:17:54 +0200
Adam Niescierowicz <adam.niescierowicz@xxxxxxxxxx> wrote:

> Hi,
> 
> I have a RAID 6 array of 12 disks attached via an external SAS backplane 
> connected by 4 LUNs to the server. After some problems with the backplane, 
> 3 disks went offline (within one second) and the array stopped.
> 

So the array is considered failed by mdadm, and that failure is persisted in the
device states recorded in the metadata.

>     Device Role : spare
>     Array State : AAAAA.AA.A.A ('A' == active, '.' == missing, 'R' == 
> replacing)

3 missing = failed RAID 6 array (RAID 6 can survive losing at most 2 devices).

> I think the problem is that the disks are recognised as spare, but why?

Because mdadm cannot trust them: they are recorded as "missing", so they are no
longer configured as raid devices ("spare" is the default role).
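
Before forcing anything further, it may be worth comparing what each member still
reports. Something like this (just a sketch, run it for every member partition of
the array) shows the recorded role, array state and event counter per device, so
you can see how far the dropped disks drifted from the rest:

#mdadm --examine <member_device> | grep -E 'Device Role|Array State|Events|Update Time'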

> I tried with `mdadm --assemble --force --update=force-no-bbl 

That removes the bad block list, but it does not revert the devices from "missing"
back to "active".

> /dev/sd{q,p,o,n,m,z,y,z,w,t,s,r}1` and now mdadm -E shows
> 
> 
> ---
> 
>            Magic : a92b4efc
>          Version : 1.2
>      Feature Map : 0x1
>       Array UUID : f8fb0d5d:5cacae2e:12bf1656:18264fb5
>             Name : backup:card1port1chassis2
>    Creation Time : Tue Jun 18 20:07:19 2024
>       Raid Level : raid6
>     Raid Devices : 12
> 
>   Avail Dev Size : 39063382016 sectors (18.19 TiB 20.00 TB)
>       Array Size : 195316910080 KiB (181.90 TiB 200.00 TB)
>      Data Offset : 264192 sectors
>     Super Offset : 8 sectors
>     Unused Space : before=264104 sectors, after=0 sectors
>            State : clean
>      Device UUID : e726c6bc:11415fcc:49e8e0a5:041b69e4
> 
> Internal Bitmap : 8 sectors from superblock
>      Update Time : Fri Jun 28 22:21:57 2024
>         Checksum : 9ad1554c - correct
>           Events : 48640
> 
>           Layout : left-symmetric
>       Chunk Size : 512K
> 
>     Device Role : spare
>     Array State : AAAAA.AA.A.A ('A' == active, '.' == missing, 'R' == 
> replacing)
> ---
> 
> 
> What can I do to start this array?

You may try to add them back manually. I know there is --re-add functionality,
but I've never used it. Maybe something like this would work:
#mdadm --remove /dev/md126 <failed_drive>
#mdadm --re-add /dev/md126 <failed_drive>

If you recover one drive this way, the array should start, but the data might not
be consistent, please be aware of that!

The drive should then be shown as in sync in the mdadm --detail output.
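
To check whether the re-add worked and whether the array started and is resyncing,
you can look at the usual views:

#mdadm --detail /dev/md126
#cat /proc/mdstat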

I highly advise you to simulate this scenario first on a setup that is not
infrastructure-critical. As I said, I'm not a native RAID expert.
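
A cheap way to rehearse it is a throw-away array on loop devices; the sizes, file
names and loop numbers below are only an example (they assume loop0-loop11 end up
being the devices losetup picks), not anything from your real setup:

#truncate -s 1G /tmp/md-test-{0..11}.img
#for f in /tmp/md-test-*.img; do losetup -f "$f"; done
#mdadm --create /dev/md127 --level=6 --raid-devices=12 /dev/loop{0..11}
#mdadm /dev/md127 --fail /dev/loop3   # then practice --remove / --re-add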

For more suggestions, see:
https://raid.wiki.kernel.org/index.php/Replacing_a_failed_drive

Thanks,
Mariusz




