Have you tried to do a resync or repair of the RAID? I've written a bit about that here:

https://wiki.karlsbakk.net/index.php/Roy's_notes#Resync

I'd suggest 'repair', since that tends to fix things.

PS: If you don't have a backup, make one first. NEVER believe a RAID is a backup, please ;)

Kind regards,

roy
--
Roy Sigurd Karlsbakk
(+47) 98013356
http://blogg.karlsbakk.net/
GPG Public key: http://karlsbakk.net/roysigurdkarlsbakk.pubkey.txt
--
Hið góða skaltu í stein höggva, hið illa í snjó rita.
(Carve the good in stone; write the bad in snow.)

----- Original Message -----
> From: "Von Fugal" <von@xxxxxxxxx>
> To: "Linux Raid" <linux-raid@xxxxxxxxxxxxxxx>
> Sent: Monday, 4 July, 2022 19:41:27
> Subject: Re: Help recovering a RAID5, what seems to be a strange state
>
> I did get the array to reassemble. It's still strange to me having all
> devices removed but then listed again. Incremental adds always
> resulted in the bad state; what finally assembled the array was
> "mdadm -A --force /dev/md51", run with the array stopped and
> without any incremental adds.
>
> It's still doing recovery, but it looks good. I may follow up on this
> thread again if it goes south.
>
> Cheers!
>
> On Sun, Jul 3, 2022 at 3:57 PM Von Fugal <von@xxxxxxxxx> wrote:
>>
>> Tl;dr version:
>> I initially restored the partition tables with different end sectors,
>> started the RAIDs to ill effect, then restored the correct partition
>> tables; things seemed OK but degraded, until they weren't.
>>
>> Current state: 3 devices with the same event count, but the array is
>> "dirty", and it cannot start both degraded *and* dirty. I know the
>> array initially ran with sd[abd]4; when I added the "missing" sdc4,
>> it did something strange while attempting to resync.
>>
>> sdc4 is now a "spare" and cannot be added after an attempted
>> incremental run with the other 3.
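[Editorial note: the "same event count" check discussed in this thread can be scripted. A minimal sketch, assuming the usual `mdadm -E` output format with an "Events :" line; the sample string below stands in for real device output, which needs root to read:]

```shell
#!/bin/sh
# Sketch: before forcing an assembly, confirm the member superblocks
# agree on the event count. mdadm -E prints a line like
# "         Events : 249133" for each member.

events_of() {
    # pull the number out of an "Events : N" line on stdin
    awk -F: '/^ *Events/ { gsub(/ /, "", $2); print $2 }'
}

# Real use (as root):  mdadm -E /dev/sda4 | events_of
sample='         Events : 249133'
printf '%s\n' "$sample" | events_of    # prints 249133
```

Members whose counts differ only slightly are usually safe candidates for `--assemble --force`; a large gap means that member is badly stale.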
>> Either way, after trying to run the array, the table from 'mdadm -D'
>> looks similar to this:
>>
>>    Number   Major   Minor   RaidDevice   State
>>       -       0       0         0        removed
>>       -       0       0         1        removed
>>       -       0       0         2        removed
>>       -       0       0         3        removed
>>
>>       -       8      52         2        sync    /dev/sdd4
>>       -       8      36         -        spare   /dev/sdc4
>>       -       8      20         0        sync    /dev/sdb4
>>       -       8       4         1        sync    /dev/sda4
>>
>> Long story version follows.
>>
>> I have 4 drives partitioned into different RAID types; partition 4 is
>> a RAID5 across all 4 drives. For some reason my GPT partition tables
>> were all wiped, and I suspect benchmarking with fio (though I only
>> ever gave it an LVM volume to operate on). I boot SystemRescueCd, and
>> testdisk finds the original partitions, so I tell it to restore them.
>> So far, so good. I start assembling some arrays; others don't work
>> yet. LVM is starting to show contents it finds in the arrays
>> assembled so far (this is still within SystemRescueCd).
>>
>> Investigating the unassembled arrays, dmesg complains that the array
>> size changed. I find a suggestion to use "-U devicesize"; I believe
>> this was my first mistake. The arrays assemble, but LVM hangs
>> indefinitely at this point.
>>
>> I desperately search for any info I have on the partitions and
>> arrays, and I find a spreadsheet on my laptop with meticulous
>> partition detail. I find that some of the partition ends leave a gap
>> before the next partition begins. Whatever. I fix the partition
>> tables. This time, all the arrays assemble and LVM is happy! YES.
>>
>> However, each array has one missing partition member, and it's not
>> the same disk in each. That's strange. Still, my server is running;
>> I'm able to boot it normally, and Home Assistant is back up. I then
>> re-add each missing partition to each array (I believe this was my
>> second mistake). I go to bed while it reconstructs.
>>
>> In the morning, the array it was reconstructing is back to pending,
>> the RAID5 array in question is inactive, and it's reconstructing
>> something else. I remove each partition that I previously added to
>> each array (although the array in question doesn't even let me do
>> this). I stop the array in question and zero the superblock of the
>> partition I wanted to remove. I zero the superblocks on each other
>> removed partition. I then re-add each partition to each array and
>> let them resync. I now have 3 out of 5 fully operational, with one
>> more resync in progress.
>>
>> But my array in question is still kind of hosed. Here's where it's
>> strange. Rather than explain everything, here's the status from the
>> devices (mdadm -E) and the array (mdadm -D):
>> https://pastebin.com/Gyj8d7Z7
>>
>> Note the table at the end of mdadm -D (end of the paste). It shows
>> four devices "removed", then a gap, then 3 devices as 'sync'. If I
>> incrementally add the drives, it shows a "normal" table. But when I
>> try --run, it shows the odd table. If I add 3 drives with
>> --incremental (not the 'spare') and then run, it shows the pasted
>> table. If I try to add the fourth (the spare), dmesg says
>> "ADD_NEW_DISK not supported". If I add 3 drives including the
>> 'spare', the behavior is otherwise the same, but adding the fourth
>> drive complains that it can only be added as a spare, and that I
>> must use force-spare to add it (I suspect this would be my 3rd
>> mistake if I did it).
>>
>> I think I can force-run this array with sd[abd]4, but the normal
>> commands give errors when trying to do so. What's also strange is
>> that devices sd[abd]4 all have the same event count, yet trying to
>> start the array results in "cannot start dirty degraded array".
>
>
>
> --
> You keep up the good fight just as long as you feel you need to.
>   -- Ken Danagger
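[Editorial note: for readers hitting the same "cannot start dirty degraded array" error, the sequence that eventually worked in this thread (stop, then force-assemble with the three in-sync members, no incremental adds) can be sketched as a dry run. Device and array names are taken from the thread; every command is printed rather than executed, since `--force` rewrites superblocks, and you should image the disks first:]

```shell
#!/bin/sh
# Dry-run sketch of the recovery described above. Drop the 'run'
# wrapper (and run as root) to execute for real.

run() { printf '+ %s\n' "$*"; }

run mdadm --stop /dev/md51
run mdadm --assemble --force /dev/md51 /dev/sda4 /dev/sdb4 /dev/sdd4
# Once the array is up and resynced clean, re-add the former spare:
run mdadm --add /dev/md51 /dev/sdc4
```

The key detail from the follow-up is that `--assemble --force` was issued against a fully stopped array, naming only the three members with matching event counts; the fourth device goes back in afterwards as a new member and gets rebuilt.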