Re: Help recovering a RAID5, what seems to be a strange state

I did get the array to reassemble. It still seems strange to me that the
devices all show as removed and then are listed again below, but
incremental adds always led to that bad state. What finally assembled
the array was "mdadm -A --force /dev/md51", run with the array stopped
and without any prior incremental adds.
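
For the record, the whole fix boiled down to roughly the following (the
explicit member list in the comment is only an alternative in case
scanning doesn't find the devices):

    # stop any partial/incremental assembly first
    mdadm --stop /dev/md51

    # then force-assemble; listing the members explicitly also works,
    # e.g. "mdadm -A --force /dev/md51 /dev/sd[abcd]4"
    mdadm -A --force /dev/md51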

It's still doing recovery but it looks good. I may follow up on this
thread again if it goes south.
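
For anyone hitting the same situation, the recovery progress is easy to
watch with the usual:

    # live view of the resync/recovery percentage
    watch cat /proc/mdstat

    # or the state / rebuild-status lines from the array detail
    mdadm -D /dev/md51 | grep -iE 'state|rebuild'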

Cheers!

On Sun, Jul 3, 2022 at 3:57 PM Von Fugal <von@xxxxxxxxx> wrote:
>
> Tl;Dr version:
> I restored partition tables with different end sectors initially.
> Started raids to ill effect. Restored correct partition tables and
> things seemed OK but degraded until they weren't.
>
> Current state is 3 devices with the same event numbers, but the raid
> is "dirty" and cannot start degraded *and* dirty. I know the array
> initially ran with sd[abd]4, and when I added the "missing" sdc4 it
> did something strange while attempting to resync.
>
> sdc4 is now a "spare" but cannot be added after an attempted
> incremental run with the other 3. Either way, after trying to run the
> array, the table from 'mdadm -D' looks similar to this:
>
>     Number   Major   Minor   RaidDevice State
>        -       0        0        0      removed
>        -       0        0        1      removed
>        -       0        0        2      removed
>        -       0        0        3      removed
>
>        -       8       52        2      sync   /dev/sdd4
>        -       8       36        -      spare   /dev/sdc4
>        -       8       20        0      sync   /dev/sdb4
>        -       8        4        1      sync   /dev/sda4
>
> Long story version follows
>
> I have 4 drives partitioned into different raid types; partition 4 is
> a raid5 across all 4 drives. For some reason my gpt partition tables
> were all wiped, and I suspect benchmarking with fio (though I only
> ever gave it an lvm volume to operate on). I boot systemrescuecd and
> testdisk finds the original partitions, so I tell it to restore those.
> So far so good. I start assembling arrays; some come up, others don't
> work yet. lvm starts to show the contents it finds in the arrays
> assembled so far (this is still within systemrescuecd).
>
> Investigating the unassembled arrays, dmesg complains that the array
> size has changed. I find a suggestion to use "-U devicesize"; I
> believe this was my first mistake. The arrays assemble, but lvm hangs
> indefinitely at this point.
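>
> For reference, that assemble looked roughly like this (names are
> placeholders; it was whichever arrays wouldn't come up):
>
>     # update the recorded device size to match the (then wrong) partitions
>     mdadm -A -U devicesize /dev/mdX /dev/sd[abcd]N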
>
> I desperately search for any info I have on the partitions and arrays
> and I find a spreadsheet on my laptop that contains meticulous
> partition detail. I find that some of the partition ends leave a gap
> before the next partition begins. Whatever. I fix the partition
> tables. This time, all the arrays assemble and lvm is happy!! YES.
>
> However, each array has one missing partition member, and it's not the
> same disk on each. That's strange. Still, my server is running: I'm
> able to boot it normally and homeassistant is back up. I then re-add
> each missing partition to its array (I believe this was my second
> mistake). I go to bed while it reconstructs.
>
> In the morning, the array it was reconstructing is back to pending,
> the raid5 array in question is inactive, and it's reconstructing
> something else. I remove each partition that I previously added to
> each array (although the array in question doesn't even let me do
> this). I stop the array in question and zero the superblock of the
> partition I wanted to remove, and I zero the superblocks of the other
> removed partitions as well. I then re-add each partition to each
> array and let them resync. I now have 3 of the 5 arrays fully
> operational, with one more resync in progress.
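>
> Per array, that cleanup was roughly the following (placeholder names;
> the exact member differed per array):
>
>     mdadm /dev/mdX --remove /dev/sdY4     # preceded by --fail where the member was still active
>     mdadm --zero-superblock /dev/sdY4
>     mdadm /dev/mdX --add /dev/sdY4        # then let it resync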
>
> But my array in question is still kinda hosed. Here's where it's
> strange. Rather than explain everything, here's the status from the
> devices (mdadm -E) and the array (mdadm -D).
> https://pastebin.com/Gyj8d7Z7
>
> Note the table at the end of mdadm -D (end of the paste). It shows
> four devices as "removed", then a gap, then three devices as "sync".
> If I incrementally add the drives it shows a "normal" table, until I
> try --run, at which point it shows the odd table. If I incrementally
> add the 3 drives that aren't the 'spare' and then run, it shows the
> pasted table, and trying to add the fourth (the spare) gives
> "ADD_NEW_DISK not supported" in dmesg. If I instead add 3 drives
> including the 'spare', the behavior is otherwise the same, but adding
> the fourth drive complains that it can only be added as a spare and
> that I must use force-spare to do so (I suspect that would be my 3rd
> mistake if I did it).
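>
> Concretely, the sequence that gets me to the pasted table is roughly:
>
>     mdadm --stop /dev/md51
>     mdadm -I /dev/sdb4
>     mdadm -I /dev/sda4
>     mdadm -I /dev/sdd4
>     mdadm --run /dev/md51    # table flips to the "removed" rows shown above
>     mdadm -I /dev/sdc4       # refused; dmesg shows "ADD_NEW_DISK not supported"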
>
> I think I can force run this array with sd[abd]4 but the normal
> commands give errors when trying to do so. What's also strange is that
> devices sd[abd]4 all have the same event count, yet trying to start
> the array results in "cannot start dirty degraded array".
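>
> (By "normal commands" I mean things along the lines of:)
>
>     mdadm --run /dev/md51    # kernel refuses: "cannot start dirty degraded array"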



-- 
You keep up the good fight just as long as you feel you need to.
-- Ken Danagger


