On 19/11/2018 22:35, Jogchum Reitsma wrote:
> Hi,
>
> New to this list, I understand that I can post a problem with an md raid
> array here. If I'm wrong in that supposition, please let me know, and
> accept my apologies!
That's right. Here you get the experts, but they may take time
responding :-(
That said, though I wouldn't call myself an expert, I know enough to
think everything will be fine. How easy would it be for you to back up
the disk if you recover it read-only?
> I have a 4-disk raid5 array, of which two of the disks have been kicked
> out because of read errors. The disks are WD Blue 4TB disks, which are
> still under warranty.
Just looked at the spec of those drives. It looks a bit worrisome to me.
Take a look at the raid wiki:
https://raid.wiki.kernel.org/index.php/Linux_Raid
What I would really like to know is whether these drives support
SCT/ERC. If they don't, there is our first problem. I notice WD says
raids 0 and 1, not 5 ...
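A quick way to check, assuming smartmontools is installed (the drive
letter here is just an example):

  smartctl -l scterc /dev/sda

If the drive supports it, you can set a 7-second error recovery timeout
(the value is in tenths of a second):

  smartctl -l scterc,70,70 /dev/sda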
> I have reasonably recent backups, but I would still like to try to get
> the array alive again.
That shouldn't be hard.
> Funny thing is, mdadm --examine states the array as being raid0:
>
> /dev/md0:
>         Version : 1.2
>      Raid Level : raid0
>   Total Devices : 4
>     Persistence : Superblock is persistent
This seems to be a recent glitch in mdadm. Don't worry about it ...
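If you want to reassure yourself, examine the component devices rather
than /dev/md0 - their superblocks should still report the real raid
level, and you can compare the event counts at the same time. The
device names here are guesses; substitute your actual members:

  mdadm --examine /dev/sd[adef] | grep -E 'Raid Level|Events'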
> Maybe you noticed the fact that all disks are marked as spare, and that
> the event count of one of the disks, /dev/sdf, is different from the
> others'.
>
> I found some more occurrences of a raid5 being recognized as a raid0
> device, but not a real solution to this.
The solution, iirc, was just to stop the array and re-assemble it - as
soon as the array was running, it sorted itself out.
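Something like this, assuming the array is /dev/md0 and guessing at the
member names:

  mdadm --stop /dev/md0
  mdadm --assemble /dev/md0 /dev/sd[b-e]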
> The faulty disks are /dev/sda and /dev/sdd, and I copied the contents
> to new WD RED 4TB disks, with
>
> ddrescue -d -s <size-of-target-disk> -f /dev/sd<source> /dev/sd<target> sd<source>.map
>
> The size argument is needed because the new disks are some 4MB smaller
> than the originals.
>
> ddrescue saw 14 read errors on one disk and 54 on the other, and copied
> 99.99% of the source.
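You can double-check from the map file how much really came across;
assuming GNU ddrescue, its companion tool prints the totals:

  ddrescuelog -t sd<source>.map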
> Is there a way to revive the array, and if yes, how can I do that?
Firstly, if the blues don't support SCT/ERC, you NEED NEED NEED to fix
the timeout mismatch problem. I suspect that's what blew up your array.
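If the blues do support SCT/ERC, set it as per the smartctl example
above. If they don't, the usual workaround from the wiki is to raise
the kernel's command timeout well past the drive's internal retry time.
It's per drive and doesn't survive a reboot, so it wants to go in a
boot script:

  echo 180 > /sys/block/sda/device/timeout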
Now because you've got a bunch of read errors, I suspect you're going to
lose some data, sorry. You have two choices.
1) Force assemble the array with all four drives, and run a repair. This
should fix your read errors, but risks losing data thanks to the event
counter mismatch.
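A sketch of option 1, with made-up member names (the two ddrescue
copies plus the two good originals - substitute your real devices):

  mdadm --stop /dev/md0
  mdadm --assemble --force /dev/md0 /dev/sd[b-e]
  echo repair > /sys/block/md0/md/sync_action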
2) Force assemble the array with the three good drives, then re-add the
fourth. If you've got bitmaps, and it re-adds cleanly, then run a repair
and you *might* get everything back. Otherwise it might just add, in
which case it will re-sync of its own accord; that will give you a
clean array, but the read errors will cause data loss. Sorry.
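And a sketch of option 2, again with assumed names - the three good
members assembled degraded, then the odd-event-count drive (/dev/sdf,
going by your output) re-added:

  mdadm --stop /dev/md0
  mdadm --assemble --force --run /dev/md0 /dev/sdb /dev/sdc /dev/sde
  mdadm --re-add /dev/md0 /dev/sdf
  echo repair > /sys/block/md0/md/sync_action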
Note that before you do any re-assembly, you need to stop the array
(mdadm --stop), otherwise pretty much anything you try will fail with
"device busy".
Cheers,
Wol