Re: Raid6 recovery

Phil Turmel <philip@xxxxxxxxxx> · Sat, 21 Mar 2020 15:24:09 -0400

Hi Glenn,

{Convention on kernel.org lists is to interleave replies or bottom post, 
and to trim non-relevant quoted material.  Please do so in the future.}

On 3/21/20 7:54 AM, Glenn Greibesland wrote:
> Yes, I am aware of the problems with WD Green and multiple partitions
> on single 4TB disk. I am in the middle of getting rid of old disks and
> I have enough new drives to stop having multiple partitions on single
> drives, but not enough power and free SATA ports. It is just a
> temporary solution. Also a reason why I did not include much details
> in the original post, I knew it would just distract from the problem
> I want to solve right away.
>
> What I need help with now is just getting the array started with the
> 16 out of 18 disks. Then I can continue migrating data and replacing
> old disks as planned.

I've examined the material posted, and the sequence of events described.
The --re-add damaged that one drive's role record and there is no
programmatic way in mdadm to correct it.

Since you seem comfortable reading source code, you might consider byte 
editing that drive's superblock to restore it to "active device 10". 
That is what I would do.  With that corrected, --assemble --force should 
give you a running array.
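
A minimal sketch of what such a byte edit involves, assuming v1.x
metadata. The field offsets are the editor's reading of the kernel's
struct mdp_superblock_1 and must be checked against md_p.h and against
mdadm --examine before anything is written to disk. All function names
are hypothetical. The code works only on a copy of the superblock held
in memory; it does not touch a device. The role field is covered by the
superblock checksum, so the checksum has to be recomputed after the
edit.

```python
import struct

# Offsets are assumptions taken from struct mdp_superblock_1 (md v1.x);
# verify them against your kernel's md_p.h before relying on them.
MD_SB_MAGIC = 0xA92B4EFC
OFF_DEV_NUMBER = 160   # __le32: this device's index into dev_roles[]
OFF_SB_CSUM = 216      # __le32: superblock checksum
OFF_MAX_DEV = 220      # __le32: number of entries in dev_roles[]
OFF_DEV_ROLES = 256    # __le16 array: role (slot) per device number


def read_role(sb):
    """Return (dev_number, role) from an in-memory superblock copy."""
    (magic,) = struct.unpack_from("<I", sb, 0)
    if magic != MD_SB_MAGIC:
        raise ValueError("not an md v1.x superblock")
    (dev_number,) = struct.unpack_from("<I", sb, OFF_DEV_NUMBER)
    (role,) = struct.unpack_from("<H", sb, OFF_DEV_ROLES + 2 * dev_number)
    return dev_number, role


def calc_csum(sb):
    """Sum the superblock as little-endian 32-bit words with sb_csum
    treated as zero, then fold the carry back into 32 bits."""
    (max_dev,) = struct.unpack_from("<I", sb, OFF_MAX_DEV)
    size = OFF_DEV_ROLES + 2 * max_dev
    buf = bytearray(sb[:size])
    struct.pack_into("<I", buf, OFF_SB_CSUM, 0)
    words = size // 4
    total = sum(struct.unpack_from("<%dI" % words, buf, 0))
    if size % 4 == 2:
        total += struct.unpack_from("<H", buf, words * 4)[0]
    return ((total & 0xFFFFFFFF) + (total >> 32)) & 0xFFFFFFFF


def with_role(sb, role):
    """Return a patched copy with this device's role set and the
    checksum refreshed. The input buffer is left unchanged."""
    buf = bytearray(sb)
    dev_number, _ = read_role(buf)
    struct.pack_into("<H", buf, OFF_DEV_ROLES + 2 * dev_number, role)
    struct.pack_into("<I", buf, OFF_SB_CSUM, calc_csum(buf))
    return bytes(buf)
```

With v1.2 metadata the superblock sits 4096 bytes from the start of the
member device. Save a copy of that region to a file, run read_role() on
it, and compare the result with what mdadm --examine reports before
trusting any of the offsets above.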

In lieu of superblock surgery, you will indeed need to perform a 
--create --assume-clean, as you proposed in your original email.  Since 
you have already constructed a syntactically valid command for that 
purpose, with appropriate data offsets, that might be the fastest way to 
get a running array.

I would double-check the /dev/ name versus array "active device" number 
relationship to ensure strict ordering in your --create operation. 
Incorrect ordering will utterly scramble your content.
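
One way to make that cross-check mechanical is sketched here. It
assumes each member's mdadm --examine output has been captured as text
and that the role appears on a line of the form "Device Role : Active
device N". The function name and the sample device names are
hypothetical. Slots with no matching device are filled with the
"missing" keyword that --create accepts.

```python
import re

# Matches the role line printed by `mdadm --examine` for v1.x metadata.
ROLE_RE = re.compile(
    r"^\s*Device Role\s*:\s*Active device (\d+)\s*$", re.MULTILINE
)


def create_order(examine_by_dev, raid_disks):
    """Map captured --examine text to a slot-ordered device list.

    examine_by_dev: dict of device name -> --examine output text.
    Devices without an "Active device N" line (for example one that
    now reports as a spare) are skipped and must be placed by hand.
    """
    slots = ["missing"] * raid_disks
    for dev, text in examine_by_dev.items():
        match = ROLE_RE.search(text)
        if match is None:
            continue
        slot = int(match.group(1))
        if slot >= raid_disks:
            raise ValueError("%s claims slot %d, out of range" % (dev, slot))
        if slots[slot] != "missing":
            raise ValueError(
                "slot %d claimed by %s and %s" % (slot, slots[slot], dev)
            )
        slots[slot] = dev
    return slots
```

The returned list is in the order the devices should appear on the
--create command line. The drive whose role record was damaged will not
match the pattern, so its slot shows as "missing" until it is filled in
manually.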

> When I built the array in 2012, I used WD Green. They turned out to be
> horrible disks and I have since replaced some of them with WD Red. The
> newest disks I've bought are Ironwolves

I also noted the drives with Error Recovery Control turned off.  That is 
not an issue while your array has no redundancy, but is catastrophic in 
any normal array.  It is as bad as having a drive that doesn't do ERC at 
all.  Don't do that.  Do read the "Timeout Mismatch" documentation that 
Anthony recommended, if you haven't yet.
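
The reasoning behind the timeout mismatch can be sketched as a simple
comparison. The function name is hypothetical. It assumes the drive's
ERC read limit is known in deciseconds (the unit smartctl uses for
scterc, where 70 means 7.0 seconds) and that the kernel's per-command
timeout is the usual Linux default of 30 seconds unless changed.

```python
def timeout_mismatch(erc_deciseconds, kernel_timeout_s=30):
    """Return True if the drive may still be retrying a bad sector
    when the kernel gives up on the command.

    erc_deciseconds: the drive's ERC read limit in tenths of a second,
    or None / 0 if ERC is disabled or unsupported.
    """
    if not erc_deciseconds:
        # No ERC limit: the drive can retry internally for far longer
        # than the kernel is willing to wait.
        return True
    return erc_deciseconds / 10 >= kernel_timeout_s
```

A drive with ERC set to 70 against the default kernel timeout is fine.
A drive with ERC disabled is a mismatch at the default timeout, which is
the case described above.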

I also recommend, when you get to a running array, that you prioritize 
the backup of its content--get the critical data copied out ASAP.  Your 
array will be very vulnerable to Unrecoverable Read Errors until you've 
completed your reconfiguration onto new drives.  Do not attempt to scrub 
the array or read every file right away, as any URE may break the array 
again.

If UREs do break your array again, you will need to use an 
error-ignoring copy tool (some flavor of ddrescue) to put the readable 
data onto a new device, remove the old device from the system, and then 
--assemble --force with the replacement.  Repeat as needed.
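
As a hedged illustration of the copy step, the sketch below builds the
two GNU ddrescue invocations commonly used for a disk-to-disk rescue: a
fast first pass that skips scraping (-n), then a retry pass with direct
access (-d) and a retry count (-r). The function name, device paths and
mapfile name are hypothetical, and the code only returns argument
lists. It runs nothing.

```python
def ddrescue_passes(src, dst, mapfile, retries=3):
    """Return argv lists for a two-pass GNU ddrescue copy.

    -f is required by ddrescue when the destination is a block device.
    The same mapfile is reused so the second pass only revisits areas
    the first pass could not read.
    """
    if src == dst:
        raise ValueError("source and destination must differ")
    first = ["ddrescue", "-f", "-n", src, dst, mapfile]
    second = ["ddrescue", "-f", "-d", "-r%d" % retries, src, dst, mapfile]
    return [first, second]
```

Double-check which path is the failing source and which is the new
destination before running either command, since -f will overwrite the
destination without asking.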

Good luck!

Regards,

Phil