Re: Revive a dead md raid5 array


 




On 21-11-18 at 01:29, Wol's lists wrote:
On 20/11/2018 21:53, Jogchum Reitsma wrote:

On 20-11-18 at 20:32, Wol's lists wrote:
On 19/11/2018 22:35, Jogchum Reitsma wrote:
Hi,

Is there a way to revive the array, and if yes, how can I do that?

Firstly, if the blues don't support SCT/ERC, you NEED NEED NEED to fix the timeout mismatch problem. I suspect that's what blew up your array.
That, I think, is done by "for x in /sys/block/*/device/timeout ; do echo 180 > $x ; done"? Copied from https://marc.info/?l=linux-raid&m=144535576302583&w=2
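
And to check whether the Blues support SCT/ERC at all, I suppose something like this would tell me (assuming smartmontools is installed; a drive that doesn't support it says so in the output):

   for d in /dev/sd[abdf] ; do echo "== $d =="; smartctl -l scterc $d ; done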

Or recreate the array, with WD Red disks? Funny thing is, the disk with a mismatch in event count was *not* kicked out of the array...

NO NO NO! Never, EVER, recreate an array except as an absolute last resort. You're nowhere near that!

Unless you mean just doing an ordinary assemble on the new reds. In which case, I'd say yes, go ahead. Just *never* use the --create option with a disk you're trying to recover - your chances of wiping the disk instead are far too high.
Yes, that last option is what I meant to say; a bit foolish of me to use the word "create" here.
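
Before assembling anything I intend to look at the superblocks once more, to see the event counts and array states for myself - I assume something like this is enough:

   mdadm --examine /dev/sd[abdf] | egrep 'dev/sd|Events|Array State'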

Now because you've got a bunch of read errors, I suspect you're going to lose some data, sorry. You have two choices.

1) Force assemble the array with all four drives, and run a repair. This should fix your read errors, but risks losing data thanks to the event counter mismatch.
Excuse my ignorance here, but what do you mean by "repair"? Run a fsck?

https://raid.wiki.kernel.org/index.php/Scrubbing_the_drives

Write "repair" to sync_action ...
Got it.
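
And once the repair has finished, I guess the mismatch count tells me whether the scrub actually had anything to fix:

   cat /sys/block/md0/md/mismatch_cnt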

2) Force assemble the array with the three good drives, then re-add the fourth. If you've got bitmaps, and it re-adds cleanly, then run a repair and you *might* get everything back. Otherwise it might just add, so it will re-sync of its own accord, which will give you a clean array with no errors but the read errors will cause data loss. Sorry.



Okay. It's your choice. I think your best option is - having fixed the time-out problem - to try force-assembling the array using the Blues, and then do a repair. Then if everything looks good swap the Reds in using --replace, retiring the Blues.

If you want to just assemble the Reds into a new array and leave the Blues as a backup, you're likely to end up with silent corruption. Those blocks that didn't copy will be corrupt, with no way to identify them. At least if you try to recover the Blues, you're more likely to trip over read errors and find out what's been corrupted - or better the raid recovery will kick in and repair your data (if checking the blues hits a read error it will try and recover - if you use the reds there will be no read error, and no attempt to recover the data).

Ah, clear.
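
To see whether I actually have a write-intent bitmap (so that re-adding /dev/sdf could come back cleanly, as in your option 2), I assume I can check it like this:

   mdadm --examine /dev/sda | grep -i bitmap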

So this is what I think I should do (leaving the blues in place for now):

   for x in /sys/block/*/device/timeout ; do echo 180 > $x ; done

   mdadm --stop /dev/md0

   mdadm --assemble --force /dev/md0 /dev/sda /dev/sdb /dev/sdd

   mdadm /dev/md0 --add /dev/sdf

   mdadm --run /dev/md0

   echo repair > /sys/block/md0/md/sync_action
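
and then keep an eye on the resync/repair progress while it runs, something like:

   watch -n 60 cat /proc/mdstat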

What still puzzles me a bit, though it's more out of curiosity, is that it is /dev/sdf which has the deviant event count, while that drive is one of the two that was NOT kicked out of the array.

When the array is up and running healthy again, the first thing to do is of course to update my oldest full backup, then retire the disks with read errors one by one and replace them with the WD Red ones I already have. The question here is: will mdadm be confused by the fact that these Red disks carry copies of the faulty Blue ones?
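
If it would be confused, I assume wiping the copied superblock on each Red disk before adding it avoids that, and that the swap itself can then be done one disk at a time with --replace, roughly like this (/dev/sdX here is just a placeholder for a Red disk that is NOT a member of md0):

   mdadm --zero-superblock /dev/sdX
   mdadm /dev/md0 --add /dev/sdX
   mdadm /dev/md0 --replace /dev/sda --with /dev/sdX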

After that, send the faulty disks back to the supplier under warranty, buy two new WD Red disks, and retire one by one the Blue ones still in the array.

During this adventure it occurred to me to convert the array to raid6, but with disks that support SCT/ERC it seems less necessary to me. It would also need another 4TB disk to keep the net capacity of the array.
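
If I ever do decide to go to raid6, I understand the reshape would look roughly like this, with a fifth disk added first (just a sketch; the device name and backup-file path are only examples):

   mdadm /dev/md0 --add /dev/sdY
   mdadm --grow /dev/md0 --level=6 --raid-devices=5 --backup-file=/root/md0-reshape.bak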

Cheers,
Wol


Many thanks again!! If you have comments on the actions I described above, please let me know!

Cheers, Jogchum.





