Re: Revive a dead md raid5 array


 



On 20/11/2018 21:53, Jogchum Reitsma wrote:

Op 20-11-18 om 20:32 schreef Wol's lists:
On 19/11/2018 22:35, Jogchum Reitsma wrote:
Hi,



That said, though I wouldn't call myself an expert, I know enough to think everything will be fine. How easy would it be for you to back up the disk if you recover it read-only?
I'm not sure I understand what you mean by "back up the disk" - I did a ddrescue from the faulty disks to new WD 4TB disks, this time from the Red series, which do support SCT/ERC. Isn't that just what you mean by "backup"?
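For the record, the ddrescue runs were along these lines (device names generalised; the mapfile is what lets ddrescue resume and retry the bad sectors on later passes):

  ddrescue -f /dev/sdX /dev/sdY rescue.map        # first pass, old disk to new
  ddrescue -f -r3 /dev/sdX /dev/sdY rescue.map    # retry remaining bad sectors up to 3 times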

Sorry, I should have read a bit further - I was thinking if you had lost the array, would you be able to recover your data. But you are well backed up, so that's okay. If you'd got a 3-disk array running, it would have been at risk until it was fully recovered. I'm just rather cautious when data (especially someone else's) is at risk.





Is there a way to revive the array, and if yes, how can I do that?

Firstly, if the blues don't support SCT/ERC, you NEED NEED NEED to fix the timeout mismatch problem. I suspect that's what blew up your array.
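The fix is on the wiki, but in short it's something like this (drive letters are examples only, and neither setting survives a reboot, so it needs doing from a boot script):

  smartctl -l scterc,70,70 /dev/sdX           # drive supports SCT/ERC: tell it to give up after 7 seconds
  echo 180 > /sys/block/sdX/device/timeout    # drive doesn't (e.g. desktop Blues): raise the kernel timeout instead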

Or recreate the array, with WD Red disks? Funny thing is, the disk with a mismatch in event count was *not* kicked out of the array...

NO NO NO! Never, EVER, recreate an array except as an absolute last resort. You're nowhere near that!

Unless you mean just doing an ordinary assemble on the new reds. In which case, I'd say yes, go ahead. Just *never* use the --create option with a disk you're trying to recover - your chances of wiping the disk instead are far too high.
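By "ordinary assemble" I mean something like this (md device and partition names assumed - adjust to suit):

  mdadm --assemble /dev/md0 /dev/sd[abcd]1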

Now because you've got a bunch of read errors, I suspect you're going to lose some data, sorry. You have two choices.

1) Force assemble the array with all four drives, and run a repair. This should fix your read errors, but risks losing data thanks to the event counter mismatch.
Excuse my ignorance here, but what do you mean by "repair"? Run a fsck?

https://raid.wiki.kernel.org/index.php/Scrubbing_the_drives

Write "repair" to sync_action ...

2) Force assemble the array with the three good drives, then re-add the fourth. If you've got bitmaps, and it re-adds cleanly, then run a repair and you *might* get everything back. Otherwise it might just add as a fresh drive, in which case it will re-sync of its own accord; that gives you a clean array with no errors, but the read errors will cause data loss. Sorry.
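Again only a sketch, with sdd1 standing in for the drive that got kicked out:

  mdadm --assemble --force /dev/md0 /dev/sd[abc]1
  mdadm /dev/md0 --re-add /dev/sdd1
  echo repair > /sys/block/md0/md/sync_action    # if the re-add was clean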

I've got to learn a lot - what do you mean by "If you've got bitmaps"?

Bitmaps are an option (I think they're enabled by default now) so that when you assemble an array from drives with mismatching event counts (or re-add a disk that has been booted out), the raid knows which writes have or have not made it to disk, and just updates those. If you haven't got bitmaps, then re-adding such a disk is just like adding a new drive - the raid doesn't know what data is or isn't valid on the disk, so it just recreates the lot. Slow, and very stressful on the array.
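Checking for one (and adding one if it's missing) is easy, assuming /dev/md0:

  mdadm --detail /dev/md0 | grep -i bitmap    # "Intent Bitmap : Internal" means you have one
  mdadm --grow --bitmap=internal /dev/md0     # add an internal bitmap to a running array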

As said, I made copies of the disks with read errors, using ddrescue. AFAICS ddrescue managed to overcome some, though not all, read errors, so I expect the new disks to be better in that respect than the originals.
These new disks support SCT/ERC, so that's also an improvement.
Wouldn't it be better, with that in mind, to replace the disks that have read errors with the new ones, and revive the array with those?



Note that before you do any re-assembly, you need to do an array stop, otherwise pretty much anything you try will fail with "device busy".
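That is:

  mdadm --stop /dev/md0    # md0 assumed; any filesystem on it must be unmounted first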

Okay. It's your choice. I think your best option is - having fixed the timeout problem - to try force-assembling the array using the Blues, and then do a repair. Then if everything looks good, swap the Reds in using --replace, retiring the Blues.
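The replace step would be, per disk (old/new device names are placeholders):

  mdadm /dev/md0 --replace /dev/sdX1 --with /dev/sdY1

--replace rebuilds onto the new disk while the old one is still live in the array, so you keep full redundancy the whole time.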

If you want to just assemble the Reds into a new array and leave the Blues as a backup, you're likely to end up with silent corruption. The blocks that didn't copy will be corrupt, with no way to identify them. At least if you try to recover the Blues, you're more likely to trip over the read errors and find out what's been corrupted - or better, the raid recovery will kick in and repair your data. (If checking the Blues hits a read error, the raid will try to recover it; if you use the Reds there will be no read error, and no attempt to recover the data.)
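If you want to see how bad things are before repairing anything, a "check" scrub will count the inconsistencies without correcting the parity (md0 assumed):

  echo check > /sys/block/md0/md/sync_action
  cat /sys/block/md0/md/mismatch_cnt    # inconsistent sectors found by the scrub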

Cheers,
Wol


