On 20/11/2018 21:53, Jogchum Reitsma wrote:
On 20-11-18 at 20:32, Wol's lists wrote:
On 19/11/2018 22:35, Jogchum Reitsma wrote:
Hi,
That said, though I wouldn't call myself an expert, I know enough to
think everything will be fine. How easy would it be for you to back up
the disk if you recover it read-only?
I'm not sure I understand what you mean by "back up the disk" - I did a
ddrescue from the faulty disks to new WD 4TB disks, this time from the
RED series, which do support SCT/ERC. Isn't that just what you mean by
"backup"?
Sorry, I should have read a bit further - I was thinking if you had lost
the array, would you be able to recover your data. But you are well
backed up, so that's okay. If you'd got a 3-disk array running, it would
have been at risk until it was fully recovered. I'm just rather cautious
when data (especially someone else's) is at risk.
Is there a way to revive the array, and if yes, how can I do that?
Firstly, if the blues don't support SCT/ERC, you NEED NEED NEED to fix
the timeout mismatch problem. I suspect that's what blew up your array.
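To make the timeout fix concrete, here is a minimal Python sketch of the usual two-way choice: if the drive supports SCT/ERC, tell it to give up on a bad sector quickly; if not, make the kernel wait longer than the drive does. The function name, the 7-second ERC value and the 180-second kernel timeout are assumptions (commonly quoted figures, not something stated in this thread), so check them against your own drives before relying on them.

```python
def timeout_fix_command(device, supports_scterc,
                        erc_deciseconds=70, kernel_timeout_s=180):
    """Return the argv that applies the timeout fix to one drive, e.g. 'sda'.

    Hypothetical helper: it only builds the command, it does not run it.
    """
    if supports_scterc:
        # Drive stops retrying a bad sector after 7 s (70 deciseconds),
        # well inside the kernel's default 30 s command timeout.
        erc = f"scterc,{erc_deciseconds},{erc_deciseconds}"
        return ["smartctl", "-l", erc, f"/dev/{device}"]
    # No SCT/ERC: let the kernel wait longer than the drive's own retries.
    return ["sh", "-c",
            f"echo {kernel_timeout_s} > /sys/block/{device}/device/timeout"]
```

Each returned list can be handed to `subprocess.run(cmd, check=True)` as root. As far as I know neither setting survives a power cycle, so it would need to be reapplied at every boot.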
Or recreate the array, with WD Red disks? Funny thing is, the disk with
a mismatch in event count was *not* kicked out of the array...
NO NO NO! Never, EVER, recreate an array except as an absolute last
resort. You're nowhere near that!
Unless you mean just doing an ordinary assemble on the new reds. In
which case, I'd say yes, go ahead. Just *never* use the --create option
with a disk you're trying to recover - your chances of wiping the disk
instead are far too high.
Now because you've got a bunch of read errors, I suspect you're going
to lose some data, sorry. You have two choices.
1) Force assemble the array with all four drives, and run a repair.
This should fix your read errors, but risks losing data thanks to the
event counter mismatch.
Excuse my ignorance here, but what do you mean by "repair"? Run a fsck?
https://raid.wiki.kernel.org/index.php/Scrubbing_the_drives
Write "repair" to sync_action ...
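A rough Python sketch of what "write repair to sync_action" amounts to, assuming the array is called md0 (a placeholder name) and that the md sysfs files live under /sys/block/<array>/md as described on the wiki page above. The helper names are made up for illustration, and the sysfs_root parameter exists only so the code can be tried against a scratch directory.

```python
from pathlib import Path


def md_sysfs_dir(md_name, sysfs_root="/sys"):
    """Directory holding the md control files for one array, e.g. 'md0'."""
    return Path(sysfs_root) / "block" / md_name / "md"


def start_repair(md_name, sysfs_root="/sys"):
    """Ask md to scrub the array and rewrite any inconsistent stripes."""
    (md_sysfs_dir(md_name, sysfs_root) / "sync_action").write_text("repair\n")


def scrub_status(md_name, sysfs_root="/sys"):
    """Return (current sync_action, mismatch_cnt) for the array."""
    md_dir = md_sysfs_dir(md_name, sysfs_root)
    action = (md_dir / "sync_action").read_text().strip()
    mismatches = int((md_dir / "mismatch_cnt").read_text())
    return action, mismatches
```

Run as root, `start_repair("md0")` kicks off the scrub and `scrub_status("md0")` can be polled until the action goes back to "idle".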
2) Force assemble the array with the three good drives, then re-add
the fourth. If you've got bitmaps, and it re-adds cleanly, then run a
repair and you *might* get everything back. Otherwise it might just
add, so it will re-sync of its own accord, which will give you a clean
array with no errors but the read errors will cause data loss. Sorry.
I've got to learn a lot - what do you mean by "If you've got bitmaps"?
Bitmaps are an option (I think they're enabled by default now) so that
when you assemble an array from drives with mismatching event counts (or
re-add a disk that has been kicked out), it knows which writes have or have
not made it to disk, and just updates the disk. If you haven't got
bitmaps, then re-adding such a disk is just like adding a new drive -
the raid doesn't know what data is or isn't valid on the disk, so it
just recreates the lot. Slow, and very stress-y on the array.
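To see which members are behind, you can compare the event counts that `mdadm --examine` prints for each device. The Python sketch below only parses text you have already captured; it assumes v1.x metadata, where the "Events" line holds a plain integer, and both function names are invented for this example.

```python
import re


def parse_events(examine_text):
    """Return the event count from `mdadm --examine` output, or None."""
    match = re.search(r"^\s*Events\s*:\s*(\d+)", examine_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def stale_members(events_by_device):
    """Return the devices whose event count is below the highest one seen."""
    newest = max(events_by_device.values())
    return sorted(dev for dev, events in events_by_device.items()
                  if events < newest)
```

Feed `parse_events` the output for each member, collect the results in a dict keyed by device name, and `stale_members` lists the ones a forced assemble or re-add would have to bring up to date.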
As said, I made copies of the disks with read errors, using ddrescue.
AFAICS ddrescue managed to overcome some, though not all, read errors,
so I expect the new disks to be better in that respect than the originals.
These new disks support SCT/ERC, so that's also an improvement.
Wouldn't it be better, with that in mind, to replace the disks that have
read errors with the new ones, and revive the array with those?
Note that before you do any re-assembly, you need to stop the array
(mdadm --stop), otherwise pretty much anything you try will fail with
"device busy".
Okay. It's your choice. I think your best option is - having fixed the
time-out problem - to try force-assembling the array using the Blues,
and then do a repair. Then if everything looks good swap the Reds in
using --replace, retiring the Blues.
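Spelled out as commands, that sequence might look like the Python sketch below. It only builds the mdadm argument lists so they can be read and checked before anything is run. The device names in the usage note are placeholders, the function name is made up, and the repair pass (a sysfs write, not an mdadm command) belongs between the forced assemble and the first --replace.

```python
def recovery_plan(md, old_members, replacements):
    """Return mdadm argv lists for: stop, forced assemble, then replace.

    `replacements` maps each old member to the new device that takes over.
    Hypothetical helper: nothing is executed here.
    """
    plan = [
        ["mdadm", "--stop", md],  # avoids "device busy" on the assemble
        ["mdadm", "--assemble", "--force", md, *old_members],
        # Run the repair via sync_action here and check the result
        # before going any further.
    ]
    for old, new in replacements.items():
        plan.append(["mdadm", md, "--add", new])  # new disk joins as a spare
        plan.append(["mdadm", md, "--replace", old, "--with", new])
    return plan
```

For example, `recovery_plan("/dev/md0", ["/dev/sda1", "/dev/sdb1", "/dev/sdc1", "/dev/sdd1"], {"/dev/sda1": "/dev/sde1"})` gives four commands to review and run one at a time. Note that --create appears nowhere in it.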
If you want to just assemble the Reds into a new array and leave the
Blues as a backup, you're likely to end up with silent corruption. Those
blocks that didn't copy will be corrupt, with no way to identify them.
At least if you try to recover the Blues, you're more likely to trip
over read errors and find out what's been corrupted - or, better, the raid
recovery will kick in and repair your data (if checking the Blues hits a
read error it will try and recover - if you use the Reds there will be
no read error, and no attempt to recover the data).
Cheers,
Wol