On 21-11-18 at 01:29, Wol's lists wrote:
On 20/11/2018 21:53, Jogchum Reitsma wrote:
On 20-11-18 at 20:32, Wol's lists wrote:
On 19/11/2018 22:35, Jogchum Reitsma wrote:
Hi,
Is there a way to revive the array, and if yes, how can I do that?
Firstly, if the blues don't support SCT/ERC, you NEED NEED NEED to
fix the timeout mismatch problem. I suspect that's what blew up your
array.
That, I think, is done by "for x in /sys/block/*/device/timeout ; do
echo 180 > $x ; done"? Copied from
https://marc.info/?l=linux-raid&m=144535576302583&w=2
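As far as I understand it, one can also check with smartctl whether a drive supports SCT/ERC at all, and if it does, set a 7-second error recovery instead of (or next to) the 180s kernel timeout. Roughly like this, I assume - the device name is just an example:
smartctl -l scterc /dev/sda         # report the current SCT ERC setting, if the drive supports it
smartctl -l scterc,70,70 /dev/sda   # set read/write error recovery to 7.0 s (units of 100 ms)
Please correct me if I have that wrong.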
Or recreate the array, with WD Red disks? Funny thing is, the disk
with a mismatch in event count was *not* kicked out of the array...
NO NO NO! Never, EVER, recreate an array except as an absolute last
resort. You're nowhere near that!
Unless you mean just doing an ordinary assemble on the new reds. In
which case, I'd say yes, go ahead. Just *never* use the --create
option with a disk you're trying to recover - your chances of wiping
the disk instead are far too high.
Yes, that last option is what I meant to say; a bit foolish of me to use
the word "create" here.
Now because you've got a bunch of read errors, I suspect you're
going to lose some data, sorry. You have two choices.
1) Force assemble the array with all four drives, and run a repair.
This should fix your read errors, but risks losing data thanks to
the event counter mismatch.
Excuse my ignorance here, but what do you mean by "repair"? Run a fsck?
https://raid.wiki.kernel.org/index.php/Scrubbing_the_drives
Write "repair" to sync_action ...
Got it.
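So, to check I've read that page correctly, the scrub cycle would be something like this (md0 as in my setup):
echo check > /sys/block/md0/md/sync_action    # read-only scrub, only counts problems
cat /sys/block/md0/md/mismatch_cnt            # mismatches found by the last check
echo repair > /sys/block/md0/md/sync_action   # scrub that also rewrites unreadable/mismatched blocks
cat /proc/mdstat                              # progress of the running scrub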
2) Force assemble the array with the three good drives, then re-add
the fourth. If you've got bitmaps, and it re-adds cleanly, then run
a repair and you *might* get everything back. Otherwise it might
just add, so it will re-sync of its own accord, which will give you
a clean array with no errors but the read errors will cause data
loss. Sorry.
Okay. It's your choice. I think your best option is - having fixed the
time-out problem - to try force-assembling the array using the Blues,
and then do a repair. Then if everything looks good swap the Reds in
using --replace, retiring the Blues.
If you want to just assemble the Reds into a new array and leave the
Blues as a backup, you're likely to end up with silent corruption.
Those blocks that didn't copy will be corrupt, with no way to identify
them. At least if you try to recover the Blues, you're more likely to
trip over read errors and find out what's been corrupted - or better
the raid recovery will kick in and repair your data (if checking the
blues hits a read error it will try and recover - if you use the reds
there will be no read error, and no attempt to recover the data).
Ah, clear.
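Just to check my understanding of the --replace step you mention above, I assume the swap would go per disk roughly like this (device names are examples only):
mdadm /dev/md0 --add /dev/sde                      # add a Red as a spare
mdadm /dev/md0 --replace /dev/sda --with /dev/sde  # copy onto the Red while the Blue stays active
mdadm /dev/md0 --remove /dev/sda                   # remove the Blue once it is marked as replaced/faulty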
So this is what I think I should do (leaving the blues in place for now):
for x in /sys/block/*/device/timeout ; do echo 180 > $x ; done
mdadm --stop /dev/md0
mdadm --assemble --force /dev/md0 /dev/sda /dev/sdb /dev/sdd
mdadm /dev/md0 --add /dev/sdf
mdadm --run /dev/md0
echo repair > /sys/block/md0/md/sync_action
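And, assuming this is a sensible sanity check, before assembling I'd compare the superblocks, and afterwards keep an eye on the resync/repair:
mdadm --examine /dev/sd[abdf] | egrep '/dev/sd|Events'   # compare the event counts first
mdadm --detail /dev/md0                                  # state of the assembled array
cat /proc/mdstat                                         # progress of the resync / repair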
What still puzzles me a bit, though it's mostly curiosity, is that it is
/dev/sdf that has the deviant event count, while this drive is one of
the two that were NOT kicked out of the array.
When the array is up and running healthy again, the first thing is of course
to update my oldest full backup, then retire the disks with read errors one
by one and replace them with the WD Red ones I already have.
Question here is, will mdadm be confused by the fact that these Red
disks bear copies of the faulty blue ones?
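My own assumption here - please correct me if this is wrong or dangerous - is that I would wipe the stale md metadata off each Red before handing it to the array, after triple-checking it really is the copy and not a live member (sdX is a placeholder):
mdadm --examine /dev/sdX            # confirm this is the copied Red, not an array member
mdadm --zero-superblock /dev/sdX    # erase the copied metadata so mdadm cannot mistake it for a member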
After that, send the faulty disks back to the supplier under warranty, buy two
new WD Red disks, and retire one by one the Blue ones still in the array.
During this adventure, it occurred to me to change the array level to
raid6, but with disks that support SCT/ERC it seems less necessary to me.
It would need another 4TB disk to keep the net capacity of the array.
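Should I ever go that way, I believe the raid5-to-raid6 reshape would be roughly as below, once a fifth 4TB disk is in place (device name and backup path are made up by me):
mdadm /dev/md0 --add /dev/sdg
mdadm --grow /dev/md0 --level=6 --raid-devices=5 --backup-file=/root/md0-grow.bak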
Cheers,
Wol
Many thanks again!! If you have comments on the actions I described
above, please let me know!
Cheers, Jogchum.