On 20-11-18 at 20:32, Wol's lists wrote:
On 19/11/2018 22:35, Jogchum Reitsma wrote:
Hi,
New to this list, I understand that I can post a problem with an md
raid array here. If I'm wrong in that supposition, please let me
know, and accept my apologies!
That's right. Here you get the experts, but they may take time
responding :-(
No problem! :-)
That said, though I wouldn't call myself an expert, I know enough to
think everything will be fine. How easy would it be for you to back up
the disk if you recover it read-only?
I'm not sure I understand what you mean by "back up the disk" - I did a
ddrescue from the faulty disks to new WD 4TB disks, this time from the
RED series, which do support SCT/ERC. Isn't that just what you mean by
"backup"?
I have a 4-disk raid5 array, of which two disks were kicked out
because of read errors. The disks are WD Blue 4TB disks, which are
still under warranty.
Just looked at the spec of those drives. It looks a bit worrisome to
me. Take a look at the raid wiki:
https://raid.wiki.kernel.org/index.php/Linux_Raid
I already read that, and also (among others) the link
https://raid.wiki.kernel.org/index.php/Timeout_Mismatch mentioned there.
I'm pretty sure I read somewhere that for software raid NAS disks should
*not* be used, so when I created the array I bought WD Blue disks. But,
reading the info in the links mentioned, I have now bought 2 WD Red disks,
which indeed support TLER (as WD names it).
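(A quick way to double-check that, as a sketch - the device name below is
just a placeholder for wherever a given drive shows up:

    smartctl -l scterc /dev/sdX

A drive that supports it reports its current read/write recovery timeouts;
one that doesn't reports that the SCT Error Recovery Control command is not
supported.)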
What I would really like to see is whether these drives support
SCT/ERC. If they don't, there's our first problem. I notice WD says
raids 0 and 1, not 5 ...
See my answer above. And there's no problem buying another two WD Red
disks, to copy the contents of the other two disks to.
I have reasonable recent backups, but yet I would like to try to get
the array alive again.
That shouldn't be hard.
Funny thing is, mdadm --examine reports the array as being raid0:
/dev/md0:
Version : 1.2
Raid Level : raid0
Total Devices : 4
Persistence : Superblock is persistent
This seems to be a recent glitch in mdadm. Don't worry about it ...
OK, thanks!
=================================================================
Maybe you noticed the fact that all disks are marked as spare, and
that the event count of one of the disks, /dev/sdf, is different from
the others'.
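(A minimal sketch for comparing the event counts - only /dev/sda, /dev/sdd
and /dev/sdf are named in this thread, so the fourth device below is a
placeholder; adjust if the members are partitions rather than whole disks:

    for d in /dev/sda /dev/sdd /dev/sde /dev/sdf; do
        echo -n "$d  "; mdadm --examine "$d" | grep Events
    done

)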
I found some more occurrences of a raid5 being recognized as a raid0
device, but not a real solution to this.
The solution, iirc, was just to stop the array and re-assemble it - as
soon as the array was running, it sorted itself out.
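(As a sketch, assuming the array is /dev/md0 and the members are the four
drives discussed above:

    mdadm --stop /dev/md0
    mdadm --assemble /dev/md0 /dev/sda /dev/sdd /dev/sde /dev/sdf

or "mdadm --assemble --scan" if mdadm.conf describes the array.)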
=============================================================================================
The faulty disks are /dev/sda and /dev/sdd, and I copied the contents
to new WD RED 4TB disks, with
ddrescue -d -s <size-of-target-disk> -f /dev/sd<source> /dev/sd<target> sd<source>.map
The size argument is needed because the new disks are some 4MB smaller
than the originals.
ddrescue saw 14 read errors on one disk and 54 on the other, and copied
99.99% of the source.
Is there a way to revive the array, and if so, how can I do that?
Firstly, if the blues don't support SCT/ERC, you NEED NEED NEED to fix
the timeout mismatch problem. I suspect that's what blew up your array.
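(For reference, the two usual fixes from the Timeout_Mismatch wiki page, as
a sketch - /dev/sdX is a placeholder and the commands have to be repeated
for every drive in the array:

    # drives that support SCT ERC: tell them to give up after 7 seconds
    smartctl -l scterc,70,70 /dev/sdX
    # drives that don't: raise the kernel's command timeout instead
    # (value in seconds)
    echo 180 > /sys/block/sdX/device/timeout

Neither setting necessarily survives a reboot, so it usually ends up in a
boot script.)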
Or recreate the array, with WD Red disks? Funny thing is, the disk with
a mismatch in event count was *not* kicked out of the array...
Now because you've got a bunch of read errors, I suspect you're going
to lose some data, sorry. You have two choices.
1) Force assemble the array with all four drives, and run a repair.
This should fix your read errors, but risks losing data thanks to the
event counter mismatch.
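(A minimal sketch of option 1, assuming the array is /dev/md0 and the
fourth device name is a placeholder:

    mdadm --stop /dev/md0
    mdadm --assemble --force /dev/md0 /dev/sda /dev/sdd /dev/sde /dev/sdf
    # trigger the md-level "repair" scrub
    echo repair > /sys/block/md0/md/sync_action

)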
Excuse my ignorance here, but what do you mean by "repair"? Run a fsck?
2) Force assemble the array with the three good drives, then re-add
the fourth. If you've got bitmaps, and it re-adds cleanly, then run a
repair and you *might* get everything back. Otherwise it might just
add, in which case it will re-sync of its own accord; that will give you
a clean array with no errors, but the read errors will cause data loss. Sorry.
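(A sketch of option 2 under the same assumptions - sdW/sdX/sdY stand for
the three drives whose event counts agree, sdZ for the odd one out:

    mdadm --stop /dev/md0
    mdadm --assemble --force /dev/md0 /dev/sdW /dev/sdX /dev/sdY
    mdadm /dev/md0 --re-add /dev/sdZ

If the re-add falls back to a plain add, md will do a full re-sync rather
than a bitmap-based catch-up.)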
I've got to learn a lot - what do you mean by "If you've got bitmaps"?
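(Whether the array has a write-intent bitmap can be seen in the metadata;
a sketch, with /dev/sdX being any member device:

    mdadm --detail /dev/md0 | grep -i bitmap
    mdadm --examine /dev/sdX | grep -i bitmap

An internal bitmap shows up as an "Intent Bitmap" line in --detail and an
"Internal Bitmap" line in --examine.)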
As said, I made copies of the disks with read errors, using ddrescue.
AFAICS ddrescue managed to overcome some, though not all, read errors,
so I expect the new disks to be better in that respect than the originals.
These new disks support TLER, so that's also an improvement.
Wouldn't it be better, with that in mind, to replace the disks with read
errors with the new ones, and revive the array with those?
Note that before you do any re-assembly, you need to do an "array stop",
otherwise pretty much anything you try will fail with "device busy".
Cheers,
Wol
Thanks a lot!
Cheers, Jogchum