On 03/05/13 12:23, Andreas Boman wrote:
I have (had?) a RAID 5 array with 5 disks (1.5 TB each). smartd warned
that one was going bad, so I replaced it with an identical disk.
I issued mdadm --manage --add /dev/md127 /dev/sdX
The array seemed to be rebuilding and was at around 15% when I went to
bed. This morning I came up to see the array degraded with two missing
drives: another one had failed during the rebuild.
I powered the system down and, since I still have the disk smartd
flagged as bad, tried plugging it back in and powering up, hoping to
see the array come back up. No such luck (not enough disks).
Unfortunately this happens way too often: your RAID members fail
silently over time. They develop bad blocks, and you won't know about
it until you try to read or write one of them. When that happens, the
disk gets kicked out of the array. At that stage you replace the disk,
not knowing that areas of the other RAID members have also failed, and
the rebuild, which has to read every remaining member in full, runs
into them. The only sensible options are to run RAID 6, which
dramatically reduces the potential for a double failure, or to run
RAID 5 with a check of the entire array at least weekly, carefully
monitoring the changes SMART reports after each check (reading the
entire array causes bad blocks to be detected and reallocated, if
there are any).
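
For what it's worth, here is a minimal sketch of how such a check can be
started by hand or from cron. It relies on the md driver's sysfs files
sync_action and mismatch_cnt. The function names and the optional second
argument (an alternate sysfs root, only useful for a dry run) are my own
additions, and md127 is simply the array name from this thread.

```shell
# Request a full read-and-verify pass ("check") on one md array.
# Arguments: $1 = array name (e.g. md127)
#            $2 = sysfs root, defaults to /sys (hypothetical dry-run knob)
start_md_check() {
    md_dir="${2:-/sys}/block/$1/md"
    if [ ! -w "$md_dir/sync_action" ]; then
        echo "cannot write $md_dir/sync_action" >&2
        return 1
    fi
    # Writing "check" makes the md driver read every stripe of the array.
    echo check > "$md_dir/sync_action"
}

# Print the mismatch counter left behind by the most recent check.
md_mismatch_count() {
    cat "${2:-/sys}/block/$1/md/mismatch_cnt"
}
```

Run as root, "start_md_check md127" kicks off the pass and /proc/mdstat
shows its progress. Afterwards, comparing "smartctl -A /dev/sdX" output
for each member against the previous week's (reallocated and pending
sector counts in particular) is the monitoring step described above.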
I powered the system down again, and now I'm trying to evaluate my
best options to recover. Hoping to have some good advice in my inbox
when I get back from the office. I'll be able to boot the thing up and
get log info this afternoon.
I have had to recover an array like that twice. The most important
thing is probably to limit the data loss on the second drive, the one
that is failing right now, by copying all of its data onto another
drive with GNU ddrescue before it gets more damaged (the longer the
failing drive stays online, the lower your chances).
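
A rough sketch of the two-pass ddrescue run I have in mind is below.
The function name, the DDRESCUE override variable (handy for previewing
the calls with a stub) and the choice of three retry passes are
assumptions of mine, not anything mandated by the tool. Source and
destination are whatever your failing disk and fresh disk turn out to
be, so double-check the order before running it.

```shell
# Two-pass copy of a failing disk with GNU ddrescue.
# Arguments: $1 = failing source disk, $2 = replacement disk, $3 = mapfile
# DDRESCUE may name another command to run instead (hypothetical override).
rescue_disk() {
    src=$1; dst=$2; map=$3
    # Pass 1: copy everything that reads cleanly and skip the slow scraping
    # of damaged areas (-n). -f is required to overwrite a block device.
    ${DDRESCUE:-ddrescue} -f -n "$src" "$dst" "$map" || return 1
    # Pass 2: revisit only the bad areas, with direct disc access (-d) and
    # up to three retry passes (-r3). The mapfile carries state between runs.
    ${DDRESCUE:-ddrescue} -f -d -r3 "$src" "$dst" "$map"
}
```

Keep the mapfile on a healthy filesystem, not on either of the two
disks. It lets an interrupted run resume where it stopped and records
which areas could not be read.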
Once you have rescued the failing drive onto a new one, you could then
try to add that new recovered drive in place of the failing one and
start the resync as you did before.
Note that it would probably be worthwhile to ddrescue the initial drive
that you took out (if it is still good enough to do so) in case the
second drive cannot be recovered correctly or is missing some data.
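
One possible way to bring the array back with the rescued copy in
place, sketched under assumptions: a ddrescue clone carries the md
superblock of the disk it was copied from, so the array can be
assembled from the surviving members plus the clone. The function
names and the MDADM override variable are hypothetical, and the member
list is for you to fill in. --force asks mdadm to assemble even when
some members' metadata looks slightly out of date, so inspect the
superblocks first.

```shell
# Print the interesting superblock fields of each candidate member.
# MDADM may name another command to run instead (hypothetical override).
show_member_state() {
    for dev in "$@"; do
        echo "== $dev"
        ${MDADM:-mdadm} --examine "$dev" | grep -E 'Events|Array State|Device Role'
    done
}

# Assemble the array from the named members, tolerating small
# differences in their event counters.
# Arguments: $1 = md device, remaining arguments = member devices
force_assemble() {
    md=$1; shift
    ${MDADM:-mdadm} --assemble --force "$md" "$@"
}
```

If the array comes up degraded, the brand-new disk can then be added
with the same mdadm --add command as before to start the resync.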
Regards,
Ben.
Thanks!
Andreas
(please cc me, not subscribed)
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html