Re: Failed during rebuild (raid5)

On 03/05/13 12:23, Andreas Boman wrote:
> I have (had?) a RAID 5 array with 5 disks (1.5 TB each). smartd warned that one was going bad, so I replaced it with an identical disk.
> I issued: mdadm --manage /dev/md127 --add /dev/sdX
>
> The array seemed to be rebuilding; it was at around 15% when I went to bed.
>
> This morning I came up to see the array degraded with two drives missing: another disk had failed during the rebuild.
>
> I powered the system down, and since I still have the disk smartd flagged as bad, I tried plugging it back in and powering up, hoping to see the array come back. No such luck (not enough disks).

Unfortunately this happens way too often: your RAID members silently fail over time. They develop bad blocks, and you won't know about them until you try to read or write one of those blocks; when that happens, the disk gets kicked out of the array. At that stage you replace the disk, not knowing that areas of the other RAID members have also gone bad. The only sensible options are to run RAID 6, which dramatically reduces the chance of a double failure, or to run RAID 5 with a weekly (at least) check of the entire array for bad blocks, carefully monitoring the SMART counters after each run (reading the entire array will cause any bad blocks to be detected and reallocated).
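For reference, a minimal sketch of such a periodic check (the array name md127 is taken from this thread; /dev/sdX is a placeholder for each member):

    # start a full read/verify pass over the array (a "scrub")
    echo check > /sys/block/md127/md/sync_action

    # watch progress
    cat /proc/mdstat

    # afterwards, inspect each member's SMART counters
    smartctl -A /dev/sdX | grep -i -e reallocated -e pending

Many distributions ship a cron job that does exactly this (e.g. Debian's checkarray), so it may be enough to just enable it.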
> I powered the system down again, and now I'm trying to evaluate my best options to recover. Hoping to have some good advice in my inbox when I get back from the office. I'll be able to boot the thing up and get log info this afternoon.

I had to recover an array like that twice. The most important thing is probably to mitigate the data loss on the second drive, the one failing right now, by "ddrescueing" all of its data onto another drive before it gets more damaged (the longer the failing drive stays online, the less chance you have). Use GNU ddrescue for that purpose.
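GNU ddrescue keeps a mapfile so it can resume and retry; a minimal two-pass invocation (device names here are placeholders) could look like this:

    # first pass: copy everything readable, skip bad areas quickly
    ddrescue -f -n /dev/sd_failing /dev/sd_new rescue.map

    # second pass: go back and retry the bad areas a few times
    ddrescue -f -d -r3 /dev/sd_failing /dev/sd_new rescue.map

The -f flag is required because the output is a block device, -d uses direct disc access, and -r3 retries each bad area up to three times.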

Once you have rescued the failing drive onto a new one, you can try putting that recovered drive in place of the failing one and restarting the resync as you did before.
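The exact commands depend on the state of the superblocks, but a hedged sketch (device names and member count are assumptions) would be to compare event counters, force-assemble from the best members including the ddrescued copy, and only then re-add the replacement for the originally failed disk:

    # compare event counts and states across the members
    mdadm --examine /dev/sd[bcdef] | grep -E '^/dev/|Events'

    # force-assemble the array from the four most up-to-date members,
    # one of which is the ddrescued copy of the recently failed disk
    mdadm --assemble --force /dev/md127 /dev/sdb /dev/sdc /dev/sdd /dev/sde

    # once it is running degraded, add the replacement disk and let it rebuild
    mdadm --manage /dev/md127 --add /dev/sdf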

Note that it would probably also be worthwhile to ddrescue the drive you originally took out (if it is still in good enough shape), in case the second drive cannot be recovered correctly or is missing some data.

Regards,
Ben.

> Thanks!
> Andreas
>
> (please cc me, not subscribed)


--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



