On Fri May 03, 2013 at 12:38:47PM +0100, Benjamin ESTRABAUD wrote:

> On 03/05/13 12:23, Andreas Boman wrote:
> > I have (had?) a RAID 5 array with 5 disks (1.5TB each); smartd warned
> > that one was getting bad, so I replaced it with an identical disk.
> > I issued mdadm --manage --add /dev/md127 /dev/sdX
> >
> > The array seemed to be rebuilding, and was at around 15% when I went
> > to bed.
> >
> > This morning I came up to see the array degraded with two missing
> > drives; another failed during the rebuild.
> >
> > I powered the system down, and since I still have the disk smartd
> > flagged as bad, I tried to just plug that in and power up, hoping to
> > see the array come back up - no such luck (not enough disks).
> >
> Unfortunately this happens way too often: your RAID members silently
> fail over time. They will get some bad blocks, and you won't know about
> it until you try to read or write one of the bad blocks. When that
> happens, a disk will get kicked out. At this stage you'll replace the
> disk, not knowing that other areas of the other RAID members have also
> failed. The only sensible option is to run a RAID 6, which dramatically
> reduces the potential for double failure, or to run a RAID 5 but run a
> weekly (at least) check of the entire array for bad blocks, carefully
> monitoring the SMART-reported changes after running the test (trying to
> read the entire array will cause any bad blocks to be detected and
> reallocated).
>
I'd recommend running a regular check anyway, whatever RAID level
you're running. I wouldn't choose to run RAID5 with more than 5 disks,
or with members larger than 1TB either, but that really comes down to
an individual's cost/benefit view of the situation.

> > I powered the system down again, and now I'm trying to evaluate my
> > best options to recover. Hoping to have some good advice in my inbox
> > when I get back from the office. I'll be able to boot the thing up
> > and get log info this afternoon.
> >
> I had to recover an array like that twice. The most important thing is
> probably to mitigate the data loss on the second drive that is failing
> right now by "ddrescueing" all of its data onto another drive before it
> gets more damaged (the longer the failing drive is online, the less
> chance you have). Use GNU ddrescue for that purpose.
>
> Once you have rescued the failing drive onto a new one, you could then
> try to add that new recovered drive in place of the failing one and
> start the resync as you did before.
>
Not "add", but "force assemble using". Whenever you end up in a
situation with too many failed disks to start your array, ddrescue and
force assemble (mdadm -Af) is your safest option. If that fails you can
look at more extreme measures, but that should always be the first
approach.

Cheers,
    Robin

-- 
     ___
   ( ' }     |       Robin Hill        <robin@xxxxxxxxxxxxxxx> |
  / / )      | Little Jim says ....                            |
 // !!       |      "He fallen in de water !!"                 |
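
For anyone reading this thread later, a minimal sketch of the ddrescue
plus force-assemble sequence Robin describes. The device names here are
illustrative assumptions (/dev/sdd for the failing member, /dev/sde for
the replacement, /dev/sda-/dev/sdc for the surviving members), as is the
mapfile path; only /dev/md127 comes from the original report:

    # Copy the failing member onto the replacement with GNU ddrescue.
    # First pass skips the slow scraping of bad areas; second pass
    # retries them. The mapfile lets the copy be resumed if interrupted.
    ddrescue -f -n /dev/sdd /dev/sde /root/sdd-rescue.map
    ddrescue -f -r3 /dev/sdd /dev/sde /root/sdd-rescue.map

    # Make sure the array is stopped, then force-assemble it from the
    # surviving members plus the rescued copy (mdadm -Af).
    mdadm --stop /dev/md127
    mdadm --assemble --force /dev/md127 /dev/sda /dev/sdb /dev/sdc /dev/sde

The regular check Robin recommends can be triggered by hand or from
cron via the md sync_action interface, e.g.:

    echo check > /sys/block/md127/md/sync_action

with the result inspected afterwards in
/sys/block/md127/md/mismatch_cnt and the drives' SMART counters.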