Re: Help with two momentarily failed drives out of a 4x3TB Raid 5

On Mar 10, 2013, at 6:33 PM, Javier Marcet <jmarcet@xxxxxxxxx> wrote:

> On Mon, Mar 11, 2013 at 1:12 AM, Mathias Burén <mathias.buren@xxxxxxxxx> wrote:
>> 
>> So how are the drives doing? smartctl -a for all HDDs please.
> 
> http://bpaste.net/raw/82828/


Two of the four drives report bad sectors (Current_Pending_Sector). We'd need the full dmesg from the time the array collapsed to be sure, but I bet dollars to donuts that disk 1 drops out for some reason (?) and shortly thereafter the other drive hits an ERR UNC on its bad sector, causing the array to collapse.

The first disk ejected is probably not in sync with the array and needs to be rebuilt. The other drive might be only slightly out of sync, and it's worth forcing an assemble to find out. After that, the bad sectors need to be repaired; how difficult that is depends on how much was written to the degraded array between the first disk's ejection and the collapse.
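A forced assemble might look like the sketch below. The array and device names are hypothetical; comparing event counts with mdadm --examine first tells you which member is the stale, first-ejected one to leave out:

```shell
# Compare event counts to spot the stale member (device names hypothetical).
mdadm --examine /dev/sd[bcde]1 | grep -E 'Events|/dev/sd'

# Stop whatever is left of the array, then force-assemble with the
# three freshest members, leaving out the first-ejected disk.
mdadm --stop /dev/md0
mdadm --assemble --force /dev/md0 /dev/sdb1 /dev/sdc1 /dev/sdd1
```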

The drives are clearly misconfigured with respect to their controller and/or the Linux SCSI layer timeout for the block devices, or you wouldn't have pending bad sectors. Configured correctly, bad sectors are remapped in the course of normal array operation, as well as during scheduled scrubs.
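For reference, a scrub can be kicked off manually through sysfs (md0 here is a hypothetical array name); "check" only reads and counts mismatches, while "repair" also rewrites, which is what gives the drive a chance to remap a pending sector:

```shell
echo check > /sys/block/md0/md/sync_action    # read-only scrub
cat /sys/block/md0/md/mismatch_cnt            # mismatches found so far
echo repair > /sys/block/md0/md/sync_action   # scrub that also rewrites
```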

Ideally the drives' SCT ERC is lowered to something like 70 deciseconds. Or, if that's not supported by the drives, then the controller and block device timeouts need to be raised to at least the drive's own timeout, using:

echo xxx >/sys/block/sdX/device/timeout

xxx is in seconds. So for a 2 minute drive timeout you'd need that to be at least 120, maybe a few seconds more, to make absolutely certain Linux doesn't time out the block device before the drive itself reports a read error.
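If the drives do support SCT ERC, setting it with smartctl might look like this (a sketch; /dev/sdX is a placeholder, and the setting doesn't survive a power cycle on most drives, so it has to be reapplied at boot):

```shell
# Read the current SCT ERC setting (values are in deciseconds).
smartctl -l scterc /dev/sdX

# Set read and write error recovery limits to 70 deciseconds (7 seconds).
smartctl -l scterc,70,70 /dev/sdX
```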


> By 18:00 today I should have the smartctl results.

Next to pointless, in that the self-test will stop as soon as it finds the first bad sector. But if you have that LBA you can use dd to zero just that sector. While that corrupts the data in that sector, the data is effectively gone anyway, and it prevents another read error so the rebuild can proceed.
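For example, if the self-test were to report the first error at LBA 123456789 (a made-up number here), zeroing just that sector could look like the sketch below. Be careful: this writes directly to the raw disk, and it assumes 512-byte logical sectors.

```shell
# Verify the logical sector size first (the sketch assumes 512 bytes).
blockdev --getss /dev/sdX

# DESTRUCTIVE: overwrites exactly one sector on the raw disk.
# 123456789 is a hypothetical LBA; use the one your self-test reports.
dd if=/dev/zero of=/dev/sdX bs=512 count=1 seek=123456789 oflag=direct
```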

> Could a faulty sata data cable cause those bad blocks? 

No. A few bad sectors are normal. Many, or an increasing count, is cause for the drive to be replaced under warranty.

> How should I proceed to be able to get most of the data? Will I have
> to create a completely new array or can I somehow fix it adding new
> disks?

You're better off recreating the array and restoring from backup. Fixing it will be tedious.


Chris Murphy

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html

