desktop disks' error recovery timeouts (was: re-add POLICY)

Chris <email.bug@xxxxxxxx> · Mon, 16 Feb 2015 16:15:19 +0000 (UTC)

Phil Turmel <philip <at> turmel.org> writes:

> On 02/16/2015 07:23 AM, Chris wrote:
> > .... with raid members that got pulled and are safe to
> > re-sync. (e.g. after the occasional bad block error that gets remapped by
> > the hard drive's firmware)
> 
> This should not be part of your concern here, as MD will handle
> occasional UREs by reconstructing them and rewriting them on the fly,

Phil, thank you for dropping in with this hint. It very likely applies to
the disks in the docking station. I searched the mailing list, but most hits
only said to search for the keywords. ;-)

To understand the issue, I found
https://en.wikipedia.org/wiki/Error_recovery_control
helpful.

It would be good if this configuration information could be available there
or at https://raid.wiki.kernel.org

Cheers,
Chris

----

I compiled some snippets from your messages that could serve as a basis for
correction/completion by someone knowledgeable:

The default Linux controller (driver) timeout is 30 seconds per command.
Drives that spend longer than the timeout in error recovery will be reset.
If they don't respond to the reset (because they're still busy in recovery)
when the raid tries to write the correct data back to them, they will be
kicked out of the array.
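
To see which timeout currently applies to each disk, something like the
following sketch could be used. It only reads the same sysfs file that is
written further down in this message; the function name show_timeouts and
its optional sysfs-root argument are made up for illustration.

```shell
#!/bin/sh
# Print the driver command timeout (in seconds) for every sd* disk.
# show_timeouts is a hypothetical helper name; the optional argument
# points it at a different sysfs root (default: /sys).
show_timeouts() {
    root=${1:-/sys}
    for f in "$root"/block/sd*/device/timeout; do
        [ -r "$f" ] || continue          # glob matched nothing readable
        dev=${f#"$root"/block/}          # strip the leading path
        dev=${dev%%/*}                   # keep only the disk name
        printf '%s: %s s\n' "$dev" "$(cat "$f")"
    done
}
```

Calling "show_timeouts" with no argument reads the live values under /sys.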

You *must* either set the drive's ERC shorter than the driver timeout, or
set the driver timeout longer than the drive's worst-case recovery time.
The defaults for desktop drives are *not* suitable for Linux software raid.
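
The rule above boils down to one comparison, sketched here. Note the units:
smartctl's scterc values are in tenths of a second (70 means 7.0 seconds),
while the driver timeout is in seconds. The helper name erc_is_safe is
hypothetical, and treating an ERC value of 0 as "disabled" is an assumption
of this sketch.

```shell
#!/bin/sh
# erc_is_safe ERC_DECISECONDS TIMEOUT_SECONDS
# Succeeds (exit 0) only if ERC is enabled and shorter than the driver
# timeout. Assumption: an ERC value of 0 means ERC is disabled.
erc_is_safe() {
    erc_ds=$1
    timeout_s=$2
    [ "$erc_ds" -gt 0 ] && [ "$erc_ds" -lt $((timeout_s * 10)) ]
}
```

For example, "erc_is_safe 70 30" succeeds (7.0 s is below 30 s), while
"erc_is_safe 0 30" fails because ERC is off.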

I strongly encourage you to run "smartctl -l scterc /dev/sdX" for each
of your drives.  For any drive that warns that it doesn't support SCT
ERC, set the controller device timeout to 180 like so:

echo 180 >/sys/block/sdX/device/timeout

If the report says read or write ERC is disabled, run "smartctl -l
scterc,70,70 /dev/sdX" to set it to 7.0 seconds.

You then set up a boot-time script to make these adjustments at every
restart, and make sure you are performing regular scrub runs to ...?
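
A boot-time script along these lines might look like the sketch below. It
tries to set ERC to 7.0 seconds on each disk and falls back to the 180
second driver timeout when that fails. It assumes smartctl exits non-zero
when the drive rejects the scterc setting; please verify that with your
smartmontools version. The function names and the SMARTCTL and SYSFS_ROOT
override variables are made up for illustration.

```shell
#!/bin/sh
# Both variables can be overridden from the environment; the defaults
# are the real tool and the real sysfs tree.
SMARTCTL=${SMARTCTL:-smartctl}
SYSFS_ROOT=${SYSFS_ROOT:-/sys}

# apply_erc_policy DISK   (DISK is a bare name such as sda)
# Assumption: smartctl returns a non-zero exit status when the drive
# does not accept the SCT ERC setting.
apply_erc_policy() {
    disk=$1
    if "$SMARTCTL" -l scterc,70,70 "/dev/$disk" >/dev/null 2>&1; then
        echo "$disk: SCT ERC set to 7.0 s"
    else
        # No usable ERC: give the drive time to finish its own recovery.
        echo 180 > "$SYSFS_ROOT/block/$disk/device/timeout"
        echo "$disk: no usable SCT ERC, driver timeout set to 180 s"
    fi
}

# Apply the policy to every sd* disk found in sysfs.
apply_all() {
    for d in "$SYSFS_ROOT"/block/sd*; do
        [ -d "$d" ] || continue          # glob matched nothing
        apply_erc_policy "${d##*/}"
    done
}
```

Calling "apply_all" as root from whatever your distribution runs at boot
(rc.local or similar) would then redo the settings after every restart.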

You might not want that kind of long device timeout, but then you shouldn't
use desktop drives in md RAID.

Anyone using desktop drives which don't support SCT ERC in md RAID is
liable to see long timeouts on the simplest bad sector, and they would
probably rather keep the drive in the array AND have the sector rewritten
after reconstruction than have the drive failed out of the array.
