Re: mdadm expanded 8 disk raid 6 fails in new server, 5 original devices show no md superblock


 



On Tue, 2014-01-14 at 08:14 -0500, Phil Turmel wrote:

> What did "smartctl -l scterc" say?  If it says unsupported, you have
> a problem.  The workaround is to set the driver timeouts to ~180 seconds
> for each such drive.
> 
> If scterc is supported, but disabled, you can set 7-second timeouts with
> "smartctl -l scterc,70,70", but you must do so on every power cycle.
> Either way, you need boot-time scripting or distro support.
> 
> Raid-rated drives power up with a reasonable setting here.
> 
> Many people discover the timeout problem the first time they have an
> otherwise correctable read error in their array, and the array falls
> apart instead.  This list's archives are well-populated with such cases.

Snipped for brevity above.

I understand the "timeout" issue on drives that may perform long internal
error recovery: the device (block?) driver's command timeout fires first,
and the drive gets kicked from the array. Raising the driver timeout
allows the drive some time to try and fix things itself, at the expense
of a hung array for a longer period of time.
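The boot-time scripting Phil refers to might look something like the
sketch below. This is only my illustration, not a tested script: the
grep pattern for detecting ERC support and the /dev/sd[a-h] device glob
are assumptions (smartctl's exact wording varies by version), so adjust
both for the real array members.

```shell
# Sketch of a boot-time script (rc.local or a systemd unit) implementing
# Phil's workaround.  Assumptions: smartctl's scterc output contains
# "not supported" when the drive lacks ERC, and the array members are
# /dev/sda through /dev/sdh.

erc_supported() {
    # Succeeds unless smartctl's scterc output says the command is unsupported
    ! printf '%s\n' "$1" | grep -qi 'not supported'
}

set_disk_timeout() {
    dev="$1"
    if erc_supported "$(smartctl -l scterc "$dev" 2>/dev/null)"; then
        # Drive supports SCT ERC: ask it to give up after 7.0 seconds
        smartctl -l scterc,70,70 "$dev"
    else
        # No ERC: raise the kernel's command timeout instead, so the
        # driver outlasts the drive's own (long) internal retries
        echo 180 > "/sys/block/${dev##*/}/device/timeout"
    fi
}

# Only touch real hardware when run as root with smartctl installed
if command -v smartctl >/dev/null 2>&1 && [ "$(id -u)" -eq 0 ]; then
    for dev in /dev/sd[a-h]; do set_disk_timeout "$dev" || true; done
fi
```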

I also understand that with scterc the drive gives up (in effect timing
itself out) when it hits the 7-second, or thereabouts, mark, and
subsequently mdadm kicks the drive out. In this specific instance the
idea is to fail a drive quickly so that the raid doesn't hang for longer
than a few seconds.

However, surely these two approaches (bar the amount of time involved)
lead to the same final result: a drive being kicked out. Even in a
non-mdadm hardware raid setup, the drive is either kicked because it
didn't return within 7 seconds, or the drive kicks itself because it
gave up before 7 seconds.

If anything, surely when you have a degraded array that will fail if any
more disks are kicked, you actually need to do the reverse of the normal
raid wisdom: set the timeout in the device (block) layer as long as
possible, and if the drives have scterc enabled then disable it
(assuming the drive physically allows that and, when disabled, performs
harder, or any, internal retry/crc/etc.), to force the drives to give
their all to recover any, as yet unknown, potentially failing sectors
should they occur during a rebuild of a failed drive.

Surely, unless I'm missing something, rebuilding a failed drive's data
means you want the system not to kick a drive if at all possible, and
having scterc enabled, or a timeout shorter than the drive's maximum
recovery time (unless that time is indefinite retry), is the last thing
you want?
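For what it's worth, the "reverse" setup I'm describing could be
sketched roughly as below. Again this is only an illustration under
assumptions: that "scterc,0,0" disables ERC (not every drive honours
it), and that /dev/sd[a-h] stand in for the real array members.

```shell
# Hedged sketch only: disable ERC so the drives retry as hard as they
# can, and raise the kernel command timeout so the driver outlasts them.
# Assumption: "scterc,0,0" disables ERC on drives that support it.

timeout_path() {
    # Map a device node like /dev/sda to its kernel command-timeout file
    printf '/sys/block/%s/device/timeout' "${1##*/}"
}

# Only touch real hardware when run as root with smartctl installed
if command -v smartctl >/dev/null 2>&1 && [ "$(id -u)" -eq 0 ]; then
    for dev in /dev/sd[a-h]; do
        smartctl -l scterc,0,0 "$dev" || true      # drive: retry indefinitely
        echo 300 > "$(timeout_path "$dev")" 2>/dev/null || true  # kernel: wait 300 s
    done
fi
```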

> 
> Regards,
> 
> Phil


Jon


--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



