Re: md failing mechanism

Wols Lists <antlists@xxxxxxxxxxxxxxx> · Sat, 23 Jan 2016 14:09:45 +0000

On 22/01/16 23:40, James J wrote:
> The recommentation of raising the timeout to 120+ is for the opposite
> purpose of what you want. It is for the case the sysadmin accepts to
> wait a long time because he wants to prevent the kicking of the drive at
> the first read-error (normally drives are kicked for a write error).
> This might be wanted in order to a) defer the replacement of the drive,
> either to perform the replacement at a more opportune time and/or in a
> better manner such as a no-degrade replace operation, or b) because he
> does not want to replace the drive at all: maybe he believes that the
> error might be spurious and will not happen again and the drive is still
> of acceptable fitness for the purpose, e.g. in a low-cost file server.

Except, aiui, even in your scenario! drives are kicked for a *write* error.

What happens (should be) is the kernel times out, the raid handles the
read error by trying a rewrite, the drive is still hung on the read
error so it doesn't respond to the write request, and the drive gets
kicked for a write failure.

Cheers,
Wol
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html