Re: md failing mechanism

James J <james.j@xxxxxxxxxxxxx> · Sat, 23 Jan 2016 20:02:00 +0100

On 23/01/2016 15:09, Wols Lists wrote:
On 22/01/16 23:40, James J wrote:
The recommendation to raise the timeout to 120+ serves the opposite
purpose to what you want. It is for the case where the sysadmin accepts
a long wait because he wants to prevent the drive being kicked at
the first read-error (normally drives are kicked for a write error).
This might be wanted in order to a) defer the replacement of the drive,
either to perform the replacement at a more opportune time and/or in a
better manner such as a no-degrade replace operation, or b) because he
does not want to replace the drive at all: maybe he believes that the
error might be spurious and will not happen again and the drive is still
of acceptable fitness for the purpose, e.g. in a low-cost file server.
Except, aiui, even in your scenario, drives are kicked for a *write* error.

What happens (or should happen) is that the kernel times out, the raid handles the
read error by trying a rewrite, the drive is still hung on the read
error so it doesn't respond to the write request, and the drive gets
kicked for a write failure.

Oh yes, you are correct, so the drive would be kicked after 60 seconds
and not after 30 seconds, contrary to what I said.
So the sequence would be: drive stuck on read --> scsi read failure due 
to timeout at the 30th second --> MD receives failure and attempts 
rewrite --> scsi write failure due to timeout at the 60th second --> 
drive kicked by MD at the 60th second
I think this is what should have happened, but it did not happen like
this, so I think there is probably a kernel bug somewhere.
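
To make the arithmetic of that expected sequence explicit, here is a minimal sketch in Python. It assumes the drive stays hung the whole time and that the read and the rewrite are each subject to the same SCSI command timeout (30 seconds in the case discussed). The function name is hypothetical, and it models only what should have happened, not what was actually observed.

```python
def expected_kick_timeline(scsi_timeout_s=30):
    """Return (time_s, event) pairs for a drive that hangs on a read and stays hung.

    Assumption: the read and the following rewrite each fail only when the
    same SCSI command timeout expires, one after the other.
    """
    read_fail = scsi_timeout_s               # the read command times out
    write_fail = read_fail + scsi_timeout_s  # the rewrite times out as well
    return [
        (0, "drive stuck on read"),
        (read_fail, "scsi read failure due to timeout"),
        (read_fail, "MD receives failure and attempts rewrite"),
        (write_fail, "scsi write failure due to timeout"),
        (write_fail, "drive kicked by MD"),
    ]


if __name__ == "__main__":
    for t, event in expected_kick_timeline(30):
        print(f"{t:>4}s  {event}")
```

Under the same assumptions, passing a larger timeout only moves both failures later; the drive would still be kicked on the write failure, at twice the timeout.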
