Re: md failing mechanism

Adam Goryachev <mailinglists@xxxxxxxxxxxxxxxxxxxxxx> · Mon, 25 Jan 2016 09:13:16 +1100

On 24/01/2016 06:02, James J wrote:
On 23/01/2016 15:09, Wols Lists wrote:
On 22/01/16 23:40, James J wrote:
The recommendation of raising the timeout to 120+ is for the opposite
purpose of what you want. It is for the case where the sysadmin is
willing to wait a long time because he wants to prevent the drive
being kicked at the first read error (normally drives are kicked for
a write error).
This might be wanted in order to a) defer the replacement of the drive,
either to perform the replacement at a more opportune time and/or in a
better manner such as a no-degrade replace operation, or b) because he
does not want to replace the drive at all: maybe he believes that the
error might be spurious and will not happen again and the drive is still
of acceptable fitness for the purpose, e.g. in a low-cost file server.
Except, aiui, even in your scenario, drives are kicked for a *write* error.

What happens (or should happen) is that the kernel times out, the raid
handles the read error by trying a rewrite, the drive is still hung on
the read error so it doesn't respond to the write request, and the drive
gets kicked for a write failure.

Oh yes you are correct, so the drive would be kicked after 60secs and 
not after 30secs contrary to what I said.
So the sequence would be: drive stuck on read --> scsi read failure 
due to timeout at the 30th second --> MD receives failure and attempts 
rewrite --> scsi write failure due to timeout at the 60th second --> 
drive kicked by MD at the 60th second
I think this is what should have happened, but it didn't happen like 
this anyway so I think there is probably a kernel bug somewhere.
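
To make the arithmetic in that sequence concrete, here is a minimal Python sketch. It is only a model, not kernel code: it assumes one SCSI command timeout per request (30 seconds, the value used above), no retries or resets by the error handler, and the function name kick_timeline is made up for illustration.

```python
from typing import List, Tuple


def kick_timeline(scsi_timeout_s: int = 30) -> List[Tuple[int, str]]:
    """Return (seconds, event) pairs for the expected sequence.

    Simplified model: each command fails after exactly one timeout,
    and MD reacts immediately to each failure.
    """
    events = []

    # The drive is stuck on the read, so the read fails at the first timeout.
    t = scsi_timeout_s
    events.append((t, "scsi read failure due to timeout"))
    events.append((t, "MD receives failure and attempts rewrite"))

    # The rewrite goes to the same hung drive and times out in turn.
    t += scsi_timeout_s
    events.append((t, "scsi write failure due to timeout"))
    events.append((t, "drive kicked by MD"))
    return events


if __name__ == "__main__":
    for when, what in kick_timeline(30):
        print(f"t={when:>3}s  {what}")
```

In the same model, raising the timeout to 120 puts the kick at 240 seconds, since the read and the rewrite each wait out one full timeout.
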
I don't have a lot to add, except that I recall the OP suggested it was
an IDE drive. I wonder if the IDE sub-system and/or hardware operates
differently compared to the SATA variants. Possibly the MD layer never
got any timeout or error on the read (or maybe on the write), and
hence the drive was never kicked from the array.
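
If anyone wants to check which timeout the kernel is actually applying to each drive, here is a rough Python sketch. It assumes the drive is handled through the SCSI layer and so exposes /sys/block/<dev>/device/timeout; the function name scsi_timeouts is mine, and a device that has no such file (which may be the case for a drive on the old IDE driver) is simply skipped.

```python
from pathlib import Path
from typing import Dict


def scsi_timeouts(sys_block: str = "/sys/block") -> Dict[str, int]:
    """Map block device name -> command timeout in seconds.

    Only devices that expose device/timeout are reported; anything
    else (loop devices, md arrays, and so on) is left out.
    """
    result: Dict[str, int] = {}
    root = Path(sys_block)
    if not root.is_dir():
        return result

    for dev in sorted(root.iterdir()):
        attr = dev / "device" / "timeout"
        try:
            result[dev.name] = int(attr.read_text().strip())
        except (OSError, ValueError):
            # No timeout attribute, or unreadable/unexpected content.
            continue
    return result


if __name__ == "__main__":
    for name, secs in scsi_timeouts().items():
        print(f"{name}: {secs}s")
```

A drive missing from the output would at least be consistent with the idea that it is not going through the same timeout path as the SATA drives.
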

Regards,
Adam