On 23/01/2016 15:09, Wols Lists wrote:
On 22/01/16 23:40, James J wrote:
The recommendation of raising the timeout to 120+ is for the opposite
purpose of what you want. It is for the case where the sysadmin accepts
waiting a long time because he wants to prevent the drive being kicked at
the first read error (normally drives are kicked for a write error).
This might be wanted in order to a) defer the replacement of the drive,
either to perform the replacement at a more opportune time and/or in a
better manner such as a no-degrade replace operation, or b) because he
does not want to replace the drive at all: maybe he believes that the
error might be spurious and will not happen again and the drive is still
of acceptable fitness for the purpose, e.g. in a low-cost file server.
Except, aiui, even in your scenario, drives are kicked for a *write* error.
What should happen is that the kernel times out, the raid handles the
read error by trying a rewrite, the drive is still hung on the read
error so it doesn't respond to the write request, and the drive gets
kicked for a write failure.
Oh yes, you are correct, so the drive would be kicked after 60 secs and
not after 30 secs, contrary to what I said.
So the sequence would be: drive stuck on read --> scsi read failure due
to timeout at the 30th second --> MD receives failure and attempts
rewrite --> scsi write failure due to timeout at the 60th second -->
drive kicked by MD at the 60th second
I think this is what should have happened, but it didn't happen like
this anyway so I think there is probably a kernel bug somewhere.
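
To make the arithmetic explicit, here is a rough sketch of the expected
timeline. It is only an illustration, not kernel code: the function name
kick_timeline is made up, and it assumes that the read and the rewrite
each hit the same SCSI command timeout once, that MD issues the rewrite
immediately after the read failure, and that there are no retries in
between.

```python
def kick_timeline(scsi_timeout_s=30):
    """Return (seconds, event) pairs for the expected sequence.

    Assumption: both the read and the rewrite time out exactly once,
    each after scsi_timeout_s seconds, with no retries in between.
    """
    read_fail = scsi_timeout_s               # the read times out first
    write_fail = read_fail + scsi_timeout_s  # the rewrite then times out too
    return [
        (0, "drive stuck on read"),
        (read_fail, "scsi read failure due to timeout"),
        (read_fail, "MD receives failure and attempts rewrite"),
        (write_fail, "scsi write failure due to timeout"),
        (write_fail, "drive kicked by MD"),
    ]


if __name__ == "__main__":
    for t, event in kick_timeline():
        print(f"{t:>4}s  {event}")
```

Under the same assumptions, a timeout of 120 would put the kick at 240
seconds instead of 60.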