Re: devices get kicked from RAID about once a month

Dan Christensen <jdc@xxxxxx> · Fri, 04 Jun 2010 09:30:09 -0400

Neil Brown <neilb@xxxxxxx> writes:

> On Thu, 03 Jun 2010 12:47:39 -0400 Dan Christensen <jdc@xxxxxx> wrote:
>
>> That could be useful.  And, as Neil said, if the SATA driver could be
>> told to use longer timeouts, that might help.  Neil, if you think that's
>> a good idea, maybe you could put the request in with the SATA folks?
>
> It might be a good idea.

After thinking about it more, I'm not sure I fully understand the
situation.  

If I were able to turn on something like TLER on the drives, so that a
failing read reported its error more quickly, what would the raid layer
do when it got a read error?

If the raid layer handles this in a clever way (and I recall some
discussions about this), e.g. by reconstructing the data from the other
drives and rewriting the sector so that the drive can remap it, then what
I don't fully understand is why it doesn't also do this when a read
times out.  Is it because timeouts can indicate more serious problems?
Even so, wouldn't it be reasonable for the raid layer to give the drive
a second chance before assuming it has failed?
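
To make the "reconstruct and rewrite" idea concrete, here is a toy
sketch of what I understand the recovery to look like for a single
XOR-parity stripe.  This is only my mental model, not the md code; the
names xor_blocks and read_with_repair are made up for illustration, and
the "rewrite" is just a list assignment standing in for a write to the
drive.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def read_with_repair(stripe, bad_index):
    """Rebuild the block at bad_index from the rest of the stripe.

    stripe is a list of equal-length bytes objects: the data blocks plus
    one XOR parity block.  XOR-ing every surviving block gives back the
    missing one.  The rebuilt data is then written over the bad block,
    which in a real array is what would let the drive remap the sector.
    """
    survivors = [blk for i, blk in enumerate(stripe) if i != bad_index]
    rebuilt = xor_blocks(survivors)
    stripe[bad_index] = rebuilt  # stand-in for rewriting the sector
    return rebuilt
```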

These questions are motivated by the following reasoning.  Since it is
generally recognized that quicker read errors (e.g. TLER) are good
for drives in a raid array, *increasing* the SATA timeouts seems like it
is going in the wrong direction.  Wouldn't it be better to have short
timeouts, but have the raid layer treat a timeout less seriously?
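
In case it helps, here is a rough sketch of the kind of policy I have in
mind.  It is purely hypothetical: the class and enum names and the
max_timeouts knob are invented for this example, and I am not claiming
that md works this way or exposes anything like it.  A media error
triggers a rebuild and rewrite, while a timeout gets a limited number of
retries before the device is kicked.

```python
from enum import Enum


class ReadResult(Enum):
    OK = "ok"
    MEDIA_ERROR = "media_error"  # drive reported an unreadable sector
    TIMEOUT = "timeout"          # command did not complete in time


class Action(Enum):
    NONE = "none"
    REBUILD_AND_REWRITE = "rebuild_and_rewrite"
    RETRY = "retry"
    KICK = "kick"


class DevicePolicy:
    """Hypothetical per-device policy: tolerate a few consecutive timeouts."""

    def __init__(self, max_timeouts=2):
        # Number of consecutive timeouts at which the device is kicked.
        self.max_timeouts = max_timeouts
        self.timeouts = 0

    def on_read(self, result):
        if result is ReadResult.OK:
            self.timeouts = 0  # a good read clears the strike count
            return Action.NONE
        if result is ReadResult.MEDIA_ERROR:
            return Action.REBUILD_AND_REWRITE
        # Timeout: give the drive another chance until the budget runs out.
        self.timeouts += 1
        if self.timeouts < self.max_timeouts:
            return Action.RETRY
        return Action.KICK
```

With max_timeouts=2, the first timeout leads to a retry and only a
second consecutive timeout kicks the device.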

Dan
