Re: Question about raid robustness when disk fails

Goswin von Brederlow <goswin-v-b@xxxxxx> · Wed, 27 Jan 2010 11:25:53 +0100

Asdo <asdo@xxxxxxxxxxxxx> writes:

> Goswin von Brederlow wrote:
>> Michael Evans <mjevans1983@xxxxxxxxx> writes:
>>
>>> Why doesn't the kernel issue a pessimistic alternate 'read' path (on
>>> the other drives needed to obtain the data) if the ideal method is
>>> late.  It would be more useful for time-sensitive/worst case buffering
>>> to be able to customize when to 'give up' dynamically.
>>>
>>
>> That is a very good question. I look forward to seeing patches for this
>> from you. :) I think it isn't done because nobody has bothered to write
>> the code yet but maybe I'm wrong and it would make the code too
>> complicated.
>>
>
> This is probably more complicated than allowing a timeout to be set at
> the MD layer or block-device layer, isn't it?

There are timeouts at various levels already, but the SCSI specs, for
example, allow quite some time before you give up, as in a minute. You
would certainly want something much, much smaller here.
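
To see what the existing low-level timeout is on a given box, something
like the following could be used. This is only a sketch: it assumes a
Linux system where the per-device SCSI command timeout (in seconds) is
exposed as /sys/block/<dev>/device/timeout, and the function name
scsi_command_timeouts is made up for illustration.

```python
from pathlib import Path


def scsi_command_timeouts(sysfs_root="/sys/block"):
    """Return {device name: timeout in seconds} for devices exposing one.

    Assumes the sysfs layout <sysfs_root>/<dev>/device/timeout; devices
    without that file (e.g. loop devices) are simply skipped.
    """
    timeouts = {}
    for path in sorted(Path(sysfs_root).glob("*/device/timeout")):
        # path looks like <sysfs_root>/sda/device/timeout
        device = path.parent.parent.name
        timeouts[device] = int(path.read_text().strip())
    return timeouts
```

Calling scsi_command_timeouts() with no arguments reads the live values.
Writing a smaller number to the same file as root should lower the
timeout for that device, but that only changes when the SCSI layer gives
up, not what the raid layer does in the meantime.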

So off the top of my head here is what I imagine you need: You would
need to set a timeout for reading a block. Then once the timeout is
reached you need to read the rest of the stripe if not available
already. Do you read every block in a stripe or just enough to get the
data? You might not need all blocks, e.g. a 3-way raid1 or a raid6
doesn't need all blocks. But then you have another timeout situation
there.

So let's say we read all blocks for simplicity's sake. Then you might have
scheduled more reads than you need, and when enough reads were
successful you should not wait for the rest but return the data
immediately. Late arrivals from extra reads (or the original) you then
have to handle as well. Or do you cancel them? Also the original read might
succeed before the extra reads return.
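
The scheme described so far can be sketched in userspace Python with
asyncio. This is an illustration only, not md code: hedged_read and its
parameters are hypothetical names, the "reads" are arbitrary coroutine
functions, and the actual reconstruction (mirror copy or parity math) is
left to the caller. The sketch picks the "cancel them" answer for late
arrivals.

```python
import asyncio


async def hedged_read(primary, others, needed, timeout):
    """Read one block; fall back to the rest of the stripe if it is late.

    primary -- hypothetical coroutine function reading the wanted block
    others  -- hypothetical coroutine functions reading the other members
    needed  -- how many of `others` suffice to reconstruct the block
    timeout -- seconds to wait before scheduling the extra reads
    """
    first = asyncio.ensure_future(primary())
    done, _ = await asyncio.wait({first}, timeout=timeout)
    if done and first.exception() is None:
        return "direct", first.result()

    # Late or failed: schedule every other block, for simplicity's sake.
    pending = {asyncio.ensure_future(read()) for read in others}
    if not done:
        pending.add(first)  # the original may still win the race
    got = []
    try:
        while pending:
            done, pending = await asyncio.wait(
                pending, return_when=asyncio.FIRST_COMPLETED)
            for task in done:
                if task.exception() is not None:
                    continue  # a failed read contributes nothing
                if task is first:
                    # The original succeeded before enough extras returned.
                    return "direct", task.result()
                got.append(task.result())
            if len(got) >= needed:
                # Enough blocks arrived; do not wait for the rest.
                return "reconstructed", got[:needed]
        raise IOError("not enough blocks could be read")
    finally:
        # Late arrivals are cancelled here instead of being handled.
        for task in pending:
            task.cancel()
```

In the kernel, cancelling is not that easy, and as discussed further
down you may well want the slow read to run to completion so that a real
error gets reported and the block rewritten.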

It might also be wise to notice when additional reads are slower than
the original and, if that happens often, increase the initial timeout
slightly. But a warning for the admin would do too, so he can adjust the
timeout himself.
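
A rough sketch of that bookkeeping, again in Python for brevity. The
class name TimeoutTuner and all the numbers (window of 100 reads, 20%
threshold, 10% increase) are assumptions picked for illustration, not
tested values.

```python
import logging
from collections import deque


class TimeoutTuner:
    """Raise the initial timeout when extra reads keep losing the race."""

    def __init__(self, timeout, window=100, threshold=0.2, step=1.1):
        self.timeout = timeout      # current initial read timeout, seconds
        self.threshold = threshold  # tolerated fraction of wasted fallbacks
        self.step = step            # factor applied when raising the timeout
        self.history = deque(maxlen=window)

    def record(self, original_won):
        """Record one timed-out read and return the timeout to use next.

        original_won is True when the original read still finished before
        the extra reads did, i.e. the fallback was wasted effort.
        """
        self.history.append(bool(original_won))
        if len(self.history) < self.history.maxlen:
            return self.timeout  # not enough samples yet
        wasted = sum(self.history) / len(self.history)
        if wasted > self.threshold:
            self.timeout *= self.step
            logging.warning(
                "%.0f%% of fallback reads were slower than the original; "
                "initial timeout raised to %.3fs", wasted * 100, self.timeout)
            self.history.clear()  # start a fresh window after adjusting
        return self.timeout
```

Dropping the `self.timeout *= self.step` line leaves just the warning,
which is the "let the admin adjust it" variant.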

I don't think setting the timeout for the initial read will be
complicated, but handling the alternatives will not be trivial. If you
implement it you will probably find more problems along the way.

> Which would be just as good I think.
>
> Is it possible to cancel a SATA/SCSI command that is being executed by
> the drive?
> (it's probably feasible only with NCQ disabled anyway, but it's easy
> to disable NCQ)

Do you want to do that? I would rather have the drive keep trying and
return an error if it can't read so the raid layer rewrites the blocks
causing it to be remapped. I do not want to wait for that but I want it
to happen.

> It's a pity we have to rely on TLER, this narrows the choice of drives
> a lot...

I don't. I just acknowledge the limitation and accept the downtime to
find and remove a broken but not properly failed disk. I use raid so I
don't lose my data when a disk fails, not primarily for availability.
So far I have had one case in 10 years where a failing disk took down my
system.

MfG
        Goswin