Re: Triple parity and beyond

David Brown <david.brown@xxxxxxxxxxxx> · Thu, 21 Nov 2013 21:31:46 +0100

On 21/11/13 21:05, Piergiorgio Sartor wrote:
> On Thu, Nov 21, 2013 at 11:13:29AM +0100, David Brown wrote:
> [...]
>> Ah, you are trying to find which disk has incorrect data so that you can
>> change just that one disk?  There are dangers with that...
> 
> Hi David,
> 
>> <http://neil.brown.name/blog/20100211050355>
> 
> I think we already did the exercise, here :-)
> 
>> If you disagree with this blog post (and I urge you to read it in full
> 
> We discussed the topic (with Neil) and, if I
> recall correctly, he is against having
> _automatic_ error detection and correction _in_
> the kernel.
> I fully agree with that: user space is better
> and it should not be automatic, but it should
> do things under user control.
> 

OK.

> The current "check" operation is pretty poor.
> It just reports how many mismatches, it does
> not even report where in the array.
> The first step, independent from how many
> parities one has, would be to tell the user
> where the mismatches occurred, so it would
> be possible to check the FS at that position.

Certainly it would be good to give the user more information.  If you
can tell the user where the errors are, and what the likely failed block
is, then that would be very useful.  If you can tell where it is in the
filesystem (such as which file, if any, owns the blocks in question)
then that would be even better.
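
As a rough illustration of reporting positions instead of a bare count, here is a minimal Python sketch.  It is not how md's "check" works internally.  It assumes single (XOR) parity, stripes already read into memory, and a per-device byte offset of simply stripe index times chunk size (ignoring any data offset).  The names xor_blocks and find_mismatches are made up for this example.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def find_mismatches(stripes, chunk_size):
    """Return (stripe_index, per_device_byte_offset) for each bad stripe.

    `stripes` is a list of (data_blocks, stored_parity) pairs.
    The offset is an assumption: stripe_index * chunk_size, with no
    allowance for superblocks or data offsets on the member devices.
    """
    bad = []
    for index, (data_blocks, stored_parity) in enumerate(stripes):
        if xor_blocks(data_blocks) != stored_parity:
            bad.append((index, index * chunk_size))
    return bad
```

A list of positions like this is what a user (or a filesystem-aware tool) would need in order to look up which file, if any, owns the affected blocks.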

> Having a multi-parity RAID even allows
> checking which disk.
> This would provide the user with a more
> comprehensive (I forgot the spelling)
> information.
> 
> Of course, since we are there, we can
> also give the option to fix it.
> This would be much like a "fsck".

If this can all be done to give the user an informed choice, then it
sounds good.

One issue here is whether the check should be done with the filesystem
mounted and in use, or only off-line.  If it is off-line then it will
mean a long down-time while the array is checked - but if it is online,
then there is the risk of confusing the filesystem and caches by
changing the data.

> 
>> first), then this is how I would do a "smart" stripe recovery:
>>
>> First calculate the parities from the data blocks, and compare these
>> with the existing parity blocks.
>>
>> If they all match, the stripe is consistent.
>>
>> Normal (detectable) disk errors and unrecoverable read errors get
>> flagged by the disk and the IO system, and you /know/ there is a problem
>> with that block.  Whether it is a data block or a parity block, you
>> re-generate the correct data and store it - that's what your raid is for.
> 
> That's not always the case, otherwise
> having the mismatch count would be useless.
> The issue is that errors appear, whatever
> the reason, without being reported by the
> underlying hardware.
>  

(I know you know how this works, so I am not trying to be patronising
with this explanation - I just think we have slightly misunderstood what
the other is saying, so spelling it out will hopefully make it clearer.)

Most disk errors /are/ detectable, and are reported by the underlying
hardware - small surface errors are corrected by the disk's own error
checking and correcting mechanisms, and larger errors are usually
detected.  It is (or should be!) very rare that a read error goes
undetected without there being a major problem with the disk controller.
And if the error is detected, then the normal raid processing kicks in
as there is no doubt about which block has problems.

>> If you have no detected read errors, and there is one parity
>> inconsistency, then /probably/ that block has had an undetected read
>> error, or it simply has not been written completely before a crash.
>> Either way, just re-write the correct parity.
> 
> Why re-write the parity if I can get
> the correct data there?
> If I can be sure that one data block is
> incorrect and I can re-create it properly,
> that's the thing to do.

If you can be /sure/ about which data block is incorrect, then I agree -
but you can't be /entirely/ sure.  You can, however, make a good enough
guess to recommend a fix to the user - as long as it is not automatic.
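
To make the "good enough guess" concrete, here is a Python sketch of the usual two-parity (P and Q) single-error location trick.  It assumes the common RAID-6 arithmetic (GF(2^8) with polynomial 0x11d and generator 2) and that at most one block in the stripe is wrong.  That second assumption is exactly the one that cannot be guaranteed.  The function names are invented for this example, and nothing here is taken from the md code.

```python
# Log/antilog tables for GF(2^8), polynomial 0x11d, generator 2.
GF_EXP = [0] * 510
GF_LOG = [0] * 256
_x = 1
for _i in range(255):
    GF_EXP[_i] = _x
    GF_LOG[_x] = _i
    _x <<= 1
    if _x & 0x100:
        _x ^= 0x11D
for _i in range(255, 510):
    GF_EXP[_i] = GF_EXP[_i - 255]


def gf_mul(a, b):
    """Multiply two field elements using the log tables."""
    if a == 0 or b == 0:
        return 0
    return GF_EXP[GF_LOG[a] + GF_LOG[b]]


def make_pq(data):
    """Compute P (plain XOR) and Q (XOR of g^i * d_i) for a stripe."""
    p = bytearray(len(data[0]))
    q = bytearray(len(data[0]))
    for i, block in enumerate(data):
        for pos, byte in enumerate(block):
            p[pos] ^= byte
            q[pos] ^= gf_mul(GF_EXP[i], byte)
    return bytes(p), bytes(q)


def locate_single_error(data, p, q):
    """Guess which block is wrong, ASSUMING at most one block is wrong.

    Returns "consistent", "P", "Q", a data disk index, or "multiple"
    when no single block explains the mismatch.
    """
    calc_p, calc_q = make_pq(data)
    suspects = set()
    for pos in range(len(p)):
        ps = calc_p[pos] ^ p[pos]  # P syndrome at this byte
        qs = calc_q[pos] ^ q[pos]  # Q syndrome at this byte
        if ps == 0 and qs == 0:
            continue
        if qs == 0:
            suspects.add("P")
        elif ps == 0:
            suspects.add("Q")
        else:
            # An error e on data disk z gives ps = e and qs = g^z * e.
            z = (GF_LOG[qs] - GF_LOG[ps]) % 255
            suspects.add(z if z < len(data) else "multiple")
    if not suspects:
        return "consistent"
    if len(suspects) == 1 and "multiple" not in suspects:
        return suspects.pop()
    return "multiple"
```

Note that a data block which was legitimately rewritten just before a crash, with P and Q not yet updated, produces the same syndromes as a silently corrupted block.  The sketch would point at that disk, and "fixing" it would throw away the newer data, which is why the result should only ever be a recommendation.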

>  
>> Remember, this is not a general error detection and correction scheme -
> 
> It is not, but it could be. For free.
> 

For most ECC schemes, you know that all your blocks are set
synchronously - so any block that does not fit in is an error.  With
raid, it could also be that a stripe is only partly written - you can
have two different valid sets of data mixed to give an inconsistent
stripe, without any good way of telling which consistent data is the
best choice.

Perhaps a checking tool can take advantage of a write-intent bitmap (if
there is one) so that it knows if an inconsistent stripe is partly
updated or the result of a disk error.
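
A very small Python sketch of that idea, purely as an illustration.  It assumes the bitmap has already been read into a list of booleans with one bit per fixed-size group of stripes.  This is not md's on-disk bitmap format, and the names classify_mismatch, dirty_bits and stripes_per_bit are hypothetical.

```python
def classify_mismatch(stripe_index, dirty_bits, stripes_per_bit):
    """Give a hint about why a stripe might be inconsistent.

    dirty_bits[n] is True if region n had writes in flight that were
    never confirmed complete (hypothetical in-memory representation).
    """
    region = stripe_index // stripes_per_bit
    if region < len(dirty_bits) and dirty_bits[region]:
        # A write was pending here, so old and new data may be mixed.
        return "possibly-partial-write"
    # No write was pending, so a silent disk error is more plausible.
    return "likely-silent-corruption"
```

The hint is only useful while the bits still reflect the interrupted writes, for example right after a crash and before a resync has cleared them.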

Best regards,

David
