Re: Triple parity and beyond

David Brown <david.brown@xxxxxxxxxxxx> · Fri, 22 Nov 2013 01:32:09 +0100

On 21/11/13 21:52, Piergiorgio Sartor wrote:
> Hi David,
> 
> On Thu, Nov 21, 2013 at 09:31:46PM +0100, David Brown wrote:
> [...]
>> If this can all be done to give the user an informed choice, then it
>> sounds good.
> 
> that would be my target.
> To _offer_ more options to the (advanced) user.
> It _must_ always be under user control.
> 
>> One issue here is whether the check should be done with the filesystem
>> mounted and in use, or only off-line.  If it is off-line then it will
>> mean a long down-time while the array is checked - but if it is online,
>> then there is the risk of confusing the filesystem and caches by
>> changing the data.
> 
> Currently, "raid6check" can work with FS mounted.
> I got the suggestion from Neil (of course).
> It is possible to lock one stripe and check it.
> This should be, at any given time, consistent
> (that is, the parity should always match the data).
> If an error is found, it is reported.
> Again, the user can decide to fix it or not,
> considering all the FS consequences and so on.
> 

If you can lock a stripe and, whenever you change it while it is locked,
make sure any stale copies of its data are flushed from the caches, then
that sounds ideal.
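
To make the "parity should always match the data" check concrete, here is
a rough sketch of the usual RAID-6 P/Q algebra applied to one stripe. It
is only an illustration, not raid6check's actual code: the function names
(syndromes, check_stripe) and the in-memory block layout are assumptions
of mine, and it takes for granted that the stripe is already locked and
read into memory. It uses GF(2^8) with polynomial 0x11d and generator 2.
If both P and Q disagree, a single bad data block z shows up as
dQ = g^z * dP in every differing byte.

```python
# Sketch only: names and layout are hypothetical, not taken from mdadm.

def _gf_mul2(x):
    # Multiply by the generator (2) in GF(2^8), polynomial 0x11d.
    x <<= 1
    if x & 0x100:
        x ^= 0x11d
    return x & 0xFF

# Exponent and logarithm tables for GF(2^8).
GF_EXP = [0] * 255
GF_LOG = [0] * 256
_v = 1
for _i in range(255):
    GF_EXP[_i] = _v
    GF_LOG[_v] = _i
    _v = _gf_mul2(_v)

def gf_mul(a, b):
    if a == 0 or b == 0:
        return 0
    return GF_EXP[(GF_LOG[a] + GF_LOG[b]) % 255]

def syndromes(data_blocks):
    """Compute (P, Q) for a list of equal-length bytes objects."""
    n = len(data_blocks[0])
    p = bytearray(n)
    q = bytearray(n)
    for idx, block in enumerate(data_blocks):
        coeff = GF_EXP[idx % 255]          # g^idx
        for j, byte in enumerate(block):
            p[j] ^= byte
            q[j] ^= gf_mul(coeff, byte)
    return bytes(p), bytes(q)

def check_stripe(data_blocks, stored_p, stored_q):
    """Return 'ok', 'P', 'Q', ('data', index) or 'unknown'."""
    p, q = syndromes(data_blocks)
    dp = bytes(a ^ b for a, b in zip(p, stored_p))
    dq = bytes(a ^ b for a, b in zip(q, stored_q))
    p_bad, q_bad = any(dp), any(dq)
    if not p_bad and not q_bad:
        return "ok"
    if p_bad and not q_bad:
        return "P"                         # only the P block disagrees
    if q_bad and not p_bad:
        return "Q"                         # only the Q block disagrees
    # Both disagree: look for one index z with dq = g^z * dp bytewise.
    candidates = set()
    for a, b in zip(dp, dq):
        if a == 0 and b == 0:
            continue
        if a == 0 or b == 0:
            return "unknown"               # not a single-block error
        candidates.add((GF_LOG[b] - GF_LOG[a]) % 255)
    if len(candidates) == 1:
        z = candidates.pop()
        if z < len(data_blocks):
            return ("data", z)
    return "unknown"
```

Note that a result like ("data", 2) is only the most likely explanation
under the assumption that exactly one block is wrong, which is why it
should be a recommendation to the user and not an automatic fix.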

>> Most disk errors /are/ detectable, and are reported by the underlying
>> hardware - small surface errors are corrected by the disk's own error
>> checking and correcting mechanisms, and larger errors are usually
>> detected.  It is (or should be!) very rare that a read error goes
>> undetected without there being a major problem with the disk controller.
>>  And if the error is detected, then the normal raid processing kicks in
>> as there is no doubt about which block has problems.
> 
> That's clear. That case is an "erasure" (I think)
> and it is perfectly in line with the usual operation.
> I'm not trying to replace this mechanism.
>  
>> If you can be /sure/ about which data block is incorrect, then I agree -
>> but you can't be /entirely/ sure.  But I agree that you can make a good
>> enough guess to recommend a fix to the user - as long as it is not
>> automatic.
> 
> One typical case is when many errors are
> found, belonging to the same disk.
> This case clearly shows the disk is to be
> replaced or the interface checked...
> But, again, the user is the master, not the
> machine... :-)

I don't know what sort of interface you have for the user, but I guess
that means you'll have to collect a number of failures before showing
them, so that the user can see that they correlate with one disk.
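
Something along these lines is what I have in mind for collecting the
failures first. It is just a sketch: the (stripe, disk) pairs, the
function name and the threshold of 3 are all made up for illustration.

```python
from collections import Counter

def summarize_by_disk(findings, min_count=3):
    """Group per-stripe findings by suspect disk.

    findings: iterable of (stripe_number, suspect_disk) pairs
              (a hypothetical report format).
    Returns (disk, count) pairs, most frequent first, for disks that
    were flagged at least min_count times.
    """
    counts = Counter(disk for _stripe, disk in findings)
    return [(disk, n) for disk, n in counts.most_common() if n >= min_count]
```

The user would then see one line per suspicious disk instead of a long
list of individual stripes, and can decide whether to replace the disk
or check the interface.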

>  
>> For most ECC schemes, you know that all your blocks are set
>> synchronously - so any block that does not fit in, is an error.  With
>> raid, it could also be that a stripe is only partly written - you can
> 
> Could it be?
> I would consider this an error.

It could occur as the result of a failure of some sort (kernel crash,
power failure, temporary disk problem, etc.).  More generally, md raid
doesn't have to be on local physical disks - maybe one of the "disks" is
an iSCSI drive or something else over a network that could have failures
or delays.  I haven't thought through all cases here - I am just
throwing them out as possibilities that might cause trouble.

> The stripe must always be consistent, there
> should be a transactional mechanism to make
> sure that, if read back, the data is always
> matching the parity.
> When I write "read back" I mean from whatever
> the data is: physical disk or cache.
> Otherwise, the check must run exclusively on
> the array (no mounted FS, no other things
> running on it).
> 
>> have two different valid sets of data mixed to give an inconsistent
>> stripe, without any good way of telling what consistent data is the best
>> choice.
>>  
>> Perhaps a checking tool can take advantage of a write-intent bitmap (if
>> there is one) so that it knows if an inconsistent stripe is partly
>> updated or the result of a disk error.
> 
> Of course, this is an option, which should be
> taken into consideration.
> 
> Any improvement idea is welcome!!!
> 
> bye,
> 
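
As a very rough sketch of the write-intent bitmap idea: if the tool can
get hold of the set of bitmap chunks currently marked dirty, it could
label an inconsistent stripe as either a possible partial write or a
suspected disk error. How that set would be obtained is left open here;
the function name, its parameters and the labels are all hypothetical.

```python
def classify_inconsistent_stripe(stripe_offset, dirty_chunks, bitmap_chunk_size):
    """Label an inconsistent stripe using write-intent bitmap state.

    stripe_offset:     start of the stripe, in sectors (assumed unit)
    dirty_chunks:      set of bitmap chunk indices currently marked dirty
    bitmap_chunk_size: sectors covered by one bitmap bit
    """
    chunk = stripe_offset // bitmap_chunk_size
    if chunk in dirty_chunks:
        # A write was pending or recent here, so the mismatch may just
        # be a partly updated stripe.
        return "possibly-partial-write"
    # No write was outstanding, so the data changed behind md's back.
    return "suspected-disk-error"
```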
