On 15/12/2011 05:27, NeilBrown wrote:
> On Thu, 15 Dec 2011 11:15:42 +0800 Tao Guo <glorioustao@xxxxxxxxx> wrote:
>> Hi all,
>> Today I checked the raid6 scribbling code, and I think there may be
>> some improvements possible for handling bad sectors:
> I think the common term is "scrubbing" rather than "scribbling".
>> If we have one bad sector (with corrupted data) in a data block, the
>> scribbling thread will find a parity mismatch and will try to
>> recompute & rewrite P & Q, but that will cause permanent data loss.
>> Since we have P & Q, we could actually try to use them to find the
>> wrong data block and then fix it.
>> But the algorithm to find the bad data block does not seem
>> straightforward... Does anyone know of a paper that has discussed
>> this issue before?
>> Update: I just found a talk about this from FAST08:
>> http://www.usenix.org/events/fast08/tech/krioukov.html.
>> But that approach adds checksums etc. For bare-bones raid6, does
>> any guru have a better idea?
> http://neil.brown.name/blog/20100211050355
I've read this before, but I've had another think about it in the case
of RAID6.
I agree that there is little point in a complex repair algorithm in most
cases - without additional knowledge of the ordering of the writes in a
stripe, you don't know which blocks are new data, and which are old
data. So even if you have a 5-way mirror and 4 of them agree, it could
still be the fifth block that is the new data. As you say in your blog,
the only correct thing to do is a simple repair to get consistency, and
let the filesystem figure out if the data is usable or not.
The only situation in which a smart repair could make sense is a RAID6
array that is off-line (as you say, you never want to risk
changing data that the filesystem may already have read - thus any
change to the data blocks in the stripe must be done off-line). If a
RAID6 stripe is read, and found to be inconsistent, it is possible to
try to find a consistent subset. The algorithm to do so is quite
straightforward, if a little slow (I'll explain it if anyone is
interested). And if such a consistent sub-stripe is found, then you can
be very confident that this means the data is correct (the chances of it
not being correct are truly minuscule, not just "small"), and the single
block is inconsistent.
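
To make that concrete, here is a rough sketch of such an algorithm.
This is illustrative Python, not the md driver's code, and the helper
names are made up; it does assume the same GF(2^8) arithmetic
(polynomial 0x11d, generator {02}) that the Linux raid6 code is built
on. With data blocks D_0..D_{n-1}, P is the plain XOR of the D_i and Q
is the sum of g^i * D_i over GF(2^8). If exactly one data block D_z has
picked up an error E, the P syndrome works out to E and the Q syndrome
to g^z * E, so dividing one syndrome by the other recovers z:

# GF(2^8) tables for the raid6 field: polynomial 0x11d, generator {02}.
def build_tables():
    exp, log = [0] * 512, [0] * 256
    x = 1
    for i in range(255):
        exp[i] = x
        log[x] = i
        x <<= 1
        if x & 0x100:
            x ^= 0x11d              # reduce modulo the field polynomial
    for i in range(255, 512):
        exp[i] = exp[i - 255]       # doubled table: exp[a + b] needs no wrap
    return exp, log

EXP, LOG = build_tables()

def locate_bad_block(data, p, q):
    """data: list of equal-length byte strings D_0..D_{n-1}; p, q: the
    stored parity blocks.  Returns the index of the single inconsistent
    block (n means P itself, n + 1 means Q itself), or None if the
    stripe is consistent.  Raises ValueError if no single block
    explains the mismatch."""
    n = len(data)
    candidate = None
    for off in range(len(p)):
        ps, qs = p[off], q[off]             # start from the stored parity
        for i, d in enumerate(data):
            ps ^= d[off]                    # P syndrome: plain XOR
            if d[off]:
                qs ^= EXP[LOG[d[off]] + i]  # Q syndrome: add g^i * D_i
        if ps == 0 and qs == 0:
            continue                        # this byte position is fine
        if ps and qs:
            z = (LOG[qs] - LOG[ps]) % 255   # g^z = Qsyndrome / Psyndrome
            if z >= n:
                raise ValueError("not explainable by one bad block")
        elif ps:
            z = n                           # only P disagrees: P is bad
        else:
            z = n + 1                       # only Q disagrees: Q is bad
        if candidate is None:
            candidate = z
        elif candidate != z:
            raise ValueError("two different blocks implicated")
    return candidate

A byte position whose syndromes implicate an index >= n, or two
positions that implicate different blocks, means more than one block is
bad, and the sketch gives up - which is exactly the case where you fall
back to the simple repair.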
The question then is what could lead to a stripe with one inconsistent
block. Perhaps only the one block had been written before power
failure, or perhaps only the one block was still to be written before
the crash. Or perhaps, in the case of a single-block partial write to
the stripe, the data block had been written but neither of the parity
blocks had been updated. Or perhaps that one block suffered some sort
of error (read or write) that was not flagged by the disk controller.
It is therefore often better, or at least no worse, to change this
block when making the stripe consistent, rather than picking on the P
and Q blocks.
When doing online scrubbing, priority has to go to giving the filesystem
consistent data - this means using the "simple" approach of
regenerating P and Q. That is also by far the faster method, which is
a useful bonus when online and working.
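
For comparison, the "simple" repair amounts to nothing more than the
following (again just a sketch, reusing the EXP/LOG tables and
locate_bad_block() from the sketch above):

def regenerate_parity(data):
    """The "simple" repair: trust the data blocks unconditionally and
    recompute both parity blocks from them."""
    p, q = bytearray(len(data[0])), bytearray(len(data[0]))
    for off in range(len(data[0])):
        acc_p = acc_q = 0
        for i, d in enumerate(data):
            acc_p ^= d[off]                     # P = XOR of the data blocks
            if d[off]:
                acc_q ^= EXP[LOG[d[off]] + i]   # Q = sum of g^i * D_i
        p[off], q[off] = acc_p, acc_q
    return bytes(p), bytes(q)

# Tiny demonstration: build a consistent stripe, corrupt D_2, and
# check that the "smart" sketch fingers the right block.
data = [bytes((i * 31 + j) & 0xff for j in range(16)) for i in range(4)]
p, q = regenerate_parity(data)
damaged = list(data)
damaged[2] = bytes(b ^ 0x5a for b in damaged[2])
assert locate_bad_block(damaged, p, q) == 2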
So if we assume that "smart" recovery of RAID6 during offline resyncs is
technically better on average than "simple" recovery, the question then
becomes one of whether it is worth implementing. To answer that, we
have to look at the situations where it would be useful, the complexity
of the code, and the benefits of having it.
As explained in the blog entry, there are a few cases that can lead to
inconsistent stripes: either serious hardware errors or serious
administrator errors, neither of which benefits much from any sort of
automatic recovery, or crashes and power failures in the middle of a
stripe write. Assuming that people who pay for the reliability of
RAID6 in extra disk costs also often buy a UPS and pick stable kernels
and distributions, such events are going to be very rare.
The code won't be too complex - but it still means extra code and extra
work in testing.
We already know that the filesystem will have to cope with bad stripes
anyway - a journalled filesystem will know that the stripe is
questionable after a crash, and will ensure that the metadata is
consistent. But smart recovery increases the chance that file data is
also saved. Of course, it gives no guarantees - sometimes it will
roll back a single-block change, sometimes it will complete the stripe
write.
Ultimately, while I think such "smart" recovery would give slightly
better data recovery on average, its worst-case behaviour is no worse
than that of the "simple" recovery, and it would very rarely be
triggered in practice. There are many other features on the raid
developers' "things to do" lists that would be of much more benefit
than implementing this.
Of course, there is another resync/repair strategy that is even simpler
than the "simple" one used today, and even faster. If a stripe is found
to be inconsistent, we could simply zero out the whole stripe. It would
be no worse than the worst-case outcome of any other algorithm. It
is also arguably better to give the filesystem no data rather than
possibly bad data.