Re: Data scribbling for raid6 is wrong?

On 15/12/2011 05:27, NeilBrown wrote:
On Thu, 15 Dec 2011 11:15:42 +0800 Tao Guo <glorioustao@xxxxxxxxx> wrote:

Hi all,

Today I checked the raid6 scribbling code, and I think there may be
some improvements possible in handling bad sectors:

I think the common term is "scrubbing" rather than "scribbling".


If we have one bad sector (with corrupted data) in the data block, the
scribbling thread will find a parity mismatch and will try to
recompute & rewrite P & Q, but that will cause final data loss.
Since we have P & Q, we can actually try to use them to find the
wrong data block and then fix it.

But the algorithm to find the bad data block does not seem
straightforward... Does anyone know if there is any paper that has
discussed this issue before?

Update: I just found there is a talk about this from FAST08:
http://www.usenix.org/events/fast08/tech/krioukov.html.
But that approach adds checksums, etc. For bare-bones raid6, does
any guru have any better ideas?

http://neil.brown.name/blog/20100211050355


I've read this before, but I've had another think about it in the case of RAID6.

I agree that there is little point in a complex repair algorithm in most cases - without additional knowledge of the ordering of the writes in a stripe, you don't know which blocks are new data, and which are old data. So even if you have a 5-way mirror and 4 of them agree, it could still be the fifth block that is the new data. As you say in your blog, the only correct thing to do is a simple repair to get consistency, and let the filesystem figure out if the data is usable or not.

The only situation when a smart repair could make sense is for a RAID6 array, when the array is off-line (as you say, you never want to risk changing data that the filesystem may already have read - thus any change to the data blocks in the stripe must be done off-line). If a RAID6 stripe is read, and found to be inconsistent, it is possible to try to find a consistent subset. The algorithm to do so is quite straightforward, if a little slow (I'll explain it if anyone is interested). And if such a consistent sub-stripe is found, then you can be very confident that this means the data is correct (the chances of it not being correct are truly minuscule, not just "small"), and the single block is inconsistent.
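For anyone interested, here is a minimal sketch of one way to do it, applied to a single byte offset within the stripe (a real implementation would run across whole blocks and insist on the same answer at every offset). The helper names are invented for the example, and this is only an illustration of the idea, not md's actual code:

/* GF(2^8) multiply with the RAID6 polynomial 0x11d. */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
	uint8_t r = 0;

	while (b) {
		if (b & 1)
			r ^= a;
		a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1d : 0));
		b >>= 1;
	}
	return r;
}

/* P = xor of data bytes, Q = sum over i of g^i * d[i], with g = 2. */
static void compute_pq(const uint8_t *d, int n, uint8_t *p, uint8_t *q)
{
	uint8_t g = 1;
	int i;

	*p = 0;
	*q = 0;
	for (i = 0; i < n; i++) {
		*p ^= d[i];
		*q ^= gf_mul(g, d[i]);
		g = gf_mul(g, 2);
	}
}

/*
 * Assume at most one block of the stripe is wrong.  Returns -1 if the
 * stripe is already consistent, n if only P looks wrong, n+1 if only Q
 * looks wrong, 0..n-1 for a single bad data block, or -2 if no
 * single-block change can make the stripe consistent.
 */
static int find_single_bad_block(const uint8_t *d, int n, uint8_t p, uint8_t q)
{
	uint8_t pc, qc, fix, tmp[255];	/* 255 covers any realistic data-disk count */
	int i, j;

	compute_pq(d, n, &pc, &qc);
	if (pc == p && qc == q)
		return -1;
	if (qc == q)
		return n;		/* data and Q agree: P is the odd one out */
	if (pc == p)
		return n + 1;		/* data and P agree: Q is the odd one out */

	/* Both differ: try assuming each data block in turn is the bad one. */
	for (i = 0; i < n; i++) {
		for (j = 0; j < n; j++)
			tmp[j] = d[j];
		/* rebuild block i from P and the other data blocks */
		fix = p;
		for (j = 0; j < n; j++)
			if (j != i)
				fix ^= d[j];
		tmp[i] = fix;
		compute_pq(tmp, n, &pc, &qc);
		if (qc == q)		/* pc == p by construction */
			return i;
	}
	return -2;
}

It is slow in the sense of needing O(n) reconstructions per stripe, but as far as I can see a false identification is impossible: if exactly one data block is wrong, rebuilding any other block from P merely moves the error to a different position, so Q cannot come out right by accident. One could instead solve for the bad disk directly from the two syndromes (as in hpa's "The mathematics of RAID-6" paper), but the brute-force version above is easier to follow.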

The question then is what could lead to a stripe with one inconsistent block. Perhaps only that one block had been written before a power failure, or perhaps only that one block was still to be written before the crash. Or perhaps, in the case of a single-block partial write to the stripe, the data block had been written but neither of the parity blocks had been updated. Or perhaps that one block suffered some sort of error (read or write) that was not flagged by the disk controller. It is therefore often better, or at least no worse, to change this block when making the stripe consistent, rather than picking on the P and Q blocks.

When doing online scrubbing, priority has to go to giving the filesystem consistent data - this means using the "simple" approach of re-generating P and Q. That is also by far the faster method, which is a useful bonus when online and working.
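In terms of the sketch above, the "simple" repair is nothing more than the following (again, purely illustrative):

/* "Simple" repair: trust the data blocks, regenerate both parity
 * blocks from them, and write the results back to the parity devices. */
static void simple_repair(const uint8_t *d, int n, uint8_t *p, uint8_t *q)
{
	compute_pq(d, n, p, q);
}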

So if we assume that "smart" recovery of RAID6 during offline resyncs is technically better on average than "simple" recovery, the question then becomes one of whether it is worth implementing. To answer that, we have to look at the situations where it would be useful, the complexity of the code, and the benefits of having it.

As explained in the blog entry, there are a few cases that can lead to inconsistent stripes: serious hardware errors or serious administrator errors (neither of which benefits much from any sort of automatic recovery), or a crash or power failure in the middle of a stripe write. Given that people who pay for the reliability of RAID6 in extra disk costs also tend to buy a UPS and pick stable kernels and distributions, such events are going to be very rare.

The code won't be too complex - but it still means extra code and extra work in testing.

We already know that the filesystem will have to cope with bad stripes anyway - a journalled filesystem will know that the stripe is questionable after a crash, and will ensure that the metadata is consistent. But smart recovery increases the chance that file data is also saved. Of course, it gives no guarantees - sometimes it will roll back a single-block change, sometimes it will complete the stripe write.

Ultimately, while I think such "smart" recovery would give slightly better data recovery on average, its worst-case behaviour is not any worse than the "simple" recovery, and it would very rarely be triggered in practice. There are many other features on the raid developers' "things to do" lists that will be of much more benefit than implementing this.



Of course, there is another resync/repair strategy that is even simpler than the "simple" one used today, and even faster. If a stripe is found to be inconsistent, we could simply zero out the whole stripe. It would be no worse than the worst-case situation with any other algorithm. It is also arguably better to give the filesystem no data rather than possibly bad data.




