On Wed, Jan 27, 2010 at 7:34 AM, Goswin von Brederlow <goswin-v-b@xxxxxx> wrote:
> Asdo <asdo@xxxxxxxxxxxxx> writes:
> If you kick off a read-error-recovery and get another error on another
> drive then your raid will be down as well. Better not risk that.
>

I mostly only disagree with this point; everything else is more a choice of tuning, and different applications have different desires. If a read is late, it might be a good idea to force a full stripe recheck and alert the administrator about the latency/failure.

Current RAID levels have no way of validating blocks individually, or even as part of a larger data set other than the stripe; RAID 5 has only one set of redundant data and no way of determining for sure which data unit is bad. Likely the 'slow' drive should be presumed bad and the rest of the stripe recalculated. If the drive does return data, it should be compared against the calculation's result. If the data matches, the drive managed a clean read, but a re-write should be issued anyway to ensure that the data really is over-written.

The problem is the case where the data doesn't match: with a single recovery stripe and no validation, we lack another parity chunk to check the computed recovery data against. Suddenly we have two potentially valid solutions and no way of determining which is correct. The question is which source you believe is less likely to have an error: a single, potentially faulty drive that has returned a maybe error-corrected read (which has certain odds of being correct), or a set of other drives that happened to return data a little more quickly, but which each also carry an inherent (though smaller, per drive) risk of error. My gut reaction is that the nominally timed and responding drives /likely/ are the correct source, but that there is still an unquantified risk of failure.

Of course, this is also a slightly moot point; in the above model we'd have received one answer or the other first and already passed it to the next layer. A simpler approach would be to presume a soft failure on timeout and unconditionally follow the re-computation path, only dropping it if one of those drives also had an error or timeout.

The highly risk-averse or truly paranoid might willingly sacrifice a little more storage to 'make extra sure' that the data is correct, which would greatly simplify the risks outlined above. Going with the default chunk size of 64k, that's 128 x 512-byte sectors. One 512-byte checksum sector covers 512 / 4 = 128 chunks with 32-bit checksums, 512 / 16 = 32 chunks with 128-bit checksums, or 16 chunks with 256-bit ones. The proposed reduction in capacity, respectively (and truncated, not rounded), is to 99.993%, 99.975%, and 99.951% of the otherwise usable storage space. (Using (SectorsPerChunk*ChunksThatFit)/(SectorsPerChunk*ChunksThatFit+1) to give the data that would fit in the same sized 'old space', but first making the 'old space' one sector larger to simplify the calculation. Obviously the offsets and packing change if you choose a larger sector size but keep the chunk size the same.)

With multi-core systems I don't see a major downside in the extra processing workload, but I do see a disk bottleneck: the most natural read pattern would be to slurp up the remaining sectors up to the checksum payload on each drive. For large streamed files the effect is likely good; for databases and small-file tasks that cache poorly it would probably be bad.
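To make that arithmetic easy to check, something like the quick throwaway calculation below reproduces the figures. It is nothing MD-specific, just the ratio from the parenthetical above; truncate the printed percentages to three places to get the numbers I quoted.

    #include <stdio.h>

    int main(void)
    {
        const int sectors_per_chunk = 128;       /* 64k chunk / 512-byte sectors */
        const int csum_bits[] = { 32, 128, 256 };

        for (int i = 0; i < 3; i++) {
            /* checksums (and therefore chunks covered) per 512-byte sector */
            int chunks_that_fit = 512 / (csum_bits[i] / 8);
            double data = (double)sectors_per_chunk * chunks_that_fit;

            /* one extra sector per chunks_that_fit chunks */
            printf("%3d-bit: %3d chunks per checksum sector, %.5f%% usable\n",
                   csum_bits[i], chunks_that_fit,
                   100.0 * data / (data + 1.0));
        }
        return 0;
    }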
Just arbitrarily making the chunk one sector larger only drops the storage ratio to 99.224% and eliminates that read-pattern problem (plus provides the remainder of those sectors to dedicate to additional payload). The error detection/recovery payload would be better expressed in units of the device block size (the minimum, and most frequently occurring, case), but tunable up from there just as the chunk size is. That way it could also work for media that doesn't suffer seek time but does have data-alignment issues.

From a logical view, both the existing MD/RAID methods and these additional validity/recovery methods (likely what the unused space in the per-chunk checksum storage model could be filled with) should live in some kind of block-device reliability layer. The type of reliability MD currently provides would be the simple whole-partition/device approach; the kind BTRFS and ZFS aim for is the more granular approach. Obviously this is a place where code and complexity could be reduced by consolidating behind a common interface (a very rough sketch of what I mean is at the end of this mail), and additional related improvements could be exposed via that interface as well.

Is that something others would value as a valid contribution? I'm actually thinking of looking into this, but don't really want to expend the effort if it's unlikely to be anything more than a local patch anchoring me to the past (at least not unless I'm being paid to continually port it forward).
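For concreteness, the kind of common interface I have in mind would look very roughly like the sketch below. None of these names exist anywhere today; they are purely illustrative of the shape such a layer might take.

    #include <stdint.h>

    /* Purely hypothetical: a shared "reliability layer" interface that
     * whole-device redundancy and per-chunk validation could both plug into. */

    struct chunk_ref {
        int      device;        /* index of the member device */
        uint64_t start_sector;  /* first sector of the chunk */
        uint32_t sectors;       /* chunk length in sectors */
    };

    struct reliability_ops {
        /* Check one chunk against whatever validation data the backend
         * keeps (parity across the stripe, a per-chunk checksum, ...).
         * Returns 0 if it looks good, non-zero if it needs repair. */
        int (*verify)(void *backend, const struct chunk_ref *chunk);

        /* Rebuild the chunk from redundancy and re-write it in place. */
        int (*repair)(void *backend, const struct chunk_ref *chunk);

        /* Extra sectors per chunk the backend reserves for its own
         * validation payload (0 for plain parity-only RAID). */
        uint32_t (*payload_sectors)(void *backend, uint32_t chunk_sectors);
    };

A whole-device backend (what MD does today) would implement verify/repair in terms of the stripe, while a per-chunk checksum backend would implement them against its stored checksums; everything above the layer would stay the same.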