On Thu, Sep 5, 2019 at 10:15 AM Robin Hill <robin@xxxxxxxxxxxxxxx> wrote:
>
> I'm not clear what you (or the author of the article) are expecting
> here. You've got a disk (or disks) with thousands of read errors -
> whether these are dm-integrity mismatches or disk-level read errors
> doesn't matter - the disk is toast and needs replacing ASAP (or it's in
> an environment where it - and you - probably shouldn't be).

That sounds to me like a policy question. The kernel code should be
able to handle the errors, including rate limiting them if they are
massive. Whether the limit is X errors per unit of time, or a Y:Z ratio
of bad to good sectors read, is policy, and it's reasonable for md
developers to pick a sane default for that policy (a rough sketch of
what I mean is below). But to just declare that thousands of
corruptions are inherently a device failure, when easily a million more
reads in the same time frame are good? You'd be giving up a better
chance of recovery during rebuilds/device replacements by flat-out
ejecting such a device.

Also, the device could be network-attached, and the errors could be
transient. Or the problem could be discovered and fixed well before the
device is ejected, and the device manually re-added and rebuilt.

> Admittedly, with dm-integrity we can probably trust that anything read
> from the disk which makes it past the integrity check is valid, so there
> may be cases where the data on there is needed to complete a stripe.
> That seems a rather theoretical and contrived circumstance though - in
> most cases you're better just kicking the drive from the array so the
> admin knows that it needs replacing.

I don't agree that a heavy hammer is needed in order to send a
notification.

--
Chris Murphy
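
To make the policy question concrete, here is a minimal user-space
sketch of the kind of threshold check described above. The names,
numbers, and structure are all made up for illustration; this is not
md's actual error-handling code or interface, just one way an
"X errors per unit time, or Y:Z bad-to-good ratio" default could be
expressed.

/*
 * Hypothetical sketch of an error-rate eject policy, not actual md code.
 * The thresholds are illustrative defaults; in a real implementation
 * they would be tunables, not hard-coded constants.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct error_stats {
	uint64_t window_secs;      /* length of the observation window */
	uint64_t errors_in_window; /* read errors / integrity mismatches seen */
	uint64_t good_reads;       /* successful reads in the same window */
};

/* Illustrative defaults: eject only if errors are both frequent and a
 * large fraction of total reads; otherwise just notify the admin. */
#define MAX_ERRORS_PER_SEC   100
#define MAX_BAD_PER_MILLION  10000   /* 1% of reads failing */

static bool should_eject(const struct error_stats *s)
{
	uint64_t total = s->errors_in_window + s->good_reads;

	if (s->window_secs == 0 || total == 0)
		return false;

	/* Rate test: X errors per unit of time. */
	if (s->errors_in_window / s->window_secs > MAX_ERRORS_PER_SEC)
		return true;

	/* Ratio test: Y bad reads per Z total reads. */
	if (s->errors_in_window * 1000000 / total > MAX_BAD_PER_MILLION)
		return true;

	return false;
}

int main(void)
{
	/* Thousands of mismatches, but a million good reads alongside them:
	 * under this policy the device stays in and the admin is notified. */
	struct error_stats s = {
		.window_secs = 60,
		.errors_in_window = 5000,
		.good_reads = 1000000,
	};

	printf("eject: %s\n", should_eject(&s) ? "yes" : "no, notify only");
	return 0;
}

The point isn't the specific numbers; it's that they are a tunable
default rather than an implicit "any large error count means eject."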