On Thu Sep 05, 2019 at 12:19:19PM -0600, Chris Murphy wrote:

> On Thu, Sep 5, 2019 at 10:15 AM Robin Hill <robin@xxxxxxxxxxxxxxx> wrote:
> >
> > I'm not clear what you (or the author of the article) are expecting
> > here. You've got a disk (or disks) with thousands of read errors -
> > whether these are dm-integrity mismatches or disk-level read errors
> > doesn't matter - the disk is toast and needs replacing ASAP (or it's in
> > an environment where it - and you - probably shouldn't be).
>
> That sounds to me like a policy question. The kernel code should be
> able to handle the errors, including even rate limiting if the errors
> are massive. It's a policy question whether X number errors per unit
> time, or Y:Z ratio bad to good sectors have been read, is the limit.
> And it's reasonable for md developers to pick a sane default for that
> policy. But to just say 1000's of corruptions are inherently a device
> failure, when easily 1 million more in the same time frame are good?
> You'd be giving up a better chance of recovery during rebuilds/device
> replacements by flat out ejecting such a device. Also the device could
> be network. It could be transient. Or the problem discovered and fixed
> way before the device is ejected, and manually readded and rebuilt.
>

It's definitely a policy question, yes, and more flexibility in how
these errors are handled would indeed be good. The specific cases here
are thousands of integrity mismatches artificially introduced into
sequential blocks covering half the device though. I don't see any
reasonable error-handling method doing anything other than kicking the
drive in that case. Ignoring them on the basis that they're dm-integrity
mismatches rather than read errors reported from the drive does not
sound like the right fix (unless we're expecting dm-integrity, or the
block layer generally, to have built-in error counting and
device-failing?).

Also, a transient issue of this size is likely to cause the drive to be
kicked anyway - don't forget that each of these read errors will trigger
a write, and if that fails the drive is kicked regardless of whether
it's the first error or the thousandth.

> > Admittedly, with dm-integrity we can probably trust that anything read
> > from the disk which makes it past the integrity check is valid, so there
> > may be cases where the data on there is needed to complete a stripe.
> > That seems a rather theoretical and contrived circumstance though - in
> > most cases you're better just kicking the drive from the array so the
> > admin knows that it needs replacing.
>
> I don't agree that a heavy hammer is needed in order to send a notification.
>

You think that most people using this will be monitoring for
dm-integrity reported errors? If all the errors are just rewritten
silently then it's likely the only sign of an issue will be a
performance impact, with no obvious sign as to where it's coming from.

Cheers,
    Robin
-- 
     ___
    ( ' }     |       Robin Hill        <robin@xxxxxxxxxxxxxxx> |
   / / )      | Little Jim says ....                            |
  // !!       |      "He fallen in de water !!"                 |