On Thu, Sep 5, 2019 at 10:15 AM Robin Hill <robin@xxxxxxxxxxxxxxx> wrote:
>
> I'm not clear what you (or the author of the article) are expecting
> here. You've got a disk (or disks) with thousands of read errors -
> whether these are dm-integrity mismatches or disk-level read errors
> doesn't matter - the disk is toast and needs replacing ASAP (or it's in
> an environment where it - and you - probably shouldn't be).

That sounds to me like a policy question. The kernel code should be
able to handle the errors, including rate limiting them if they are
massive. Whether the limit is X errors per unit of time, or a Y:Z ratio
of bad to good sectors read, is policy, and it's reasonable for md
developers to pick a sane default for that policy (a rough sketch of
what I mean is below). But to just declare that thousands of
corruptions are inherently a device failure, when easily a million more
reads in the same time frame are good? You'd be giving up a better
chance of recovery during rebuilds/device replacements by flat-out
ejecting such a device.

Also, the device could be network-attached, and the errors could be
transient. Or the problem could be discovered and fixed well before the
device is ejected, and the device manually re-added and rebuilt.

> Admittedly, with dm-integrity we can probably trust that anything read
> from the disk which makes it past the integrity check is valid, so there
> may be cases where the data on there is needed to complete a stripe.
> That seems a rather theoretical and contrived circumstance though - in
> most cases you're better just kicking the drive from the array so the
> admin knows that it needs replacing.

I don't agree that a heavy hammer is needed in order to send a
notification.

--
Chris Murphy
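
To make the policy question concrete, here is a minimal user-space
sketch of the kind of threshold check described above. The names,
numbers, and structure are all made up for illustration; this is not
md's actual error-handling code or interface, just one way an
"X errors per unit time, or Y:Z bad-to-good ratio" default could be
expressed.

/*
 * Hypothetical sketch of an error-rate eject policy, not actual md code.
 * The thresholds are illustrative defaults; in a real implementation
 * they would be tunables, not hard-coded constants.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct error_stats {
	uint64_t window_secs;      /* length of the observation window */
	uint64_t errors_in_window; /* read errors / integrity mismatches seen */
	uint64_t good_reads;       /* successful reads in the same window */
};

/* Illustrative defaults: eject only if errors are both frequent and a
 * large fraction of total reads; otherwise just notify the admin. */
#define MAX_ERRORS_PER_SEC   100
#define MAX_BAD_PER_MILLION  10000   /* 1% of reads failing */

static bool should_eject(const struct error_stats *s)
{
	uint64_t total = s->errors_in_window + s->good_reads;

	if (s->window_secs == 0 || total == 0)
		return false;

	/* Rate test: X errors per unit of time. */
	if (s->errors_in_window / s->window_secs > MAX_ERRORS_PER_SEC)
		return true;

	/* Ratio test: Y bad reads per Z total reads. */
	if (s->errors_in_window * 1000000 / total > MAX_BAD_PER_MILLION)
		return true;

	return false;
}

int main(void)
{
	/* Thousands of mismatches, but a million good reads alongside them:
	 * under this policy the device stays in and the admin is notified. */
	struct error_stats s = {
		.window_secs = 60,
		.errors_in_window = 5000,
		.good_reads = 1000000,
	};

	printf("eject: %s\n", should_eject(&s) ? "yes" : "no, notify only");
	return 0;
}

The point isn't the specific numbers; it's that they are a tunable
default rather than an implicit "any large error count means eject."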