On 02/02/18 15:40, David Brown wrote:
> On 02/02/18 16:03, Wols Lists wrote:
>> On 02/02/18 14:50, David Brown wrote:
>>> What are these cases?  We have already eliminated the rebuild situation
>>> I described.  And in particular, which use-cases are you thinking of
>>> where you not be better off with alternative integrity improvements
>>> (like higher redundancy levels) without killing performance?
>>>
>> In particular, when you KNOW you've got a damaged raid, and you want to
>> know which files are affected. The whole point of my technique is that
>> either it uses the raid to recover (if it can) or it propagates a read
>> error back to the application. It does NOT "fix" the data and leave a
>> corrupted file behind.
>
> If you read a block and the read fails, the raid system will already
> read the whole stripe to re-create the missing data.  If it can
> re-create it, it writes the new data back to the disk and returns it to
> the application.  If it cannot, it gives the read error back to the
> application.
>
> I cannot imagine a situation where you would have a disk that you know
> has incorrect data, as part of your array and in normal use for a file
> system.

Can't you? When I was discussing this originally I had a bunch of
examples given to me. Let's take just one, which as far as I can tell is
real, and is probably far more common than system developers would like
to admit.

A drive glitches, and writes a load of data - intended for let's say
track 1398 - to track 1938 by mistake. Okay, that particular example is
a decimal blunder, and a drive would probably make a bit-flip mistake
instead, but writing data to the wrong place is apparently a
well-recognised intermittent failure mode. (And it's not even always
hardware to blame - just an unfortunate cosmic ray incident.)

Or - and it was reported on this list - a drive suffers a power glitch
and dumps the entire contents of its write buffer.

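To make the misdirected-write case concrete, here is a toy sketch in
Python. Everything in it is made up for illustration (the dict-of-tracks
"drives", write_stripe, the misdirect parameter) and it has nothing to do
with real drive firmware or the md code. It only shows that every read
still succeeds, while two stripes no longer match their parity.

```python
from functools import reduce

BLOCK = 4  # bytes per block, tiny on purpose


def xor_blocks(blocks):
    """XOR equal-length byte blocks together (raid-5 style parity)."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))


def blank():
    """A hypothetical drive: just the two tracks we care about."""
    return {1398: bytes(BLOCK), 1938: bytes(BLOCK)}


data = [blank() for _ in range(3)]  # three data drives
parity = blank()                    # one parity drive


def write_stripe(track, blocks, misdirect=None):
    """Write one stripe plus its parity.

    misdirect=(drive, wrong_track) simulates a glitch: that drive puts
    its block on the wrong track and reports success anyway.
    """
    for i, blk in enumerate(blocks):
        target = track
        if misdirect is not None and misdirect[0] == i:
            target = misdirect[1]
        data[i][target] = blk
    parity[track] = xor_blocks(blocks)


def stripe_consistent(track):
    """True if the data blocks on this track still XOR to the parity."""
    return xor_blocks([d[track] for d in data]) == parity[track]


# A good write to track 1938, then a write meant for 1398 where drive 1
# glitches and lands its block on 1938 instead.
write_stripe(1938, [b"aaaa", b"bbbb", b"cccc"])
write_stripe(1398, [b"dddd", b"eeee", b"ffff"], misdirect=(1, 1938))

# No read fails, so nothing ever triggers the rebuild-from-stripe path.
print(data[1][1398])            # stale block, returned without any error
print(data[1][1938])            # b'eeee', which belongs to another stripe
print(stripe_consistent(1398))  # False
print(stripe_consistent(1938))  # False
```

Note that plain XOR parity can say a stripe is inconsistent, but not
which block in it is the wrong one.
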
Either way, we now have a raid array which APPEARS to be functioning
normally, and a bunch of stripes are corrupt. If you're lucky (and yes,
this does seem to be the normal state of affairs) then it's just the
parity which has been corrupted, which a scrub will fix. But if it's not
the parity, then with raid-1 and raid-5 you can kiss your data bye-bye,
and with raid-6, a scrub will send your data to data heaven.

And saying "it's never happened to me" doesn't mean it's never happened
to anyone else.

Let's go back a few years, to the development of the ext file system
from version 2 to version 4. I can't remember the exact saying, but it's
something along the lines of "premature optimisation is the root of all
evil".

When an ext2 system crashed, you could easily spend hours running fsck
before the system was usable. So the developers developed ext3, with a
journal. By chance, this always wrote the data blocks before the
journal, so when the system crashed, the journal fixed the file system,
and the users were very happy they didn't need a fsck.

Then the developers decided to optimise further into ext4 and broke the
link between data and journal! So now, an ext4 system might boot faster
after a crash, shaving seconds off journal replay time. But the system
took MUCH LONGER to be available to users, because now the filesystem
corrupted user data, and instead of running the system level fsck, users
had to replace it with an application data integrity tool.

So yes, my "integrity checking raid" might be slow. Which is why it
would be disabled by default, and require flipping a runtime switch to
enable it. But it's a hell of a lot faster than an "mkfs and reload from
backup", which is the alternative if your disk is corrupt (as opposed to
crashed and dead). And my way gives you a list of corrupted files that
need restoring, as opposed to "scrub, fix, and cross your fingers".

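For what it's worth, the "recover if it can, otherwise return a read
error" behaviour can be sketched for raid-6 in a few lines of Python.
This is a toy under stated assumptions, not the md implementation: the
names (compute_pq, checked_read, StripeError) are invented here, and it
uses the usual P/Q syndrome algebra over GF(2^8), which as far as I know
is the same idea raid6check relies on to point at a single bad block.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8), reducing by the polynomial 0x11d."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11d
        b >>= 1
    return r


def gf_pow2(i):
    """Return 2**i in GF(2^8); this is the Q coefficient for data disk i."""
    r = 1
    for _ in range(i):
        r = gf_mul(r, 2)
    return r


def compute_pq(data):
    """Return (P, Q) for a list of equal-length data blocks."""
    n = len(data[0])
    p, q = bytearray(n), bytearray(n)
    for i, blk in enumerate(data):
        c = gf_pow2(i)
        for j, byte in enumerate(blk):
            p[j] ^= byte
            q[j] ^= gf_mul(c, byte)
    return bytes(p), bytes(q)


class StripeError(IOError):
    """Hypothetical read error handed back to the application."""


def checked_read(data, p, q):
    """Verify a stripe on read; repair one bad data block or raise."""
    p2, q2 = compute_pq(data)
    sp = bytes(a ^ b for a, b in zip(p, p2))  # P syndrome
    sq = bytes(a ^ b for a, b in zip(q, q2))  # Q syndrome
    if not any(sp) and not any(sq):
        return list(data), "clean"
    if not any(sq):
        return list(data), "P was bad"        # data itself is fine
    if not any(sp):
        return list(data), "Q was bad"        # data itself is fine
    # One bad data block z means sq == 2**z * sp for every byte.
    for z in range(len(data)):
        c = gf_pow2(z)
        if all(gf_mul(c, a) == b for a, b in zip(sp, sq)):
            fixed = list(data)
            fixed[z] = bytes(a ^ b for a, b in zip(data[z], sp))
            return fixed, f"repaired data block {z}"
    raise StripeError("stripe inconsistent and not repairable")


good = [b"wols", b"raid", b"test"]
p, q = compute_pq(good)

bad = list(good)
bad[1] = b"XXXX"                     # silent corruption, no read error
fixed, status = checked_read(bad, p, q)
print(status, fixed == good)         # repaired data block 1 True
```

If more than one block in the stripe is damaged, the syndromes usually
don't line up with any single disk and the read is refused instead of
"fixed"; a rare coincidence could still fool it, which is one more reason
to treat this as a sketch. A scan built on the same check would give you
the list of stripes, and from there the files, that need restoring.
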
And one last question - if my idea is stupid, why did somebody think it
worthwhile to write raid6check?

Why is it that so many kernel level guys seem to treat user data
integrity with contempt?

Cheers,
Wol