On 9/16/2022 9:17 AM, Luck, Tony wrote:
>> Were you using madvise to inject an error to a mmap'ed address?
>> or a different tool? Do you still have the test documented
>> somewhere?
>
> I was injecting with ACPI/EINJ (so tweaking some ECC bits in memory to create
> a real uncorrectable error). This was a long time back when I was just trying to
> get basic recovery from usermode access to poison working reliably. So I just
> noted the workaround ("make; sync; run_test") to keep making progress.
>
> Handling poison in the page cache has been on my TODO list for a long time.
> Someday it will make it to the top.

I see, looking forward to your patches.

>
>> And, aside from verifying every write with a read prior to sync,
>> any suggestion to minimize the window of such corruption?
>
> There's no cheap solution. As you point out the best that can be done
> is to reduce the window (since bits may get flipped after you perform
> your check but before DMA to storage).

Sounds like the disk controller is the last in the chain in terms of
detecting a late UE, so if the disk controller detection could trickle
up to a filesystem level action marking 'file:<offset,len>' as bad
and relay the information to the user for repair, that might be reasonable...

thanks,
-jane

>
> -Tony
>
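
For illustration, the "verify every write with a read prior to sync" idea
discussed above could look roughly like the sketch below. None of this is
from the thread: the name write_verify_sync and the use of CRC32 for the
comparison are assumptions made for the example. The read-back is normally
served from the page cache, so it checks the cached copy rather than what
reaches the disk, and as noted above it only narrows the window, since bits
can still flip after the check but before DMA to storage.

```python
import os
import tempfile
import zlib


def write_verify_sync(path, data):
    """Write data, read it back, compare checksums, then fsync.

    Hypothetical helper for illustration only. Returns the CRC32 of the
    data on success and raises OSError if the read-back does not match.
    """
    expected = zlib.crc32(data)
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        # os.write may write fewer bytes than requested, so loop.
        view = memoryview(data)
        written = 0
        while written < len(data):
            written += os.write(fd, view[written:])

        # Read the same range back (usually from the page cache).
        os.lseek(fd, 0, os.SEEK_SET)
        actual = 0
        remaining = len(data)
        while remaining > 0:
            chunk = os.read(fd, remaining)
            if not chunk:
                break  # unexpected end of file
            actual = zlib.crc32(chunk, actual)
            remaining -= len(chunk)

        if remaining != 0 or actual != expected:
            raise OSError("read-back verification failed for " + path)

        # Only now ask the kernel to push the pages to storage.
        os.fsync(fd)
    finally:
        os.close(fd)
    return expected


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "payload.bin")
        write_verify_sync(target, b"example payload" * 1024)
        print("verified and synced", target)
```

On a real uncorrectable error the read-back may not return mismatching
data at all; the access may instead fail or deliver a signal to the
process, depending on how the kernel handles the poisoned page.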