Re: Is it possible to corrupt disk when writeback page with undetected UE?

Jane Chu <jane.chu@xxxxxxxxxx> · Fri, 16 Sep 2022 19:26:01 +0000

On 9/16/2022 9:17 AM, Luck, Tony wrote:
>> Were you using madvise to inject an error to a mmap'ed address?
>> or a different tool?  Do you still have the test documented
>> somewhere?
> 
> I was injecting with ACPI/EINJ (so tweaking some ECC bits in memory to create
> a real uncorrectable error). This was a long time back when I was just trying to
> get basic recovery from usermode access to poison working reliably. So I just
> noted the workaround ("make; sync; run_test") to keep making progress.
> 
> Handling poison in the page cache has been on my TODO list for a long time.
> Someday it will make it to the top.

I see, looking forward to your patches.

> 
>> And, aside from verifying every write with a read prior to sync,
>> any suggestion to minimize the window of such corruption?
> 
> There's no cheap solution. As you point out, the best that can be done
> is to reduce the window (since bits may get flipped after you perform
> your check but before DMA to storage).

Sounds like the disk controller is the last in the chain in terms
of detecting a late UE, so if the disk controller's detection could
trickle up to a filesystem-level action that marks 'file:<offset,len>'
as bad and relays the information to the user for repair, that might be
reasonable...
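
For the verify-before-sync check quoted above, here is a rough userspace
sketch, assuming Python on Linux. The helper name write_verify_sync is
hypothetical and not from any tool in this thread. It only narrows the
window: bits can still flip after the compare and before the DMA to
storage. How a read of a truly poisoned page surfaces (mismatch, I/O
error, or signal) depends on the platform and kernel, and the sketch does
not model that.

```python
import os


def write_verify_sync(path, data, offset=0):
    """Write data, read it back through the page cache, then fsync.

    Returns True if the read-back matched and fsync was issued,
    False if the read-back differed (fsync is skipped in that case).
    Hypothetical helper for illustration only.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # pwrite may write fewer bytes than requested, so loop.
        written = 0
        while written < len(data):
            written += os.pwrite(fd, data[written:], offset + written)

        # Read the same range back; this normally hits the page cache.
        chunks = []
        got = 0
        while got < len(data):
            chunk = os.pread(fd, len(data) - got, offset + got)
            if not chunk:  # unexpected EOF
                break
            chunks.append(chunk)
            got += len(chunk)

        if b"".join(chunks) != data:
            return False  # caller should rewrite or report, not sync

        # Only flush to storage once the cached copy has been checked.
        os.fsync(fd)
        return True
    finally:
        os.close(fd)
```

A caller would rewrite the range (or report the failure) when this
returns False, rather than syncing a copy that is known to be bad.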

thanks,
-jane

> 
> -Tony
>