On 6/24/2020 5:13 PM, Luck, Tony wrote:
Both the RFC patch and the above 5-step recovery plan look neat; step 4)
is nice to carry forward on Ice Lake, when a single instruction to clear
poison is available.
Jane,
Clearing poison has some challenges.
On persistent memory it probably works (as the DIMM is going to remap that address to a different
part of the media to avoid the bad spot).
On DDR memory you'd need to decide whether the problem was transient, so that a simple
overwrite fixes it, or persistent, in which case the problem will likely come back
with the right data pattern. To tell the two apart you may need to run a memory test on
the affected area.
If the error was just in a 4K page, I'd be inclined to copy the good data to a new page and
map that in instead. Throwing away one 4K page isn't likely to be painful.
If it is in a 2M/1G page ... perhaps it is worth the effort and risk of trying to clear the poison
in place to avoid the pain of breaking up a large page.
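
As a rough sketch of the policy described above (plain userspace C, not kernel
code): every name here is hypothetical and none of it is an existing kernel
interface. The handling of a persistent fault inside a large page is not
spelled out above; the sketch assumes it falls back to copying the data out.

```c
#include <assert.h>
#include <stddef.h>

/* All names below are hypothetical illustrations, not kernel APIs. */
enum mem_kind   { MEM_PMEM, MEM_DDR };
enum fault_kind { FAULT_UNKNOWN, FAULT_TRANSIENT, FAULT_PERSISTENT };
enum recovery {
	RECOVER_CLEAR_IN_PLACE,   /* overwrite to clear the poison, keep the page */
	RECOVER_COPY_TO_NEW_PAGE, /* copy good data elsewhere, retire the bad page */
	RECOVER_RUN_MEMORY_TEST,  /* cannot decide yet: test the affected area first */
};

#define BASE_PAGE_SIZE 4096u

static enum recovery choose_recovery(enum mem_kind mem,
				     enum fault_kind fault,
				     size_t page_size)
{
	assert(page_size >= BASE_PAGE_SIZE);

	/* Persistent memory: the DIMM is expected to remap the bad spot. */
	if (mem == MEM_PMEM)
		return RECOVER_CLEAR_IN_PLACE;

	/* DDR, 4K page: throwing one small page away is cheap. */
	if (page_size == BASE_PAGE_SIZE)
		return RECOVER_COPY_TO_NEW_PAGE;

	/* DDR, 2M/1G page: clearing in place avoids splitting the large page. */
	switch (fault) {
	case FAULT_TRANSIENT:
		return RECOVER_CLEAR_IN_PLACE;
	case FAULT_PERSISTENT:
		/* Assumption: not stated in the thread; copy out and give up the page. */
		return RECOVER_COPY_TO_NEW_PAGE;
	default:
		return RECOVER_RUN_MEMORY_TEST;
	}
}
```

For example, choose_recovery(MEM_DDR, FAULT_UNKNOWN, 2u << 20) asks for a
memory test before anything is cleared, while any pmem page goes straight to
clearing in place.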
Thanks! Yes, I was only thinking about persistent memory, but
memory_failure_dev_pagemap() applies to DDR as well, depending on the
underlying technology. In our use case, even if the error was just in a
4K page, we'd like to clear the poison and reuse the page to maintain a
contiguous 256MB extent in the filesystem. Perhaps it is better to
leave that to the filesystem and driver.
Regards,
-jane
-Tony