Re: [RFC] Make the memory failure blast radius more precise


 



On 6/24/2020 5:13 PM, Luck, Tony wrote:
Both the RFC patch and the above 5-step recovery plan look neat. Step 4)
is nice to carry forward on Ice Lake, when a single instruction to clear
poison is available.

Jane,

Clearing poison has some challenges.

On persistent memory it probably works (as the DIMM is going to remap that address to a different
part of the media to avoid the bad spot).

On DDR memory you'd need to decide whether the problem was transient, so that a simple
overwrite fixes the problem. Or persistent ... in which case the problem will likely come back
with the right data pattern.  To tell that you may need to run some memory test on the affected
area.

If the error was just in a 4K page, I'd be inclined to copy the good data to a new page and
map that in instead. Throwing away one 4K page isn't likely to be painful.

If it is in a 2M/1G page ... perhaps it is worth the effort and risk of trying to clear the poison
in place to avoid the pain of breaking up a large page.
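The trade-off above can be written down as a small decision function. This is only a sketch of one possible reading of the reasoning, not kernel code: every name in it (mem_kind, recovery_action, choose_recovery, the PAGE_* macros) is hypothetical, and it assumes the caller already knows the memory type and the size of the mapping that contains the poison.

```c
#include <stddef.h>

/* Hypothetical names for illustration only; none of these are kernel APIs. */
enum mem_kind { MEM_PERSISTENT, MEM_DDR };

enum recovery_action {
    ACTION_CLEAR_IN_PLACE,   /* overwrite to clear poison, keep the mapping */
    ACTION_COPY_TO_NEW_PAGE, /* copy the good data to a fresh page, map that in */
    ACTION_TEST_THEN_CLEAR   /* memory-test the area; clear in place only if it passes */
};

#define PAGE_4K ((size_t)4096)
#define PAGE_2M ((size_t)2 << 20)
#define PAGE_1G ((size_t)1 << 30)

/*
 * Assumed policy, following the discussion:
 *  - persistent memory: the DIMM is expected to remap the bad spot,
 *    so clearing in place probably works;
 *  - DDR, 4K page: giving up one small page is cheap, so replace it;
 *  - DDR, 2M/1G page: splitting a large page is painful, so it may be
 *    worth testing the area and then trying to clear in place.
 */
enum recovery_action choose_recovery(enum mem_kind kind, size_t page_size)
{
    if (kind == MEM_PERSISTENT)
        return ACTION_CLEAR_IN_PLACE;

    if (page_size <= PAGE_4K)
        return ACTION_COPY_TO_NEW_PAGE;

    return ACTION_TEST_THEN_CLEAR;
}
```

For example, choose_recovery(MEM_DDR, PAGE_2M) selects the test-then-clear path, while the same error in a 4K DDR page selects the copy-and-replace path. A real policy would also have to deal with the case where the memory test fails on a large page.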

Thanks! Yes, I was only thinking about persistent memory, but memory_failure_dev_pagemap() applies to DDR as well, depending on the underlying technology. In our use case, even if the error is confined to a single 4K page, we'd like to clear the poison and reuse the page so that the filesystem keeps a contiguous 256MB extent. Perhaps it is better to leave that decision to the filesystem and driver.

Regards,
-jane


-Tony




