On 2024/6/22 4:44, Luck, Tony wrote:
So who actually cares about recovering poisoned volatile memory?
I'd like to understand more on how significant a use case this is.
Whilst I can conjecture that it's an extreme case of wanting to avoid
losing the ability to create 1GiB or larger pages due to poison,
is that a real problem for anyone today? Note this is just the case
where you've reached an actual uncorrectable error and probably
/ possibly killed something, not the more common soft offlining
of memory due to correctable errors being detected.
I guess you really need a reply from someone with a data center
with thousands of machines, since that's where this question
may be important.
My humble opinion is that, outside of the huge page issue, nobody
should try to recover a poisoned page. Systems that can report
and recover from poison have tens, hundreds, or more GBytes
of memory. Dropping 4K pages will not have any measurable
impact on a system (even if there are hundreds of pages dropped).
There's no reliable way to determine whether the poisoned page
was due to some transient issue, or a permanent defect. Recovering
a poisoned page runs the risk that the poison will re-occur. Perhaps
next use of the page will be in some unrecoverable (kernel) context.
So recovery has some risk, but very little upside benefit.
Since the hardware provides an instruction (CPU) or command (CXL) to clear
the poison, we could make the function work, at least as an optional
feature. Users could then decide whether to use it after evaluating the
risk and benefit.
I think doing recovery is an incremental improvement, and it may need a lot
of discussion. I'm not sure we can reach a conclusion in this thread.
I just hope for more comments on the original problem (the duplicate report)
that this patch is meant to solve.
--
Thanks,
Ruan.
-Tony