Re: [RFC PATCH] cxl: avoid duplicating report from MCE & device

On 2024/6/22 4:44, Luck, Tony wrote:
>> So who actually cares about recovering poisoned volatile memory?
>> I'd like to understand more on how significant a use case this is.
>> Whilst I can conjecture that it's an extreme case of wanting to avoid
>> losing the ability to create 1GiB or larger pages due to poison,
>> is that a real problem for anyone today?  Note this is just the case
>> where you've reached an actual uncorrectable error and probably
>> / possibly killed something, not the more common soft offlining
>> of memory due to correctable errors being detected.
>
> I guess you really need a reply from someone with a data center
> with thousands of machines, since that's where this question
> may be important.
>
> My humble opinion is that, outside of the huge page issue, nobody
> should try to recover a poisoned page. Systems that can report
> and recover from poison have tens, hundreds, or more GBytes
> of memory. Dropping 4K pages will not have any measurable
> impact on a system (even if there are hundreds of pages dropped).
>
> There's no reliable way to determine whether the poisoned page
> was due to some transient issue, or a permanent defect. Recovering
> a poisoned page runs the risk that the poison will re-occur. Perhaps
> next use of the page will be in some unrecoverable (kernel) context.
>
> So recovery has some risk, but very little upside benefit.

Since the hardware provides an instruction (CPU) or a command (CXL) to clear the poison, we could make the function work, at least as an optional feature. Users could then decide whether to use it after evaluating the risk and benefit.

I think doing recovery is an incremental improvement, and it may need a lot of discussion. I'm not sure we can reach a conclusion in this thread. I just hope for more comments on the original problem this patch is meant to solve (the duplicate report).


--
Thanks,
Ruan.


-Tony



