On Fri, Jun 21, 2024 at 04:53:41PM -0700, Jiaqi Yan wrote:
> Thanks for your comment, Andi.
>
> On Thu, Jun 20, 2024 at 3:53 PM Andi Kleen <ak@xxxxxxxxxxxxxxx> wrote:
> >
> > Jiaqi Yan <jiaqiyan@xxxxxxxxxx> writes:
> >
> > > Correctable memory errors are very common on servers with large
> > > amount of memory, and are corrected by ECC, but with two
> > > pain points to users:
> > > 1. Correction usually happens on the fly and adds latency overhead
> > > 2. Not-fully-proved theory states excessive correctable memory
> > > errors can develop into uncorrectable memory error.
> >
> > This patchkit is amusing (or maybe sad) because it basically tries to
> > reconstruct the original soft offline design using a user space daemon
> > instead of doing policy badly in the kernel.
>
> Some clarifications. I don't intend to reconstruct. I think this
> patchset can also be treated as "patch some missing places so that
> kernel doesn't soft offline behind the back of userspace daemon".
> I agree with you (IIUC) that the policy for corrected memory errors
> should exist in userspace. But the situation is that some behaviors in
> the kernel don't respect that (they either have a reason to not
> respect, or just forget to respect). enable_soft_offline is basically
> the big button in userspace to block these kernel violators.

It would be better to disable them earlier before they waste work
tracking things unnecessarily. But yes it's a step in the right direction.

> >
> > You can still have it by enabling CONFIG_X86_MCELOG_LEGACY and
> > use http://www.mcelog.org or an equivalent daemon of your chosing
> > that listens to /dev/mcelog.
>
> If I didn't miss anything important in
> https://github.com/andikleen/mcelog and
> arch/x86/kernel/cpu/mce/dev-mcelog.c, I don't think /dev/mcelog works
> on ARM platforms where CPER is used to convey hw errors from platform
> to OS.

Yes or not on AMD even.

-Andi