On Fri, Jan 20, 2023 at 03:46:20AM +0000, Jiaqi Yan wrote:
> Today kernel provides following memory error info to userspace, but each
> has its own disadvantage
> * HardwareCorrupted in /proc/meminfo: number of bytes poisoned in total,
>   not per NUMA node stats though
> * ras:memory_failure_event: only available after explicitly enabled
> * /dev/mcelog provides many useful info about the MCEs, but
>   doesn't capture how memory_failure recovered memory MCEs
> * kernel logs: userspace needs to process log text
>
> Exposes per NUMA node memory error stats as sysfs entries:
>
>   /sys/devices/system/node/node${X}/memory_failure/total
>   /sys/devices/system/node/node${X}/memory_failure/recovered
>   /sys/devices/system/node/node${X}/memory_failure/ignored
>   /sys/devices/system/node/node${X}/memory_failure/failed
>   /sys/devices/system/node/node${X}/memory_failure/delayed
>
> These counters describe how many raw pages are poisoned and after the
> attempted recoveries by the kernel, their resolutions: how many are
> recovered, ignored, failed, or delayed respectively. The following
> math holds for the statistics:
> * total = recovered + ignored + failed + delayed
>
> Acked-by: David Rientjes <rientjes@xxxxxxxxxx>
> Signed-off-by: Jiaqi Yan <jiaqiyan@xxxxxxxxxx>

Looks good to me, thank you.

Acked-by: Naoya Horiguchi <naoya.horiguchi@xxxxxxx>
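
For reference, a userspace consumer of the proposed entries could look roughly like the sketch below. It is only an illustration, not part of the patch: it assumes each sysfs file holds a single decimal integer, and the helper names (read_memory_failure_stats, is_consistent, scan_nodes) are made up here. Since the five files are read one after another rather than atomically, the total = recovered + ignored + failed + delayed check could presumably fail for a moment if an error is being handled while the script runs.

```python
from pathlib import Path

# Counter file names as listed in the quoted commit message.
COUNTERS = ("total", "recovered", "ignored", "failed", "delayed")


def read_memory_failure_stats(node_dir):
    """Return {counter: int} for one node directory, or None if the
    memory_failure directory is absent (e.g. a kernel without the patch).

    Assumption: each file contains one decimal integer.
    """
    stats_dir = Path(node_dir) / "memory_failure"
    if not stats_dir.is_dir():
        return None
    return {name: int((stats_dir / name).read_text().strip())
            for name in COUNTERS}


def is_consistent(stats):
    """Check total = recovered + ignored + failed + delayed."""
    resolved = sum(stats[name] for name in COUNTERS if name != "total")
    return stats["total"] == resolved


def scan_nodes(sysfs_root="/sys/devices/system/node"):
    """Collect stats for every nodeN directory that exposes them."""
    results = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        return results
    for node_dir in sorted(root.glob("node[0-9]*")):
        stats = read_memory_failure_stats(node_dir)
        if stats is not None:
            results[node_dir.name] = stats
    return results


if __name__ == "__main__":
    for node, stats in scan_nodes().items():
        status = "consistent" if is_consistent(stats) else "inconsistent"
        print(node, stats, status)
```

On a machine without these entries the script simply prints nothing. The sysfs root is a parameter so the same code can be pointed at a scratch directory for testing.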