On Mon, Jan 16, 2023 at 07:38:59PM +0000, Jiaqi Yan wrote:
> Background
> ==========
> In the RFC for Kernel Support of Memory Error Detection [1], one
> advantage of software-based scanning over the hardware patrol scrubber
> is the ability to make statistics visible to system administrators.
> The statistics fall into 2 categories:
> * Memory error statistics, for example, how many memory errors were
>   encountered and how many of them were recovered by the kernel. Note
>   these memory errors are non-fatal to the kernel: during machine
>   check exception (MCE) handling, the kernel already classified the
>   MCE's severity as not requiring a panic (though it may be action
>   required or action optional).
> * Scanner statistics, for example how many times the scanner has fully
>   scanned a NUMA node, and how many errors were first detected by the
>   scanner.
>
> The memory error statistics are useful to userspace, are not actually
> specific to scanner-detected memory errors, and are the focus of this
> RFC.
>
> Motivation
> ==========
> Memory error stats are important to userspace but insufficient in the
> kernel today. Datacenter administrators can better monitor a machine's
> memory health with these stats visible. For example, while memory
> errors are inevitable on servers with 10+ TB of memory, starting
> server maintenance when there are only 1~2 recovered memory errors
> could be overreacting; in a cloud production environment, maintenance
> usually means live migrating all the workload running on the server,
> which usually causes nontrivial disruption to the customer. Providing
> insight into the scope of memory errors on a system helps determine
> the appropriate follow-up action. In addition, the kernel's existing
> memory error stats need to be standardized so that userspace can
> reliably count on their usefulness.
>
> Today the kernel provides the following memory error info to
> userspace, but it is either insufficient or has disadvantages:
> * HardwareCorrupted in /proc/meminfo: the total number of bytes
>   poisoned, but no per-NUMA-node stats
> * ras:memory_failure_event: only available after being explicitly
>   enabled
> * /dev/mcelog: provides a lot of useful info about MCEs, but doesn't
>   capture how memory_failure recovered memory MCEs
> * kernel logs: userspace needs to process the log text
>
> Exposing memory error stats is also a good start for the in-kernel
> memory error detector. Today the data sources of memory error stats
> are either direct memory error consumption or hardware patrol scrubber
> detection (when signaled as UCNA; those signaled as SRAO are not
> handled by memory_failure).

Sorry, I don't follow this "(...)" part, so let me ask a question.
I thought that SRAO events are handled by memory_failure and UCNA
events are not, so does this say the opposite?

Other than that, the whole description sounds nice and convincing to
me. Thank you for your work.

- Naoya Horiguchi

> Once the in-kernel memory scanner is implemented, it will be the main
> source, as it is usually configured to scan memory DIMMs constantly
> and faster than the hardware patrol scrubber.
>
> How Implemented
> ===============
> As Naoya pointed out [2], exposing memory error statistics to
> userspace is useful independent of a software or hardware scanner.
> Therefore we implement the memory error statistics independently of
> the in-kernel memory error detector.
> It exposes the following per-NUMA-node memory error counters:
>
> /sys/devices/system/node/node${X}/memory_failure/pages_poisoned
> /sys/devices/system/node/node${X}/memory_failure/pages_recovered
> /sys/devices/system/node/node${X}/memory_failure/pages_ignored
> /sys/devices/system/node/node${X}/memory_failure/pages_failed
> /sys/devices/system/node/node${X}/memory_failure/pages_delayed
>
> These counters describe how many raw pages are poisoned and, after the
> kernel's attempted recoveries, their resolutions: how many are
> recovered, ignored, failed, or delayed, respectively. This approach is
> easier to extend for future use cases than /proc/meminfo, trace
> events, or logs. The following math holds for the statistics:
> * pages_poisoned = pages_recovered + pages_ignored + pages_failed +
>   pages_delayed
> * pages_poisoned * page_size = /proc/meminfo/HardwareCorrupted
> These memory error stats are reset at machine boot.
>
> The 1st commit introduces these sysfs entries. The 2nd commit
> populates memory error stats every time memory_failure finishes memory
> error recovery. The 3rd commit adds documentation for the introduced
> stats.
>
> [1] https://lore.kernel.org/linux-mm/7E670362-C29E-4626-B546-26530D54F937@xxxxxxxxx/T/#mc22959244f5388891c523882e61163c6e4d703af
> [2] https://lore.kernel.org/linux-mm/7E670362-C29E-4626-B546-26530D54F937@xxxxxxxxx/T/#m52d8d7a333d8536bd7ce74253298858b1c0c0ac6
>
> Jiaqi Yan (3):
>   mm: memory-failure: Add memory failure stats to sysfs
>   mm: memory-failure: Bump memory failure stats to pglist_data
>   mm: memory-failure: Document memory failure stats
>
>  Documentation/ABI/stable/sysfs-devices-node | 39 +++++++++++
>  drivers/base/node.c                         |  3 +
>  include/linux/mm.h                          |  5 ++
>  include/linux/mmzone.h                      | 28 ++++++++
>  mm/memory-failure.c                         | 71 +++++++++++++++++++++
>  5 files changed, 146 insertions(+)
>
> --
> 2.39.0.314.g84b9a713c41-goog
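[Editor's note] For readers wondering how these counters would be consumed, here is a minimal userspace sketch. Only the sysfs paths and the stated invariant (pages_poisoned = pages_recovered + pages_ignored + pages_failed + pages_delayed) come from the cover letter above; the function names and the sample numbers are hypothetical illustrations, not part of the series.

```python
#!/usr/bin/env python3
"""Sketch of a userspace consumer for the proposed per-node
memory_failure counters. The sysfs layout is taken from the cover
letter; helper names and sample data are illustrative only."""

from pathlib import Path

COUNTERS = ("pages_poisoned", "pages_recovered", "pages_ignored",
            "pages_failed", "pages_delayed")


def read_node_stats(node, sysfs_root="/sys/devices/system/node"):
    """Read all five counters for one NUMA node into a dict.

    Requires a kernel with this patch series applied; each counter is
    a single decimal value in its own sysfs file.
    """
    base = Path(sysfs_root) / f"node{node}" / "memory_failure"
    return {name: int((base / name).read_text()) for name in COUNTERS}


def poisoned_matches_resolutions(stats):
    """Check the invariant stated in the cover letter:
    pages_poisoned == pages_recovered + pages_ignored
                      + pages_failed + pages_delayed."""
    resolved = sum(stats[k] for k in COUNTERS if k != "pages_poisoned")
    return stats["pages_poisoned"] == resolved


if __name__ == "__main__":
    # Made-up sample; a real run on a patched kernel would instead do
    # stats = read_node_stats(0).
    sample = {"pages_poisoned": 5, "pages_recovered": 3,
              "pages_ignored": 1, "pages_failed": 0, "pages_delayed": 1}
    print(poisoned_matches_resolutions(sample))
```

A monitoring agent could run such a check per node and alert only when poisoned-page counts cross a threshold, matching the motivation above that 1~2 recovered errors alone should not trigger maintenance.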