On Tue, Jan 17, 2023 at 1:19 AM HORIGUCHI NAOYA(堀口 直也)
<naoya.horiguchi@xxxxxxx> wrote:
>
> On Mon, Jan 16, 2023 at 07:38:59PM +0000, Jiaqi Yan wrote:
> > Background
> > ==========
> > In the RFC for Kernel Support of Memory Error Detection [1], one
> > advantage of software-based scanning over the hardware patrol scrubber
> > is the ability to make statistics visible to system administrators.
> > The statistics include 2 categories:
> > * Memory error statistics, for example, how many memory errors are
> >   encountered and how many of them are recovered by the kernel. Note
> >   these memory errors are non-fatal to the kernel: during machine check
> >   exception (MCE) handling, the kernel has already classified the MCE's
> >   severity as not requiring a panic (but as either action required or
> >   action optional).
> > * Scanner statistics, for example how many times the scanner has fully
> >   scanned a NUMA node, and how many errors are first detected by the
> >   scanner.
> >
> > The memory error statistics are useful to userspace, are not actually
> > specific to scanner-detected memory errors, and are the focus of this
> > RFC.
> >
> > Motivation
> > ==========
> > Memory error stats are important to userspace but insufficient in the
> > kernel today. Datacenter administrators can better monitor a machine's
> > memory health with visible stats. For example, while memory errors are
> > inevitable on servers with 10+ TB of memory, starting server
> > maintenance when there are only 1~2 recovered memory errors could be
> > overreacting; in a cloud production environment, maintenance usually
> > means live migrating all the workload running on the server, which
> > usually causes nontrivial disruption to the customer. Providing insight
> > into the scope of memory errors on a system helps to determine the
> > appropriate follow-up action. In addition, the kernel's existing memory
> > error stats need to be standardized so that userspace can reliably
> > count on their usefulness.
> >
> > Today the kernel provides the following memory error info to userspace,
> > but each is insufficient or has disadvantages:
> > * HardwareCorrupted in /proc/meminfo: the total number of bytes
> >   poisoned, but with no per-NUMA-node stats
> > * ras:memory_failure_event: only available after being explicitly
> >   enabled
> > * /dev/mcelog: provides much useful info about the MCEs, but doesn't
> >   capture how memory_failure recovered memory MCEs
> > * kernel logs: userspace needs to process log text
> >
> > Exposing memory error stats is also a good start for the in-kernel
> > memory error detector. Today the data sources of memory error stats are
> > either direct memory error consumption or hardware patrol scrubber
> > detection (when signaled as UCNA; those signaled as SRAO are not
> > handled by memory_failure).
>
> Sorry, I don't follow this "(...)" part, so let me ask a question. I
> thought that SRAO events are handled by memory_failure and UCNA events
> are not, so does this say the opposite?

I think UCNA is definitely handled by memory_failure, but I was not
correct about SRAO. According to the Intel® 64 and IA-32 Architectures
Software Developer's Manual, Volume 3B: System Programming Guide, Part 2,
Section 15.6.3, SRAO can be signaled **either via MCE or via CMCI**.
For SRAO signaled via **machine check exception**, my reading of the
current x86 MCE code is this:

1) kill_current_task is initialized to 0, and as long as the restart IP
is valid (MCG_STATUS_RIPV = 1), it remains 0:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/cpu/mce/core.c#n1473

2) after classifying severity, worst should be MCE_AO_SEVERITY:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/cpu/mce/core.c#n1496

3) therefore, do_machine_check just skips kill_me_now and kill_me_maybe,
and directly goes to out:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/cpu/mce/core.c#n1539

For UCNA and SRAO signaled via CMCI, the CMCI handler should eventually
call into memory_failure via uc_decode_notifier
(MCE_UCNA_SEVERITY == MCE_DEFERRED_SEVERITY):
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/cpu/mce/core.c#n579

So it seems the signaling mechanism matters.
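
To make that routing concrete, below is a condensed, userspace-compilable
paraphrase of the flow (an illustrative sketch only, not the real kernel
code; the constant values and the queue_task_work() stub are stand-ins
for the definitions in arch/x86/kernel/cpu/mce/core.c):

/* mce_routing_sketch.c - illustrative paraphrase only, NOT kernel code.
 * Constant values and queue_task_work() are stand-ins; see
 * arch/x86/kernel/cpu/mce/core.c for the real definitions. */
#include <stdio.h>

#define MCG_STATUS_RIPV (1ULL << 0)   /* stand-in: restart IP is valid */

enum severity { MCE_AO_SEVERITY, MCE_AR_SEVERITY };  /* stand-in values */

static void queue_task_work(const char *handler)     /* stub */
{
	printf("queued task work: %s\n", handler);
}

static void do_machine_check_sketch(unsigned long long mcgstatus,
				    enum severity worst)
{
	int kill_current_task = 0;           /* step 1: initialized to 0 */

	if (!(mcgstatus & MCG_STATUS_RIPV))  /* stays 0 while RIP is valid */
		kill_current_task = 1;

	/* step 3: with worst == MCE_AO_SEVERITY (step 2) and a valid
	 * restart IP, neither branch below is taken, so neither
	 * kill_me_now nor kill_me_maybe (and hence memory_failure on
	 * this path) ever runs. */
	if (worst == MCE_AR_SEVERITY)
		queue_task_work("kill_me_maybe");
	else if (kill_current_task)
		queue_task_work("kill_me_now");
	/* else: effectively "goto out" */
}

int main(void)
{
	/* SRAO signaled via MCE: RIPV set, severity classified as AO,
	 * so nothing is queued. */
	do_machine_check_sketch(MCG_STATUS_RIPV, MCE_AO_SEVERITY);
	return 0;
}
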
> Other than that, the whole description sounds nice and convincing to me.
> Thank you for your work.
>
> - Naoya Horiguchi
>
> > Once an in-kernel memory scanner is implemented, it will be the main
> > source, as it is usually configured to scan memory DIMMs constantly
> > and faster than the hardware patrol scrubber.
> >
> > How Implemented
> > ===============
> > As Naoya pointed out [2], exposing memory error statistics to
> > userspace is useful independent of a software or hardware scanner.
> > Therefore we implement the memory error statistics independent of the
> > in-kernel memory error detector. It exposes the following per-NUMA-node
> > memory error counters:
> >
> >   /sys/devices/system/node/node${X}/memory_failure/pages_poisoned
> >   /sys/devices/system/node/node${X}/memory_failure/pages_recovered
> >   /sys/devices/system/node/node${X}/memory_failure/pages_ignored
> >   /sys/devices/system/node/node${X}/memory_failure/pages_failed
> >   /sys/devices/system/node/node${X}/memory_failure/pages_delayed
> >
> > These counters describe how many raw pages are poisoned and, after the
> > attempted recoveries by the kernel, their resolutions: how many are
> > recovered, ignored, failed, or delayed, respectively. This approach is
> > easier to extend for future use cases than /proc/meminfo, trace
> > events, and logs. The following math holds for the statistics:
> > * pages_poisoned = pages_recovered + pages_ignored + pages_failed +
> >   pages_delayed
> > * pages_poisoned * page_size = /proc/meminfo/HardwareCorrupted
> > These memory error stats are reset during machine boot.
> >
> > The 1st commit introduces these sysfs entries. The 2nd commit
> > populates memory error stats every time memory_failure finishes memory
> > error recovery. The 3rd commit adds documentation for the introduced
> > stats.
> >
> > [1] https://lore.kernel.org/linux-mm/7E670362-C29E-4626-B546-26530D54F937@xxxxxxxxx/T/#mc22959244f5388891c523882e61163c6e4d703af
> > [2] https://lore.kernel.org/linux-mm/7E670362-C29E-4626-B546-26530D54F937@xxxxxxxxx/T/#m52d8d7a333d8536bd7ce74253298858b1c0c0ac6
> >
> > Jiaqi Yan (3):
> >   mm: memory-failure: Add memory failure stats to sysfs
> >   mm: memory-failure: Bump memory failure stats to pglist_data
> >   mm: memory-failure: Document memory failure stats
> >
> >  Documentation/ABI/stable/sysfs-devices-node | 39 +++++++++++
> >  drivers/base/node.c                         |  3 +
> >  include/linux/mm.h                          |  5 ++
> >  include/linux/mmzone.h                      | 28 ++++++++
> >  mm/memory-failure.c                         | 71 +++++++++++++++++++++
> >  5 files changed, 146 insertions(+)
> >
> > --
> > 2.39.0.314.g84b9a713c41-goog
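
As a userspace-side illustration of how the proposed counters could be
consumed, here is a minimal sketch that reads the per-node files and
checks the invariant stated in the cover letter (illustrative only: it
assumes the series is applied and that node0 exists; the file name,
helper, and error handling are my own, not part of the series):

/* read_mf_stats.c - illustrative consumer of the proposed sysfs
 * counters; assumes node0 exists and the series is applied. */
#include <stdio.h>

static unsigned long long read_counter(const char *name)
{
	char path[256];
	unsigned long long val = 0;
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/devices/system/node/node0/memory_failure/%s", name);
	f = fopen(path, "r");
	if (!f)
		return 0;   /* counter absent: series not applied? */
	if (fscanf(f, "%llu", &val) != 1)
		val = 0;
	fclose(f);
	return val;
}

int main(void)
{
	unsigned long long poisoned  = read_counter("pages_poisoned");
	unsigned long long recovered = read_counter("pages_recovered");
	unsigned long long ignored   = read_counter("pages_ignored");
	unsigned long long failed    = read_counter("pages_failed");
	unsigned long long delayed   = read_counter("pages_delayed");

	printf("node0: poisoned=%llu recovered=%llu ignored=%llu "
	       "failed=%llu delayed=%llu\n",
	       poisoned, recovered, ignored, failed, delayed);

	/* The cover letter's invariant:
	 * pages_poisoned = pages_recovered + pages_ignored +
	 *                  pages_failed + pages_delayed */
	if (poisoned != recovered + ignored + failed + delayed)
		fprintf(stderr, "invariant violated on node0\n");
	return 0;
}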