Ever-increasing server memory capacity and DRAM cost have brought memory subsystem reliability to the forefront of large fleet owners' concerns. Server and workload crashes caused by memory errors rank #1 among all hardware failures, by a large margin. Deploying extra-reliable DRAM adds significant cost at fleet scale, e.g., a 10% premium on DRAM can amount to hundreds of millions of dollars of spending.

"Reactive" memory poison recovery [3], i.e., recovering from MCEs raised during an execution context (the kernel mechanisms being the MCE handler + CONFIG_MEMORY_FAILURE + SIGBUS to the user space process), has been found effective in keeping systems resilient to memory errors. However, it has several major drawbacks:

1. It requires software systems that access poisoned memory to be specifically designed and implemented to recover from memory errors (see the SIGBUS handler sketch after the lists below):
   - Uncorrectable errors are random and may happen outside of the enlightened address spaces or execution contexts.
   - The added error recovery capability comes at the cost of added complexity, and it is often not possible to enlighten 3rd-party software.
2. In a virtualized environment, the injected MCEs introduce the same challenge to the guest.
3. Because of random execution contexts, CPU errata involving speculative execution, split cache line accesses, hyperthread buddy scheduling, etc. (e.g., [1]) can often turn a recoverable UC error into an unrecoverable MCE and a system crash.
4. Aside from CPU accesses, NICs or other PCIe devices accessing poisoned memory regularly cause host crashes in production systems.
5. In a multi-tenant environment, "reactive" poison recovery is less effective for the largest workloads than for smaller workloads, in that a smaller workload has a much higher chance of being saved cleanly as a victim's neighbor rather than being the victim itself.

The goal is to minimize the probability that any software / hardware component actually gets a chance to consume an error. A possible solution is to "proactively" look for and detect memory errors before they are consumed. Here we assume system software is enlightened to drain the affected host and migrate the running jobs off to another healthy host as soon as an error is detected, and is able to recover from, contain, and emulate the errors that surface during the migration process.

The main benefits, plus one requirement (free memory ratios come from a large production fleet):

1. Memory errors in free memory (~50%) can be completely contained without impacting software / hardware systems later on.
2. Inside a VM guest, memory errors in free guest memory (~50%) can be completely contained with the UCNA injection via CMCI capability being added to KVM in [2].
3. Memory errors on allocated pages must be detected without impacting the execution or performance of the page owners. For instance, in the cloud world, the majority of host memory is pre-allocated as a guest memory pool, and memory errors can emerge well after the guest memory pool is allocated.
4. Early detection and ensured containment (e.g., unmapping and PG_HWPOISON) can effectively prevent most if not all of the crashes due to CPU errata ([3], sections 3.5.2 - 3.5.5) that "reactive" poison recovery cannot avoid; these crashes represent >40% of all host crashes in a production fleet.
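To make drawback #1 above concrete, here is a minimal, purely illustrative sketch (not part of this proposal) of what an "enlightened" user space process has to do today under the reactive model: opt into early poison notification via PR_MCE_KILL and handle BUS_MCEERR_AR / BUS_MCEERR_AO in a SIGBUS handler. A real application additionally has to re-create or re-fetch the affected data, which is exactly the complexity that is hard to retrofit into 3rd-party software.

/*
 * Illustrative sketch of SIGBUS-based "reactive" poison recovery in
 * user space.  Error handling is omitted for brevity.
 */
#define _GNU_SOURCE
#include <signal.h>
#include <setjmp.h>
#include <stdio.h>
#include <string.h>
#include <sys/prctl.h>

static sigjmp_buf recover_point;

static void hwpoison_handler(int sig, siginfo_t *info, void *ucontext)
{
	(void)sig;
	(void)ucontext;

	if (info->si_code == BUS_MCEERR_AR || info->si_code == BUS_MCEERR_AO) {
		/*
		 * info->si_addr is the poisoned user address and
		 * (1UL << info->si_addr_lsb) the affected extent (usually
		 * one page).  A real application would invalidate or
		 * rebuild the affected object; here we just unwind.
		 */
		siglongjmp(recover_point, 1);
	}

	/* Not a memory error: fall back to the default fatal behavior. */
	signal(SIGBUS, SIG_DFL);
	raise(SIGBUS);
}

int main(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = hwpoison_handler;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	sigaction(SIGBUS, &sa, NULL);

	/* Ask for early (action optional) notification of poisoned pages. */
	prctl(PR_MCE_KILL, PR_MCE_KILL_SET, PR_MCE_KILL_EARLY, 0, 0);

	if (sigsetjmp(recover_point, 1)) {
		fprintf(stderr, "recovered from a poisoned page\n");
		return 0;
	}

	/* ... application work that may touch poisoned memory ... */
	return 0;
}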
We evaluated the hardware patrol scrubber [4]: its detection latency (hours to days between error emergence and detection) and its error loss rate due to downgrading (or, otherwise, system instability due to SRAO MCE broadcast and overflow) do not meet the requirements (e.g., ~30 minutes from error emergence to consumption, based on simulation).

A possible solution is a special-purpose userspace poison detector agent that proactively looks for memory errors by invoking an ioctl specifically implemented to avoid the CPU errata, minimize performance interference (cache pollution, etc.), and avoid leaking memory content into registers. The detector agent runs with minimal, configurable resource consumption (e.g., 0.1 core / socket, <0.5% memory bandwidth) and pauses itself when the host system is under heavy load (e.g., CPU > 90% or memory bandwidth > 75%).

The kernel ioctl may take the following form. A potential point of discussion is whether Unmapping guest Private Memory (UPM) will require zapping the kernel direct map; the ioctl can be compiled out in case it is incompatible with use cases like UPM.

/* Could stop and return after the 1st poison is detected */
#define MCESCAN_IOCTL_SCAN 0

struct SysramRegion {
	/* input */
	uint64_t first_byte;	/* first page-aligned physical address to scan */
	uint64_t length;	/* page-aligned length of the memory region to scan */

	/* output */
	uint32_t poisoned;	/* 1 - a poisoned page was found, 0 - otherwise */
	uint32_t poisoned_pfn;	/* PFN of the 1st detected poisoned page */
};

1. https://lore.kernel.org/lkml/164529415398.16921.8042682039148828519.tip-bot2@tip-bot2/
2. https://lore.kernel.org/kvm/20220412223134.1736547-1-juew@xxxxxxxxxx/
3. https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/reduce-server-crash-rate-tencent-paper.pdf

-- 
2.36.0.rc2.479.g8af0fa9b8e-goog