This RFC is a follow-up to [1]. We’d like to first revisit the problem
statement, then explain the motivation for kernel support of memory
error detection. We attempt to answer two key questions raised by the
initial memory-scanning based solution: what memory to scan and how the
scanner should be designed. Unlike what [1] originally proposed, we
think a kernel-driven design similar to khugepaged/kcompactd would work
better than the userspace-driven design.

Problem Statement
=================

The ever-increasing DRAM size and cost have brought memory subsystem
reliability to the forefront of large fleet owners’ concerns. Memory
errors are one of the top hardware failures that cause server and
workload crashes. Simply deploying extra-reliable DRAM hardware to a
large-scale computing fleet adds significant cost, e.g., 10% extra cost
on DRAM can amount to hundreds of millions of dollars.

Reactive memory poison recovery (MPR), e.g., recovering from MCEs raised
during an execution context (the kernel mechanisms are MCE handler +
CONFIG_MEMORY_FAILURE + SIGBUS to the user space process), has been
found effective in keeping systems resilient to memory errors. However,
reactive memory poison recovery has several major drawbacks:

- It requires software systems that access poisoned memory to be
  specifically designed and implemented to recover from memory errors.
  Uncorrectable (UC) errors are random and may happen outside of the
  enlightened address spaces or execution contexts. The added error
  recovery capability comes at the cost of added complexity, and it is
  often impossible to enlighten 3rd party software.
- In a virtualized environment, the injected MCEs introduce the same
  challenge to the guest.
- It only covers MCEs raised by CPU accesses, but the scope of the
  memory error issue is far beyond that. For example, PCIe devices
  (e.g. NIC and GPU) accessing poisoned memory cause host crashes on
  certain machine configs.

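To make the first drawback concrete, here is a minimal userspace sketch
of what "enlightening" a process for reactive MPR might involve. It
assumes the Linux SIGBUS si_code values BUS_MCEERR_AR (action required)
and BUS_MCEERR_AO (action optional) as described in sigaction(2). All
mpr_* names are hypothetical and exist only for this illustration; the
actual recovery logic is application specific and is not shown.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical classification, used only in this sketch. */
enum mpr_action {
    MPR_NOT_MEMORY_ERROR, /* ordinary SIGBUS, not a poison report */
    MPR_ACTION_REQUIRED,  /* BUS_MCEERR_AR: the faulting access hit poison */
    MPR_ACTION_OPTIONAL,  /* BUS_MCEERR_AO: poison reported, not yet consumed */
};

/* Map a SIGBUS si_code to the kind of recovery the process must do. */
static enum mpr_action mpr_classify(int si_code)
{
    switch (si_code) {
    case BUS_MCEERR_AR:
        return MPR_ACTION_REQUIRED;
    case BUS_MCEERR_AO:
        return MPR_ACTION_OPTIONAL;
    default:
        return MPR_NOT_MEMORY_ERROR;
    }
}

/* Last report recorded by the handler (written in signal context). */
static volatile sig_atomic_t mpr_last_action = MPR_NOT_MEMORY_ERROR;
static void *volatile mpr_last_addr;

static void mpr_sigbus_handler(int sig, siginfo_t *info, void *ucontext)
{
    (void)sig;
    (void)ucontext;
    /* Record what kind of error was reported and where. */
    mpr_last_action = mpr_classify(info->si_code);
    mpr_last_addr = info->si_addr;
    /*
     * Assumption: a real application must now drop or rebuild the data
     * at mpr_last_addr. For the action-required case, simply returning
     * would retry the faulting access, so real handlers typically
     * remap the page or jump to a recovery point instead.
     */
}

/* Install the handler; returns 0 on success, -1 on failure. */
static int mpr_install_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = mpr_sigbus_handler;
    sa.sa_flags = SA_SIGINFO; /* needed to receive siginfo_t */
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGBUS, &sa, NULL);
}
```

A process would call mpr_install_handler() once at startup. Even this
small sketch leaves the hard part, deciding what to do with the lost
data, to every individual application, which is the complexity the
first bullet above refers to.
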
We want to upstream a patch set that proactively scans the memory DIMMs
at a configurable rate to detect UC memory errors, and attempts to
recover the detected memory errors. We call it proactive MPR, which
provides three benefits to tackle the memory error problem:

- Proactively scanning memory DIMMs reduces the chance of a correctable
  error becoming uncorrectable.
- Once detected, UC errors caught in unallocated memory pages are
  isolated and prevented from being allocated to an application or the
  OS.
- The probability of software/hardware products encountering memory
  errors is reduced, as they are only exposed to memory errors developed
  over a window of T, where T stands for the period of scrubbing the
  entire memory space. Any memory errors that occurred more than T ago
  should have resulted in custom recovery actions. For example, in a
  cloud environment VMs can be live migrated to another healthy host.

Some CPU vendors [2, 3] provide a hardware patrol scrubber (HPS) to
prevent the build-up of memory errors. In comparison, a software memory
error detector (SW) has pros and cons:

- SW supports adaptive scanning, i.e. it can speed up or slow down
  scanning, turn scanning on or off, and yield its own CPU cycles and
  memory bandwidth. All of these can happen on-the-fly based on the
  system workload status or the administrator’s choice. HPS doesn’t have
  this flexibility. Its patrol speed is usually only configurable at
  boot time, and it is not able to consider system state. (Note: HPS is
  a memory controller feature and usually doesn’t consume CPU time.)
- SW can expose controls to scan by memory type, while HPS always scans
  full system memory. For example, an administrator can use SW to only
  scan hugetlb memory on the system.
- SW can scan memory at a finer granularity, for example, having a
  different scan rate per node, or being entirely disabled on some
  nodes. HPS, however, currently only supports per-host scanning.
- SW can make scan statistics (e.g.
  X bytes have been scanned in the last Y seconds and Z memory errors
  were found) easily visible to datacenter administrators, who can
  schedule maintenance (e.g. migrating running jobs before repairing
  DIMMs) accordingly.
- SW’s functionality is consistent across hardware platforms. HPS’s
  functionality varies from vendor to vendor. For example, some vendors
  support shorter scrubbing periods than others, and some vendors may
  not support memory scrubbing at all.
- HPS usually doesn’t consume CPU cores but does consume memory
  controller cycles and memory bandwidth. SW consumes both CPU cycles
  and memory bandwidth, but this is only a concern if administrators opt
  into the scanning after weighing the costs against the benefits.
- As CPU cores are not consumed by HPS, there won’t be any cache impact.
  SW can utilize prefetchnta (for x86) [4] and equivalent hints for
  other architectures [5] to minimize cache impact (in the case of
  prefetchnta, completely avoiding L1/L2 cache impact).

Solution Proposals
==================

What to Scan
============

The initial RFC proposed to scan the **entire system memory**, which
raised the question of what memory is scannable (i.e. memory accessible
from the kernel direct mapping). We attempt to address this question by
breaking down the memory types as follows:

- Static memory types: memory that either stays scannable or stays
  unscannable. Well-defined examples are hugetlb vs regular memory, and
  node-local memory vs far memory (e.g. CXL or PMEM). While most static
  memory types are scannable, administrators could disable scanning far
  memory to avoid interfering with the promotion and demotion logic in
  memory tiering solutions. (The implementation will allow
  administrators to disable scanning on scannable memory.)
- Memory types related to virtualization, including ballooned-away
  memory and unaccepted memory. Not all balloon implementations are
  compatible with memory scanning (i.e. reading memory mapped into the
  direct map) in the guest.
  For example, with the virtio-mem devices [6] in the hypervisor,
  reading unplugged memory can cause undefined behavior. The same
  applies to unaccepted memory in confidential VMs [7]. Since memory
  error detection on the host side already benefits its guests
  transparently (i.e., spending no extra guest CPU cycles), there is
  very limited benefit for a guest to scan memory by itself. We
  recommend disabling memory error detection within the virtualization
  world.
- Dynamic memory types: memory that turns unscannable or scannable
  dynamically. One significant example is guest private memory backing
  confidential VMs. At the software level, guest private memory pages
  become unscannable as they will soon be unmapped from the kernel
  direct mapping [8]. Scanning guest private memory pages is still
  possible by IO remapping, with a foreseen performance sacrifice and
  proper page fault handling (skip the page) if all means of mapping
  fail. At the hardware level, the memory access implementation done by
  hardware vendors today puts memory integrity checking prior to memory
  ownership checking, which means memory errors are still surfaced to
  the OS while scanning. For the scanning scheme to keep working in the
  future, we need the hardware vendors to keep providing similar error
  detection behavior in their confidential VM hardware. We believe this
  is a reasonable ask, as their hardware patrol scrubbers adopt the same
  scanning scheme and therefore rely on the same behavior. Otherwise, we
  can switch to whatever new scheme the patrol scrubbers use if that
  behavior changes.

How Scanning is Designed
========================

We can support kernel memory error detection in two styles, depending
on whether the kernel itself or a userspace application drives the
detection.

In the first style, the kernel itself can create a kthread on each NUMA
node for scanning node-local memory (i.e. IORESOURCE_SYSTEM_RAM).
These scanning kthreads are scheduled in a way similar to how khugepaged
or kcompactd works. Basic configurations of the ever-schedulable
background kthreads can be exposed to userspace via sysfs, for example,
sleeping for X milliseconds after scanning Y raw memory pages. Scanning
statistics can also be made visible to userspace via sysfs, for example,
the number of pages actually scanned and the number of memory errors
found.

On the other hand, memory error detection can be driven by root
userspace applications with sufficient support from the kernel. For
example, a process can scrutinize physical memory under its own virtual
memory space on demand. The support from the kernel consists of the most
basic operations: specifically designed memory read access (e.g.
avoiding the CPU erratum, minimizing cache pollution, and avoiding
leaking the memory content), and machine check exception handling plus
memory failure handling [9] when a memory error is detected.

The pros and cons of in-kernel background scanning are:

- A simple and independent component for scanning system memory
  constantly and regularly, which improves the machine fleet’s memory
  health (e.g., for hyperscalers, cloud providers, etc).
- The rest of the OS (both kernel and application) can benefit from it
  without explicit modifications.
- The efficiency of this approach is easily configurable by scan rate.
- It cannot offer an on-the-spot guarantee. There is no good way to
  prioritize certain chunks of memory.
- The implementation of this approach needs to deal with the question
  of whether a memory page is scannable.

The pros and cons of the application-driven approach are:

- An application can scan a specific chunk of memory on the spot, and is
  able to prioritize scanning of some memory regions or memory types.
- A memory error detection agent needs to be designed to proactively,
  constantly and regularly scan the entire memory.
- A kernel API needs to be designed to give userspace enough power to
  scan physical memory.
  For example, memory regions requested by multiple applications may
  overlap. Should the kernel API support combined scanning?
- The application is exposed to the question of whether memory is
  scannable, and needs to deal with the complexity of ensuring memory
  stays scannable during the scanning process.

We prefer the in-kernel background approach for its simplicity, but are
open to all opinions from the upstream community.

[1] https://lore.kernel.org/linux-mm/20220425163451.3818838-1-juew@xxxxxxxxxx
[2] https://developer.amd.com/wordpress/media/2012/10/325591.pdf
[3] https://community.intel.com/t5/Server-Products/Uncorrectable-Memory-Error-amp-Patrol-Scrub/td-p/545123
[4] https://www.amd.com/system/files/TechDocs/24594.pdf, page 285
[5] https://developer.arm.com/documentation/den0024/a/The-A64-instruction-set/Memory-access-instructions/Non-temporal-load-and-store-pair
[6] https://lore.kernel.org/kvm/20200311171422.10484-1-david@xxxxxxxxxx
[7] https://lore.kernel.org/linux-mm/20220718172159.4vwjzrfthelovcty@xxxxxxxxxxxxxxxxxx/t/
[8] https://lore.kernel.org/linux-mm/20220706082016.2603916-1-chao.p.peng@xxxxxxxxxxxxxxx
[9] https://www.kernel.org/doc/Documentation/vm/hwpoison.rst

--
2.38.1.273.g43a17bfeac-goog