Introduction and Motivation
===========================

Userspace was recently given control over how the kernel handles memory with corrected memory errors [1]. This RFC extends userspace's control to how the kernel deals with uncorrectable memory errors, so that userspace can control all aspects of memory failure recovery (MFR).

Why does userspace need to play a role in MFR? There are two major use cases, both from the cloud provider's perspective.

The first use case is 1G HugeTLB, which provides a critical optimization for Virtual Machines (VMs) running database-centric and data-intensive workloads that require both large memory (hundreds of GB or several TB [2]) and high-performance address mapping. These VMs usually also require high availability, so host RAS features are expected to tolerate and recover from the inevitable uncorrectable memory errors in order to sustain long VM uptime (the SLA is 99.95% Monthly Uptime [3]). Due to the 1GB granularity, once a single byte of memory in a hugepage is hardware corrupted, the kernel discards the whole 1G hugepage from the HugeTLB system: not only the corrupted bytes but also the healthy portion. In the cloud environment this is a great loss of memory to the VM, and it puts the VM in a dilemma: although the host is able to keep serving the VM, the VM itself struggles to continue its data-intensive workload after the unnecessary loss of ~1G of data. On the other hand, simply terminating the VM greatly reduces its uptime, given how frequently uncorrectable memory errors occur.

There was an RFC [4] that used high granularity mapping [5] to recover from HugeTLB memory failures more efficiently. However, it faded away together with the upstream proposal for high granularity mapping.

The second use case comes from the discussion of MFR for huge VM_PFNMAP [6], which is meant to greatly improve TLB hits for PCI MMIO BARs and kernel-unmanaged host primary memory.
They are most relevant for VMs that run Machine Learning (ML) workloads, which also require reliable VM uptime (see [7] for details). There is no official MFR support for VM_PFNMAP yet, but [8] has been proposed. It unmaps non-struct-page memory without considering huge VM_PFNMAP. Peter, Jason and I discussed what the MFR behavior for huge VM_PFNMAP should be [9]: if the driver originally VM_PFNMAP-ed with PUD, it must first zap the PUD, and then intercept future page faults to either install PTE/PMD for clean PFNs, or return VM_FAULT_HWPOISON for poisoned PFNs. Zapping the PUD means there will be a huge hole in the EPT or stage-2 (S2) page table, causing a lot of EPT or S2 violations that need to be fixed up by the device driver. There will be noticeable VM performance degradation, not only while the hole is being refilled but also afterwards, as the EPT or S2 is by then fragmented.

For both use cases, HARD_OFFLINE behavior in MFR arguably does more harm than good to the VM. For the first case, if we simply leave the 1G HugeTLB hugepage mapped, VM access to the clean PFNs within the poisoned 1G region still works well; we just need to keep sending SIGBUS to userspace when poisoned PFNs are re-accessed, to stop corrupted data from propagating. For the second case, if the PUD is not zapped, there is no need for the driver to intercept page faults to clean memory on HBM or EGM. In addition, in both cases there is no EPT or S2 violation, so there is no performance cost for accessing clean guest pages already mapped in EPT and S2.

Keeping or discarding a large chunk of memory in response to a memory error therefore sounds like something that userspace should be able to control. The virtual machine monitor can choose the better behavior for the VM it manages.

Background and Terminology
==========================

First I want to set the scope of the userspace control within the kernel's memory error containment and recovery process, which is drawn in [10]:

1. 
Interpret Hardware Fault: respond immediately to decode the exception generated by the hardware / firmware. On X86 the kernel needs to interpret machine check exceptions; on ARM the kernel needs to interpret the Common Platform Error Records (CPER).
2. Classify Error: the kernel classifies memory errors (white boxes in [10]).
3. Result: different memory errors end up with different results (gray boxes in [10]).

The scope of the control exposed to userspace is Memory Failure Recovery (highlighted in red in [10]), the only part that is relevant to userspace and needs improvement. The userspace policy defines what userspace wants the kernel to do to the memory page in this stage. If a policy is specified by userspace, the kernel performs actions conforming to it; otherwise the behavior is exactly what the kernel does today. Once the recovery actions that must be executed by the kernel, if any, are done, the kernel engages the relevant userspace threads, which must be signaled to prevent data corruption and must participate in MFR.

Depending on whether the memory error is corrected or uncorrectable, a memory page is referred to as
- Error page/hugepage/folio: the raw/huge/(either raw or huge) page consisting of the physical memory poisoned by the platform's RAS.
- Corrected page/hugepage/folio: the raw/huge/(either raw or huge) page consisting of the physical memory corrected by the platform's RAS.

We propose two design options for the uAPI of the MFR policy.

Option 1. Global MFR Policy
===========================

The global MFR policy depicted in [11] is "global" in the sense that
1. It is for the entire system-level memory managed by the kernel, regardless of the underlying memory type.
2. It applies to all userspace threads, regardless of whether the physical memory is currently backing any VMA (free memory) or which VMAs it is backing.
3. It applies to PCI(e) device memory (e.g. 
HBM on GPU) as well, on the condition that the device driver deliberately wants to follow the kernel's MFR, instead of having memory failures handled entirely by the device driver (e.g. drivers/nvdimm/pmem.c).

Drawn in blue, the kernel initializes the MFR policy with the default policy to ensure two things.
- There is always a policy available during memory failure recovery.
- The default MFR policy is compatible with the existing MFR behavior in the kernel today.

MFR Config, which can be any binary that runs in userspace with root privilege, configures or modifies the MFR policy offline with respect to MFR execution. MFR Config is independent of, and does not interact with, stakeholder threads at all. MFR Config establishes the source-of-truth policy that is enacted when memory errors occur.

The userspace API to set/get the policy is via sysctl:
- /proc/sys/vm/enable_soft_offline: whether to SOFT_OFFLINE corrected folio
- /proc/sys/vm/enable_hard_offline: whether to HARD_OFFLINE error folio

SOFT_OFFLINE and HARD_OFFLINE are the current upstream behavior and will be used as the default policy when initializing the global MFR policy.

There is one important thing to point out about /proc/sys/vm/enable_hard_offline = 0: the kernel does NOT set the HWPoison flag in the struct page or struct folio. This has implications, because the kernel then no longer enforces isolation of the poisoned page (which it used to do by setting the HWPoison flag) to prevent both userspace and the kernel from consuming the memory error and causing a hardware fault again:

1. Userspace already has sufficient capability to prevent itself from consuming the memory error and causing a hardware fault: with the poisoned virtual address sent in SIGBUS, it can ask the kernel to remap the poisoned page with data loss, or simply abort the memory load operation. That being said, there needs to be a mechanism to detect and forcibly kill a malicious userspace thread if it keeps ignoring SIGBUS and generates hardware faults repeatedly.
2. 
The kernel won't be able to forbid the reuse of free error pages in future memory allocations. If an error page is allocated to the kernel, a kernel panic is most likely to happen when the kernel consumes it. For userspace, it is no longer guaranteed that newly allocated memory is free of memory errors.

Option 2. Per-VMA MFR
=====================

This design provides a policy for every struct vm_area_struct (VMA). A VMA depicts a virtual memory area (a virtual address interval of contiguous memory that all shares the same characteristics), and its granularity is per VM-area and per task. One major usage of the VMA is to associate a virtual memory area of a process with a special rule for the page-fault handlers. Since the recovery action is mainly about page faults (PF), i.e. how the PF handler behaves with respect to a corrected or error page, attaching the MFR policy to the VMA sounds like a natural fit.

The interface for a userspace thread to set the MFR policy for one of its own virtual memory areas is

  int madvise(void *vaddr, size_t length, int MADV_MFR_XXX)

or, if we want some "MFR master" to be able to assign policy to other threads of interest:

  int process_madvise(int pidfd, const struct iovec iovec[.n], size_t n,
                      int MADV_MFR_XXX, unsigned int flags)

where MADV_MFR_XXX is:
- MADV_MFR_HARD_OFFLINE: HARD_OFFLINE error folio.
- MADV_MFR_SOFT_OFFLINE: SOFT_OFFLINE corrected folio.
- MADV_MFR_KEEP_ERROR_MAPPED: keep error folio mapped to preserve the ability for userspace to re-access the error folio.
- MADV_MFR_KEEP_CORRECTED_MAPPED: keep corrected folio mapped to preserve the ability for userspace to re-access the corrected folio.

MADV_MFR_KEEP_ERROR_MAPPED is the inverse of MADV_MFR_HARD_OFFLINE. MADV_MFR_KEEP_CORRECTED_MAPPED is the inverse of MADV_MFR_SOFT_OFFLINE. {MADV_MFR_HARD_OFFLINE, MADV_MFR_KEEP_ERROR_MAPPED} are independent of {MADV_MFR_SOFT_OFFLINE, MADV_MFR_KEEP_CORRECTED_MAPPED}.

The exact behaviors of SOFT_OFFLINE and HARD_OFFLINE are specific to the page types.
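
The inverse and independence relations among the four MADV_MFR_XXX values can be sketched as two independent bits per VMA. This is only an illustrative userspace model, not kernel code: MfrPolicy, DEFAULT_POLICY and apply_advice are hypothetical names made up for this sketch, the advice values are passed as strings because the MADV_MFR_XXX constants are only proposed in this RFC, and the starting state assumes both offline behaviors are enabled when a VMA is created.

```python
from enum import Flag


class MfrPolicy(Flag):
    """Hypothetical model of a VMA's MFR policy bits (not a kernel type)."""
    SOFT_OFFLINE = 1  # SOFT_OFFLINE corrected folios
    HARD_OFFLINE = 2  # HARD_OFFLINE error folios


# Assumed starting state: both offline behaviors enabled.
DEFAULT_POLICY = MfrPolicy.SOFT_OFFLINE | MfrPolicy.HARD_OFFLINE

# Each proposed advice value touches exactly one bit: it sets or clears it.
_ADVICE = {
    "MADV_MFR_SOFT_OFFLINE": (MfrPolicy.SOFT_OFFLINE, True),
    "MADV_MFR_KEEP_CORRECTED_MAPPED": (MfrPolicy.SOFT_OFFLINE, False),
    "MADV_MFR_HARD_OFFLINE": (MfrPolicy.HARD_OFFLINE, True),
    "MADV_MFR_KEEP_ERROR_MAPPED": (MfrPolicy.HARD_OFFLINE, False),
}


def apply_advice(policy: MfrPolicy, advice: str) -> MfrPolicy:
    """Return the policy after one (hypothetical) madvise(MADV_MFR_XXX) call."""
    bit, enable = _ADVICE[advice]
    return policy | bit if enable else policy & ~bit
```

In this model, applying MADV_MFR_KEEP_ERROR_MAPPED to a VMA clears only the HARD_OFFLINE bit and leaves the handling of corrected folios untouched, which is the independence property stated above.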
In general, what will happen under SOFT_OFFLINE or HARD_OFFLINE is
- Some pages, including pages not impacted by the memory error, can be unmapped from userspace. The PF handler sends SIGBUS to userspace for every unmapped page.
- Some pages, including pages not impacted by the memory error, can be migrated to pages somewhere else.
- A transparent hugepage will be split into raw pages.
- A HugeTLB hugepage will be dissolved into raw pages.
- PUD/PMD/PTE installed for (huge) VM_PFNMAP will be removed.

The default MFR policy is (MADV_MFR_SOFT_OFFLINE | MADV_MFR_HARD_OFFLINE) for every VMA at its creation time, to be consistent with the current kernel's MFR behavior.

Unmap and Page Fault Behavior with Per-VMA Policy
=================================================

Before describing the kernel behavior with the proposed per-VMA MFR policy, one thing that is not changed by the per-VMA MFR policy but is well worth pointing out: regardless of the MFR policy configured on the VMA, the kernel sets the HWPoison flag on the struct page or struct folio.

The new MFR and page fault behavior with per-VMA policy is illustrated in [12]; here is a walkthrough with some pseudocode.

If the PF handler hasn't yet seen that the error page is HWPoison, a load from corrupted physical memory will kick off the memory error containment process depicted in [10], and for the scope we care about (corrected + recoverable uncorrected memory errors), the kernel gets into the MFR box:

  page = pfn_to_page(pfn)
  pgoff = page_to_offset(page)
  SetPageHWPoison(page)                                   // step 6 in [12]
  for thread in collect_procs_mapped(page):
    for vma in vma_interval_tree(thread, pgoff):
      if vma's MFR policy == MADV_MFR_HARD_OFFLINE:       // step 7 in [12]
        try_to_unmap(vma, page)                           // step 8 in [12]
      vaddr = page_address_in_vma(vma, page)
      kill_proc(thread, vaddr)                            // step 9 in [12]

There are two possible outcomes for the owning thread after the kernel sends SIGBUS:
- The thread is terminated, and no access to the error page is possible.
- The thread performs some recovery action and keeps running.
In this case, the thread is able to attempt to re-access the poisoned memory (step 1 in [12] "re-access HWPOISON").

When all threads that share the HWPoison-flagged folio exit, this problematic page will be isolated and prevented from future allocation. However, before that happens, there could still be a surviving thread or a sharing thread that attempts to access the HWPoison-flagged folio. For these threads, the PF handler needs to handle the access to the HWPoison-flagged folio, either because the folio is already unmapped somehow (e.g. by Memory Failure Recovery), or because the folio is not yet mapped to the thread:

  if PageHWPoison(page):                                  // step 2 in [12]
    if vma's MFR policy != MADV_MFR_KEEP_ERROR_MAPPED:    // step 3 in [12]
      kill_proc(current, vaddr)                           // step 4 in [12]
      return VM_FAULT_HWPOISON
  resolve the PF successfully
  install VA => PFN in thread's page table

If the page fault is resolved and the mapping is installed into the page table successfully, or there isn't a PF at all because the mapping is still present in the thread's page table, there are two possible outcomes for a memory load from a folio that contains raw error page(s):
- Sad case: the underlying physical memory mapped at vaddr has an uncorrectable memory error. Memory error consumption kicks off the memory error containment process depicted in [10].
- Happy case: the underlying physical memory mapped at vaddr is clean, i.e. the thread is accessing a healthy raw page in the compound hugepage. The thread gets the clean data it asked for.

The worst case is that a thread repeatedly hits the sad case. It spreads sadness to the entire system in the manner of a denial of service. What can we do to mitigate this Loop of Sadness? Forcibly unmapping the poisoned folio won't help; we must forcibly terminate the thread with SIGKILL.

When different VMAs backed by the same chunk of memory, or physical address range (PAR), want conflicting MFR policies, how should the conflict be resolved? [13] illustrates an example of this.
VMA1 wants MADV_MFR_KEEP_ERROR_MAPPED but VMA2 wants MADV_MFR_HARD_OFFLINE. When MFR deals with an error hugepage in PAR2, and the hugepage is kept mapped in VMA1 but unmapped from VMA2 at the same time, should MFR dissolve the hugepage or leave it as it is?

I think we can treat MADV_MFR_XXX as a non-persistent property of the VMA, like MADV_FREE or MADV_DONTNEED. After all, with a VMA established, a thread can already modify the underlying physical memory content, so why can't it change the MFR policy on it? A simple resolution to the conflicting MADV_MFR_XXX for PAR2 is that the policy is established by the thread that mapped it and last wrote it.

Pros and Cons
=============

Part of option 1 (/proc/sys/vm/enable_soft_offline) has already been merged upstream. It already provides userspace the control necessary for corrected memory errors.
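
For reference, an "MFR Config" style binary could inspect and set the option 1 knobs with something as small as the sketch below. It is illustrative only: read_mfr_sysctl and write_mfr_sysctl are hypothetical helper names, /proc/sys/vm/enable_hard_offline exists only as a proposal in this RFC (so reading it is expected to find nothing on current kernels), and writing a real sysctl requires root privilege.

```python
from pathlib import Path
from typing import Optional

SYSCTL_DIR = Path("/proc/sys/vm")


def read_mfr_sysctl(name: str, sysctl_dir=SYSCTL_DIR) -> Optional[int]:
    """Return the knob's integer value, or None if the kernel lacks the knob."""
    try:
        return int((Path(sysctl_dir) / name).read_text().strip())
    except FileNotFoundError:
        return None


def write_mfr_sysctl(name: str, enabled: bool, sysctl_dir=SYSCTL_DIR) -> None:
    """Write 1 or 0 to the knob; needs root when pointed at /proc/sys/vm."""
    (Path(sysctl_dir) / name).write_text("1\n" if enabled else "0\n")
```

Calling read_mfr_sysctl("enable_soft_offline") on a kernel that carries the merged knob returns its current setting, while read_mfr_sysctl("enable_hard_offline") returns None until something like option 1 is fully implemented. The sysctl_dir parameter exists only so the helpers can be exercised against a scratch directory.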