Give userspace the control to enable or disable HARD_OFFLINE of an error folio (either a raw page or a hugepage). By default, HARD_OFFLINE is enabled, consistent with existing memory_failure behavior.

Userspace should be able to control whether to keep or discard a large chunk of memory in the event of uncorrectable memory errors. There are two major use cases in cloud environments.

The 1st case is a 1G HugeTLB-backed database workload. Compared to discarding the hugepage when only a single PFN is impacted by an uncorrectable memory error, if the kernel simply leaves the 1G hugepage mapped, accesses to the majority of clean PFNs within the poisoned 1G region still work well for the VM and the workload.

The 2nd case is MMIO device memory or EGM [1] mapped to userspace via huge VM_PFNMAP [2]. If the kernel does not zap the PUD or PMD, the VFIO driver that manages the memory does not need to intercept page faults for clean PFNs and reinstall PTEs.

In addition, in both cases there is no EPT or stage-2 (S2) violation, so there is no performance cost for accessing clean guest pages already mapped in EPT or S2.

See the cover letter for more details on why userspace needs such control, and the implications when userspace chooses to disable HARD_OFFLINE.

If this RFC receives generally positive feedback, I will add a selftest in v2.
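For illustration, this is how userspace would be expected to exercise the proposed knob (a sketch assuming a kernel with this patch applied; the path follows the sysctl table added in mm/memory-failure.c):

```shell
# HARD_OFFLINE defaults to enabled (1), matching existing behavior
cat /proc/sys/vm/enable_hard_offline

# Disable hard offlining: on a subsequent uncorrectable error,
# memory_failure() signals affected processes but returns -EOPNOTSUPP
# instead of unmapping and discarding the folio
echo 0 > /proc/sys/vm/enable_hard_offline

# Restore the default behavior
echo 1 > /proc/sys/vm/enable_hard_offline
```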
[1] https://developer.nvidia.com/blog/nvidia-grace-hopper-superchip-architecture-in-depth/#extended_gpu_memory
[2] https://lore.kernel.org/linux-mm/20240828234958.GE3773488@xxxxxxxxxx/T/#m413a61acaf1fc60e65ee7968ab0ae3093f7b1ea3

Signed-off-by: Jiaqi Yan <jiaqiyan@xxxxxxxxxx>
---
 mm/memory-failure.c | 33 +++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 7066fc84f351..a7b85b98d61e 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -70,6 +70,8 @@ static int sysctl_memory_failure_recovery __read_mostly = 1;
 
 static int sysctl_enable_soft_offline __read_mostly = 1;
 
+static int sysctl_enable_hard_offline __read_mostly = 1;
+
 atomic_long_t num_poisoned_pages __read_mostly = ATOMIC_LONG_INIT(0);
 
 static bool hw_memory_failure __read_mostly = false;
@@ -151,6 +153,15 @@ static struct ctl_table memory_failure_table[] = {
 		.proc_handler = proc_dointvec_minmax,
 		.extra1 = SYSCTL_ZERO,
 		.extra2 = SYSCTL_ONE,
+	},
+	{
+		.procname = "enable_hard_offline",
+		.data = &sysctl_enable_hard_offline,
+		.maxlen = sizeof(sysctl_enable_hard_offline),
+		.mode = 0644,
+		.proc_handler = proc_dointvec_minmax,
+		.extra1 = SYSCTL_ZERO,
+		.extra2 = SYSCTL_ONE,
 	}
 };
 
@@ -2223,6 +2234,14 @@ int memory_failure(unsigned long pfn, int flags)
 
 	p = pfn_to_online_page(pfn);
 	if (!p) {
+		/*
+		 * For ZONE_DEVICE memory and memory on special architectures,
+		 * assume they have opted out of the core kernel's MFR. Since
+		 * this memory can still be mapped to userspace, let userspace
+		 * know MFR doesn't apply.
+		 */
+		pr_info_once("%#lx: can't apply global MFR policy\n", pfn);
+
 		res = arch_memory_failure(pfn, flags);
 		if (res == 0)
 			goto unlock_mutex;
@@ -2241,6 +2260,20 @@ int memory_failure(unsigned long pfn, int flags)
 		goto unlock_mutex;
 	}
 
+	/*
+	 * On ARM64, if APEI fails to claim the SEA (e.g. the GHES driver
+	 * doesn't register for SEA notifications from firmware), memory_failure
+	 * will never be synchronous to the error-consuming thread. Notifying
+	 * it via SIGBUS synchronously has to be done either by the core kernel
+	 * in do_mem_abort, or by KVM in kvm_handle_guest_abort.
+	 */
+	if (!sysctl_enable_hard_offline) {
+		pr_info_once("%#lx: disabled by /proc/sys/vm/enable_hard_offline\n", pfn);
+		kill_procs_now(p, pfn, flags, page_folio(p));
+		res = -EOPNOTSUPP;
+		goto unlock_mutex;
+	}
+
 try_again:
 	res = try_memory_failure_hugetlb(pfn, flags, &hugetlb);
 	if (hugetlb)
-- 
2.46.0.792.g87dc391469-goog