Hi, ALL, I have rewritten the cover letter with the hope that the maintainer will truly understand the necessity of this patch. Both Alibaba and Huawei met the same issue in products, and we hope it could be fixed ASAP. ## Changes Log changes since v8: - remove the bug fix tag of patch 2 (per Jarkko Sakkinen) - remove the declaration of memory_failure_queue_kick (per Naoya Horiguchi) - rewrite the return value comments of memory_failure (per Naoya Horiguchi) changes since v7: - rebase to Linux v6.6-rc2 (no code changed) - rewritten the cover letter to explain the motivation of this patchset changes since v6: - add more explicty error message suggested by Xiaofei - pick up reviewed-by tag from Xiaofei - pick up internal reviewed-by tag from Baolin changes since v5 by addressing comments from Kefeng: - document return value of memory_failure() - drop redundant comments in call site of memory_failure() - make ghes_do_proc void and handle abnormal case within it - pick up reviewed-by tag from Kefeng Wang changes since v4 by addressing comments from Xiaofei: - do a force kill only for abnormal sync errors changes since v3 by addressing comments from Xiaofei: - do a force kill for abnormal memory failure error such as invalid PA, unexpected severity, OOM, etc - pcik up tested-by tag from Ma Wupeng changes since v2 by addressing comments from Naoya: - rename mce_task_work to sync_task_work - drop ACPI_HEST_NOTIFY_MCE case in is_hest_sync_notify() - add steps to reproduce this problem in cover letter changes since v1: - synchronous events by notify type - Link: https://lore.kernel.org/lkml/20221206153354.92394-3-xueshuai@xxxxxxxxxxxxxxxxx/ ## Cover Letter There are two major types of uncorrected recoverable (UCR) errors : - Action Required (AR): The error is detected and the processor already consumes the memory. OS requires to take action (for example, offline failure page/kill failure thread) to recover this error. - Action Optional (AO): The error is detected out of processor execution context. Some data in the memory are corrupted. But the data have not been consumed. OS is optional to take action to recover this error. The main difference between AR and AO errors is that AR errors are synchronous events, while AO errors are asynchronous events. Synchronous exceptions, such as Machine Check Exception (MCE) on X86 and Synchronous External Abort (SEA) on Arm64, are signaled by the hardware when an error is detected and the memory access has architecturally been executed. Currently, both synchronous and asynchronous errors are queued as AO errors and handled by a dedicated kernel thread in a work queue on the ARM64 platform. For synchronous errors, memory_failure() is synced using a cancel_work_sync trick to ensure that the corrupted page is unmapped and poisoned. Upon returning to user-space, the process resumes at the current instruction, triggering a page fault. As a result, the kernel sends a SIGBUS signal to the current process due to VM_FAULT_HWPOISON. However, this trick is not always be effective, this patch set improves the recovery process in three specific aspects: 1. Handle synchronous exceptions with proper si_code ghes_handle_memory_failure() queue both synchronous and asynchronous errors with flag=0. Then the kernel will notify the process by sending a SIGBUS signal in memory_failure() with wrong si_code: BUS_MCEERR_AO to the actual user-space process instead of BUS_MCEERR_AR. The user-space processes rely on the si_code to distinguish to handle memory failure. For example, hwpoison-aware user-space processes use the si_code: BUS_MCEERR_AO for 'action optional' early notifications, and BUS_MCEERR_AR for 'action required' synchronous/late notifications. Specifically, when a signal with SIGBUS_MCEERR_AR is delivered to QEMU, it will inject a vSEA to Guest kernel. In contrast, a signal with SIGBUS_MCEERR_AO will be ignored by QEMU.[1] Fix it by seting memory failure flags as MF_ACTION_REQUIRED on synchronous events. (PATCH 1) 2. Handle memory_failure() abnormal fails to avoid a unnecessary reboot If process mapping fault page, but memory_failure() abnormal return before try_to_unmap(), for example, the fault page process mapping is KSM page. In this case, arm64 cannot use the page fault process to terminate the synchronous exception loop.[4] This loop can potentially exceed the platform firmware threshold or even trigger a kernel hard lockup, leading to a system reboot. However, kernel has the capability to recover from this error. Fix it by performing a force kill when memory_failure() abnormal fails or when other abnormal synchronous errors occur. These errors can include situations such as invalid PA, unexpected severity, no memory failure config support, invalid GUID section, OOM, etc. (PATCH 2) 3. Handle memory_failure() in current process context which consuming poison When synchronous errors occur, memory_failure() assume that current process context is exactly that consuming poison synchronous error. For example, kill_accessing_process() holds mmap locking of current->mm, does pagetable walk to find the error virtual address, and sends SIGBUS to the current process with error info. However, the mm of kworker is not valid, resulting in a null-pointer dereference. I have fixed this in[3]. commit 77677cdbc2aa mm,hwpoison: check mm when killing accessing process Another example is that collect_procs()/kill_procs() walk the task list, only collect and send sigbus to task which consuming poison. But memory_failure() is queued and handled by a dedicated kernel thread on arm64 platform. Fix it by queuing memory_failure() as a task work which runs in current execution context to synchronously send SIGBUS before ret_to_user. (PATCH 2) ** In summary, this patch set handles synchronous errors in task work with proper si_code so that hwpoison-aware process can recover from errors, and fixes (potentially) abnormal cases. ** Lv Ying and XiuQi from Huawei also proposed to address similar problem[2][4]. Acknowledge to discussion with them. ## Steps to Reproduce This Problem To reproduce this problem: # STEP1: enable early kill mode #sysctl -w vm.memory_failure_early_kill=1 vm.memory_failure_early_kill = 1 # STEP2: inject an UCE error and consume it to trigger a synchronous error #einj_mem_uc single 0: single vaddr = 0xffffb0d75400 paddr = 4092d55b400 injecting ... triggering ... signal 7 code 5 addr 0xffffb0d75000 page not present Test passed The si_code (code 5) from einj_mem_uc indicates that it is BUS_MCEERR_AO error and it is not fact. After this patch set: # STEP1: enable early kill mode #sysctl -w vm.memory_failure_early_kill=1 vm.memory_failure_early_kill = 1 # STEP2: inject an UCE error and consume it to trigger a synchronous error #einj_mem_uc single 0: single vaddr = 0xffffb0d75400 paddr = 4092d55b400 injecting ... triggering ... signal 7 code 4 addr 0xffffb0d75000 page not present Test passed The si_code (code 4) from einj_mem_uc indicates that it is BUS_MCEERR_AR error as we expected. [1] Add ARMv8 RAS virtualization support in QEMU https://patchew.org/QEMU/20200512030609.19593-1-gengdongjiu@xxxxxxxxxx/ [2] https://lore.kernel.org/lkml/20221205115111.131568-3-lvying6@xxxxxxxxxx/ [3] https://lkml.kernel.org/r/20220914064935.7851-1-xueshuai@xxxxxxxxxxxxxxxxx [4] https://lore.kernel.org/lkml/20221209095407.383211-1-lvying6@xxxxxxxxxx/ Shuai Xue (2): ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on synchronous events ACPI: APEI: handle synchronous exceptions in task work arch/x86/kernel/cpu/mce/core.c | 9 +-- drivers/acpi/apei/ghes.c | 113 ++++++++++++++++++++++----------- include/acpi/ghes.h | 3 - include/linux/mm.h | 1 - mm/memory-failure.c | 22 ++----- 5 files changed, 82 insertions(+), 66 deletions(-) -- 2.39.3