Synchronous error was detected as a result of user-space process accessing a 2-bit uncorrected error. The CPU will take a synchronous error exception such as Synchronous External Abort (SEA) on Arm64. The kernel will queue a memory_failure() work which poisons the related page, unmaps the page, and then sends a SIGBUS to the process, so that a system wide panic can be avoided. However, no memory_failure() work will be queued unless all bellow preconditions check passed: - `if (!(mem_err->validation_bits & CPER_MEM_VALID_PA))` in ghes_handle_memory_failure() - `if (flags == -1)` in ghes_handle_memory_failure() - `if (!IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))` in ghes_do_memory_failure() - `if (!pfn_valid(pfn) && !arch_is_platform_page(physical_addr)) ` in ghes_do_memory_failure() In such case, the user-space process will trigger SEA again. This loop can potentially exceed the platform firmware threshold or even trigger a kernel hard lockup, leading to a system reboot. Fix it by performing a force kill if no memory_failure() work is queued for synchronous errors. Suggested-by: Xiaofei Tan <tanxiaofei@xxxxxxxxxx> Signed-off-by: Shuai Xue <xueshuai@xxxxxxxxxxxxxxxxx> --- drivers/acpi/apei/ghes.c | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c index 623cc0cb4a65..b0b20ee533d9 100644 --- a/drivers/acpi/apei/ghes.c +++ b/drivers/acpi/apei/ghes.c @@ -801,6 +801,16 @@ static bool ghes_do_proc(struct ghes *ghes, } } + /* + * If no memory failure work is queued for abnormal synchronous + * errors, do a force kill. + */ + if (sync && !queued) { + pr_err("Sending SIGBUS to %s:%d due to hardware memory corruption\n", + current->comm, task_pid_nr(current)); + force_sig(SIGBUS); + } + return queued; } -- 2.39.3