Shuai Xue wrote:
>
>
> On 2024/2/19 17:25, Borislav Petkov wrote:
> > On Sun, Feb 04, 2024 at 04:01:42PM +0800, Shuai Xue wrote:
> >> Synchronous error was detected as a result of user-space process accessing
> >> a 2-bit uncorrected error. The CPU will take a synchronous error exception
> >> such as Synchronous External Abort (SEA) on Arm64. The kernel will queue a
> >> memory_failure() work which poisons the related page, unmaps the page, and
> >> then sends a SIGBUS to the process, so that a system wide panic can be
> >> avoided.
> >>
> >> However, no memory_failure() work will be queued when abnormal synchronous
> >> errors occur. These errors can include situations such as invalid PA,
> >> unexpected severity, no memory failure config support, invalid GUID
> >> section, etc. In such case, the user-space process will trigger SEA again.
> >> This loop can potentially exceed the platform firmware threshold or even
> >> trigger a kernel hard lockup, leading to a system reboot.
> >>
> >> Fix it by performing a force kill if no memory_failure() work is queued
> >> for synchronous errors.
> >>
> >> Signed-off-by: Shuai Xue <xueshuai@xxxxxxxxxxxxxxxxx>
> >> ---
> >>  drivers/acpi/apei/ghes.c | 9 +++++++++
> >>  1 file changed, 9 insertions(+)
> >>
> >> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> >> index 7b7c605166e0..0892550732d4 100644
> >> --- a/drivers/acpi/apei/ghes.c
> >> +++ b/drivers/acpi/apei/ghes.c
> >> @@ -806,6 +806,15 @@ static bool ghes_do_proc(struct ghes *ghes,
> >>  		}
> >>  	}
> >>
> >> +	/*
> >> +	 * If no memory failure work is queued for abnormal synchronous
> >> +	 * errors, do a force kill.
> >> +	 */
> >> +	if (sync && !queued) {
> >> +		pr_err("Sending SIGBUS to current task due to memory error not recovered");
> >> +		force_sig(SIGBUS);
> >> +	}
> >
> > Except that there are a bunch of CXL GUIDs being handled there too and
> > this will sigbus those processes now automatically.
>
> Before the CXL GUIDs added, @Tony confirmed that the HEST notifications are always
> asynchronous on x86 platform, so only Synchronous External Abort (SEA) on ARM is
> delivered as a synchronous notification.
>
> Will the CXL component trigger synchronous events for which we need to terminate the
> current process by sending sigbus to process?

None of the CXL component errors should be handled as synchronous events.
They are either asynchronous protocol errors, or effectively equivalent to
CPER_SEC_PLATFORM_MEM notifications.
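
To make the condition under discussion concrete, here is a rough user-space
model of the "sync && !queued" decision. It is only a sketch, not the
ghes.c code: the section kinds, the struct, and the model_* helpers are
hypothetical names invented for illustration, and treating CXL records as
never queueing recovery work is a simplifying assumption.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical section kinds; the real driver matches CPER section GUIDs. */
enum section_kind {
	SEC_PLATFORM_MEM,	/* platform memory error section */
	SEC_CXL_EVENT,		/* CXL component event (assumed asynchronous) */
	SEC_UNKNOWN,
};

/* Hypothetical, heavily reduced view of one error section. */
struct section {
	enum section_kind kind;
	bool valid_pa;		/* record carries a usable physical address */
};

/* Model only: recovery work is queued for memory sections with a valid PA. */
static bool model_queue_recovery(const struct section *s)
{
	return s->kind == SEC_PLATFORM_MEM && s->valid_pa;
}

/*
 * Model of the patch's check: force a SIGBUS only when the notification is
 * synchronous and no section resulted in queued recovery work.
 */
static bool model_needs_force_kill(bool sync, const struct section *secs,
				   size_t n)
{
	bool queued = false;
	size_t i;

	for (i = 0; i < n; i++)
		queued |= model_queue_recovery(&secs[i]);

	return sync && !queued;
}
```

In this model a CXL record only ends up in the force-kill branch if it is
delivered with sync set, which is the case raised above. If CXL component
errors are always reported asynchronously, sync is false for them and the
branch is not taken.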