On 2024/2/23 20:17, Jonathan Cameron wrote:
> On Fri, 23 Feb 2024 12:08:13 +0000
> Jonathan Cameron <Jonathan.Cameron@xxxxxxxxxx> wrote:
>
>> On Thu, 22 Feb 2024 21:26:43 -0800
>> Dan Williams <dan.j.williams@xxxxxxxxx> wrote:
>>
>>> Shuai Xue wrote:
>>>>
>>>>
>>>> On 2024/2/19 17:25, Borislav Petkov wrote:
>>>>> On Sun, Feb 04, 2024 at 04:01:42PM +0800, Shuai Xue wrote:
>>>>>> Synchronous error was detected as a result of user-space process accessing
>>>>>> a 2-bit uncorrected error. The CPU will take a synchronous error exception
>>>>>> such as Synchronous External Abort (SEA) on Arm64. The kernel will queue a
>>>>>> memory_failure() work which poisons the related page, unmaps the page, and
>>>>>> then sends a SIGBUS to the process, so that a system wide panic can be
>>>>>> avoided.
>>>>>>
>>>>>> However, no memory_failure() work will be queued when abnormal synchronous
>>>>>> errors occur. These errors can include situations such as invalid PA,
>>>>>> unexpected severity, no memory failure config support, invalid GUID
>>>>>> section, etc. In such case, the user-space process will trigger SEA again.
>>>>>> This loop can potentially exceed the platform firmware threshold or even
>>>>>> trigger a kernel hard lockup, leading to a system reboot.
>>>>>>
>>>>>> Fix it by performing a force kill if no memory_failure() work is queued
>>>>>> for synchronous errors.
>>>>>>
>>>>>> Signed-off-by: Shuai Xue <xueshuai@xxxxxxxxxxxxxxxxx>
>>>>>> ---
>>>>>>  drivers/acpi/apei/ghes.c | 9 +++++++++
>>>>>>  1 file changed, 9 insertions(+)
>>>>>>
>>>>>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
>>>>>> index 7b7c605166e0..0892550732d4 100644
>>>>>> --- a/drivers/acpi/apei/ghes.c
>>>>>> +++ b/drivers/acpi/apei/ghes.c
>>>>>> @@ -806,6 +806,15 @@ static bool ghes_do_proc(struct ghes *ghes,
>>>>>>  		}
>>>>>>  	}
>>>>>>
>>>>>> +	/*
>>>>>> +	 * If no memory failure work is queued for abnormal synchronous
>>>>>> +	 * errors, do a force kill.
>>>>>> +	 */
>>>>>> +	if (sync && !queued) {
>>>>>> +		pr_err("Sending SIGBUS to current task due to memory error not recovered");
>>>>>> +		force_sig(SIGBUS);
>>>>>> +	}
>>>>>
>>>>> Except that there are a bunch of CXL GUIDs being handled there too and
>>>>> this will sigbus those processes now automatically.
>>>>
>>>> Before the CXL GUIDs added, @Tony confirmed that the HEST notifications are always
>>>> asynchronous on x86 platform, so only Synchronous External Abort (SEA) on ARM is
>>>> delivered as a synchronous notification.
>>>>
>>>> Will the CXL component trigger synchronous events for which we need to terminate the
>>>> current process by sending sigbus to process?
>>>
>>> None of the CXL component errors should be handled as synchronous
>>> events. They are either asynchronous protocol errors, or effectively
>>> equivalent to CPER_SEC_PLATFORM_MEM notifications.
>>
>> Not a good example, CPER_SEC_PLATFORM_MEM is sometimes signaled via SEA.
>>
>
> Premature send.:(
>
> One example I can point at is how we do signaling of memory
> errors detected by the host into a VM on arm64.
> https://elixir.bootlin.com/qemu/latest/source/hw/acpi/ghes.c#L391
> CPER_SEC_PLATFORM_MEM via ARM Synchronous External Abort (SEA).
>
> Right now we've only used async in QEMU for proposed CXL error
> CPER records signalling but your reference to them being similar
> to CPER_SEC_PLATFORM_MEM is valid so 'maybe' they will be
> synchronous in some physical systems as it's one viable way to
> provide rich information for synchronous reception of poison.
> For the VM case my assumption today is we don't care about providing the
> VM with rich data, so CPER_SEC_PLATFORM_MEM is fine as a path for
> errors whether from CXL CPER records or not.
>
> Jonathan

Thank you for your confirmation and explanation. So I think the condition:

- `sync` for synchronous event,
- `!queued` for CPER_SEC_PLATFORM_MEM notifications which do not handle memory failures.

is fine.

@Borislav, do you have any other concerns?

Best Regards,
Shuai
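
P.S. For reference, the condition discussed above can be written out as a small user-space model. This is only a sketch of the `sync && !queued` decision, not the kernel code: `should_force_kill()` is a hypothetical helper name that does not exist in drivers/acpi/apei/ghes.c, and the two booleans stand in for the locals of the same name in the quoted hunk.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * User-space model of the check added by the quoted hunk.
 * should_force_kill() is a hypothetical name used for illustration only.
 *
 * sync:   the error arrived through a synchronous notification
 *         (e.g. SEA on arm64).
 * queued: some section handler queued memory_failure() work.
 */
bool should_force_kill(bool sync, bool queued)
{
	/*
	 * Only a synchronous error with no recovery work queued would
	 * re-trigger the exception when the task resumes, so only that
	 * combination is force-killed.
	 */
	return sync && !queued;
}
```

Under this model, an asynchronous notification never results in a SIGBUS from this path, whether or not recovery work was queued, which is the case the CXL question above is about.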