Re: [PATCH v11 1/3] ACPI: APEI: send SIGBUS to current task if synchronous memory error not recovered

On Fri, 23 Feb 2024 12:08:13 +0000
Jonathan Cameron <Jonathan.Cameron@xxxxxxxxxx> wrote:

> On Thu, 22 Feb 2024 21:26:43 -0800
> Dan Williams <dan.j.williams@xxxxxxxxx> wrote:
> 
> > Shuai Xue wrote:  
> > > 
> > > 
> > > On 2024/2/19 17:25, Borislav Petkov wrote:    
> > > > On Sun, Feb 04, 2024 at 04:01:42PM +0800, Shuai Xue wrote:    
> > > >> Synchronous error was detected as a result of user-space process accessing
> > > >> a 2-bit uncorrected error. The CPU will take a synchronous error exception
> > > >> such as Synchronous External Abort (SEA) on Arm64. The kernel will queue a
> > > >> memory_failure() work which poisons the related page, unmaps the page, and
> > > >> then sends a SIGBUS to the process, so that a system wide panic can be
> > > >> avoided.
> > > >>
> > > >> However, no memory_failure() work will be queued when abnormal synchronous
> > > >> errors occur. These errors can include situations such as invalid PA,
> > > >> unexpected severity, no memory failure config support, invalid GUID
> > > >> section, etc. In such case, the user-space process will trigger SEA again.
> > > >> This loop can potentially exceed the platform firmware threshold or even
> > > >> trigger a kernel hard lockup, leading to a system reboot.
> > > >>
> > > >> Fix it by performing a force kill if no memory_failure() work is queued
> > > >> for synchronous errors.
> > > >>
> > > >> Signed-off-by: Shuai Xue <xueshuai@xxxxxxxxxxxxxxxxx>
> > > >> ---
> > > >>  drivers/acpi/apei/ghes.c | 9 +++++++++
> > > >>  1 file changed, 9 insertions(+)
> > > >>
> > > >> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> > > >> index 7b7c605166e0..0892550732d4 100644
> > > >> --- a/drivers/acpi/apei/ghes.c
> > > >> +++ b/drivers/acpi/apei/ghes.c
> > > >> @@ -806,6 +806,15 @@ static bool ghes_do_proc(struct ghes *ghes,
> > > >>  		}
> > > >>  	}
> > > >>  
> > > >> +	/*
> > > >> +	 * If no memory failure work is queued for abnormal synchronous
> > > >> +	 * errors, do a force kill.
> > > >> +	 */
> > > >> +	if (sync && !queued) {
> > > >> +		pr_err("Sending SIGBUS to current task due to memory error not recovered");
> > > >> +		force_sig(SIGBUS);
> > > >> +	}    
> > > > 
> > > > Except that there are a bunch of CXL GUIDs being handled there too and
> > > > this will sigbus those processes now automatically.    
> > > 
> > > Before the CXL GUIDs added, @Tony confirmed that the HEST notifications are always
> > > asynchronous on x86 platform, so only Synchronous External Abort (SEA) on ARM is
> > > delivered as a synchronous notification.
> > > 
> > > Will the CXL component trigger synchronous events for which we need to terminate the
> > > current process by sending sigbus to process?    
> > 
> > None of the CXL component errors should be handled as synchronous
> > events. They are either asynchronous protocol errors, or effectively
> > equivalent to CPER_SEC_PLATFORM_MEM notifications.  
> 
> Not a good example, CPER_SEC_PLATFORM_MEM is sometimes signaled via SEA.
> 

Premature send. :(

One example I can point at is how we signal memory errors
detected by the host into a VM on arm64:
https://elixir.bootlin.com/qemu/latest/source/hw/acpi/ghes.c#L391
That path delivers CPER_SEC_PLATFORM_MEM via ARM Synchronous External
Abort (SEA).

Right now we've only used async signalling in QEMU for the proposed
CXL error CPER records, but your point that they are similar to
CPER_SEC_PLATFORM_MEM is valid, so 'maybe' they will be synchronous
on some physical systems, as that's one viable way to provide rich
information for synchronous reception of poison.
For the VM case, my assumption today is that we don't care about
providing the VM with rich data, so CPER_SEC_PLATFORM_MEM is fine as a
path for errors, whether or not they originate from CXL CPER records.
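
To make the concern concrete, here is a rough user-space model of the
check the patch adds. It is only a sketch, not kernel code: the enum,
the struct and the model_* helpers are hypothetical names made up for
illustration, and the rule that only a platform memory section with a
valid PA queues memory_failure() work is a simplifying assumption (the
real ghes_do_proc() matches on section GUIDs and has more cases).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical section kinds; the kernel matches on CPER section GUIDs. */
enum sec_kind {
	SEC_PLATFORM_MEM,	/* models CPER_SEC_PLATFORM_MEM */
	SEC_CXL_COMPONENT,	/* models a CXL component error record */
	SEC_UNKNOWN,		/* models an unrecognised GUID */
};

/* Hypothetical, simplified view of one error section. */
struct model_section {
	enum sec_kind kind;
	bool pa_valid;		/* is a usable physical address reported? */
};

/*
 * Models whether memory_failure() work would be queued for a section.
 * Assumption: only platform memory sections with a valid PA queue work.
 */
static bool model_queues_work(const struct model_section *sec)
{
	assert(sec != NULL);
	return sec->kind == SEC_PLATFORM_MEM && sec->pa_valid;
}

/*
 * Models the patch's final check: SIGBUS the current task if and only if
 * the notification was synchronous and no section queued recovery work.
 */
static bool model_should_force_sigbus(bool sync,
				      const struct model_section *secs,
				      size_t n)
{
	bool queued = false;

	for (size_t i = 0; i < n; i++)
		queued |= model_queues_work(&secs[i]);

	return sync && !queued;
}
```

In this model a CXL record only leads to a SIGBUS if it arrives via a
synchronous notification, which is the case in question above. An
asynchronous CXL record never does, and a synchronous
CPER_SEC_PLATFORM_MEM with a valid PA goes down the normal
memory_failure() path instead.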

Jonathan




