On Wed, Oct 30, 2024 at 09:54:00AM +0800, Shuai Xue wrote: > > > 在 2024/10/30 04:48, Yazen Ghannam 写道: > > On Mon, Oct 28, 2024 at 04:11:40PM +0800, Shuai Xue wrote: > > > Synchronous error was detected as a result of user-space process accessing > > > a 2-bit uncorrected error. The CPU will take a synchronous error exception > > > such as Synchronous External Abort (SEA) on Arm64. The kernel will queue a > > > memory_failure() work which poisons the related page, unmaps the page, and > > > then sends a SIGBUS to the process, so that a system wide panic can be > > > avoided. > > > > > > However, no memory_failure() work will be queued when abnormal synchronous > > > errors occur. These errors can include situations such as invalid PA, > > > unexpected severity, no memory failure config support, invalid GUID > > > section, etc. In such case, the user-space process will trigger SEA again. > > > This loop can potentially exceed the platform firmware threshold or even > > > trigger a kernel hard lockup, leading to a system reboot. > > > > > > Fix it by performing a force kill if no memory_failure() work is queued > > > for synchronous errors. > > > > > > Signed-off-by: Shuai Xue <xueshuai@xxxxxxxxxxxxxxxxx> > > > Reviewed-by: Jarkko Sakkinen <jarkko@xxxxxxxxxx> > > > Reviewed-by: Jonathan Cameron <Jonathan.Cameron@xxxxxxxxxx> > > > --- > > > drivers/acpi/apei/ghes.c | 10 ++++++++++ > > > 1 file changed, 10 insertions(+) > > > > > > diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c > > > index ada93cfde9ba..f2ee28c44d7a 100644 > > > --- a/drivers/acpi/apei/ghes.c > > > +++ b/drivers/acpi/apei/ghes.c > > > @@ -801,6 +801,16 @@ static bool ghes_do_proc(struct ghes *ghes, > > > } > > > } > > > + /* > > > + * If no memory failure work is queued for abnormal synchronous > > > + * errors, do a force kill. > > > + */ > > > + if (sync && !queued) { > > > + pr_err("%s:%d: hardware memory corruption (SIGBUS)\n", > > > + current->comm, task_pid_nr(current)); > > > > I think it would help to include the GHES_PFX to indicate where this > > message is coming from. The pr_fmt() macro could also be introduced > > instead. > > Yes, GHES_PFX is a effective prefix and will be consistent to other message > in GHES driver. Will add it in next version. > > What do you mean about pr_fmt()? This can be used to set a prefix for an entire section of code. The pr_*() macros will pick it up without needing to include a prefix for each call. This is described in "Documentation/core-api/printk-basics.rst". > > > > > Also, you may want to include the HW_ERR prefix. Not all kernel messages > > related to hardware errors have this prefix today. But maybe that should > > be changed so there is more consistent messaging. > > > > Do we really need a HW_ERR prefix? The other case which use HW_ERR prefix > are for hardware registers. The messages which send SIGBUS does > not include HW_ERR, e.g. in kill_proc(), kill_procs(). > > pr_err("%#lx: Sending SIGBUS to %s:%d due to hardware memory > corruption\n",... > pr_err("%#lx: forcibly killing %s:%d because of failure to unmap > corrupted page\n",... > > Correct, HW_ERR isn't used there. My interpretation is that it can be used whenever an event is due to a hardware error (real or simulated). This is a very clear message to a user. It may be redundant in some cases (like here where the message already says "hardware memory corruption"). But I think it would be go to use it anyway for consistency. I think other relevant places in the kernel should also be updated. But that is beyond this patch, and I don't expect it to be done here and now. Thanks, Yazen