[RFC 2/2] ACPI: APEI: fix reboot caused by synchronous error loop because of memory_failure() failed

Lv Ying <lvying6@xxxxxxxxxx> · Mon, 5 Dec 2022 19:51:11 +0800

Synchronous error was detected as a result of user-space accessing a
corrupt memory location the CPU may take an abort instead. On arm64 this
is a 'synchronous external abort' which can be notified by SEA.

If memory_failure() failed, we return to user-space will trigger SEA again,
such loop may cause platform firmware to exceed some threshold and reboot
when Linux could have recovered from this error. Not all memory_failure()
processing failures will cause the reboot, VM_FAULT_HWPOISON[_LARGE]
handling in arm64 page fault will send SIGBUS signal to the user-space
accessing process to terminate this loop.

If process mapping fault page, but memory_failure() abnormal return before
try_to_unmap(), for example, the fault page process mapping is KSM page.
In this case, arm64 cannot use the page fault process to terminate the
loop.

Add judgement of memory_failure() result in task_work before returning to
user-space. If memory_failure() failed, send SIGBUS signal to the current
process to avoid SEA loop.

Signed-off-by: Lv Ying <lvying6@xxxxxxxxxx>
---
 mm/memory-failure.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 3b6ac3694b8d..4c1c558f7161 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -2266,7 +2266,11 @@ static void __memory_failure_work_func(struct work_struct *work, bool sync)
 			break;
 		if (entry.flags & MF_SOFT_OFFLINE)
 			soft_offline_page(entry.pfn, entry.flags);
-		else if (!sync || (entry.flags & MF_ACTION_REQUIRED))
+		else if (sync) {
+			if ((entry.flags & MF_ACTION_REQUIRED) &&
+					memory_failure(entry.pfn, entry.flags))
+				force_sig_mceerr(BUS_MCEERR_AR, 0, 0);
+		} else
 			memory_failure(entry.pfn, entry.flags);
 	}
 }
-- 
2.33.0