Re: [PATCH v9 0/2] ACPI: APEI: handle synchronous errors in task work with proper si_code

Shuai Xue <xueshuai@xxxxxxxxxxxxxxxxx> · Tue, 21 Nov 2023 09:48:28 +0800

Hi, ALL,

Gentle ping.

Best Regards,
Shuai

On 2023/10/7 15:28, Shuai Xue wrote:
> Hi, ALL,
> 
> I have rewritten the cover letter with the hope that the maintainer will truly
> understand the necessity of this patch. Both Alibaba and Huawei met the same
> issue in products, and we hope it could be fixed ASAP.
> 
> ## Changes Log
> 
> changes since v8:
> - remove the bug fix tag of patch 2 (per Jarkko Sakkinen)
> - remove the declaration of memory_failure_queue_kick (per Naoya Horiguchi)
> - rewrite the return value comments of memory_failure (per Naoya Horiguchi)
> 
> changes since v7:
> - rebase to Linux v6.6-rc2 (no code changed)
> - rewritten the cover letter to explain the motivation of this patchset
> 
> changes since v6:
> - add more explicty error message suggested by Xiaofei
> - pick up reviewed-by tag from Xiaofei
> - pick up internal reviewed-by tag from Baolin
> 
> changes since v5 by addressing comments from Kefeng:
> - document return value of memory_failure()
> - drop redundant comments in call site of memory_failure() 
> - make ghes_do_proc void and handle abnormal case within it
> - pick up reviewed-by tag from Kefeng Wang 
> 
> changes since v4 by addressing comments from Xiaofei:
> - do a force kill only for abnormal sync errors
> 
> changes since v3 by addressing comments from Xiaofei:
> - do a force kill for abnormal memory failure error such as invalid PA,
> unexpected severity, OOM, etc
> - pcik up tested-by tag from Ma Wupeng
> 
> changes since v2 by addressing comments from Naoya:
> - rename mce_task_work to sync_task_work
> - drop ACPI_HEST_NOTIFY_MCE case in is_hest_sync_notify()
> - add steps to reproduce this problem in cover letter
> 
> changes since v1:
> - synchronous events by notify type
> - Link: https://lore.kernel.org/lkml/20221206153354.92394-3-xueshuai@xxxxxxxxxxxxxxxxx/
> 
> 
> ## Cover Letter
> 
> There are two major types of uncorrected recoverable (UCR) errors :
> 
> - Action Required (AR): The error is detected and the processor already
>   consumes the memory. OS requires to take action (for example, offline
>   failure page/kill failure thread) to recover this error.
> 
> - Action Optional (AO): The error is detected out of processor execution
>   context. Some data in the memory are corrupted. But the data have not
>   been consumed. OS is optional to take action to recover this error.
> 
> The main difference between AR and AO errors is that AR errors are synchronous
> events, while AO errors are asynchronous events. Synchronous exceptions, such as
> Machine Check Exception (MCE) on X86 and Synchronous External Abort (SEA) on
> Arm64, are signaled by the hardware when an error is detected and the memory
> access has architecturally been executed.
> 
> Currently, both synchronous and asynchronous errors are queued as AO errors and
> handled by a dedicated kernel thread in a work queue on the ARM64 platform. For
> synchronous errors, memory_failure() is synced using a cancel_work_sync trick to
> ensure that the corrupted page is unmapped and poisoned. Upon returning to
> user-space, the process resumes at the current instruction, triggering a page
> fault. As a result, the kernel sends a SIGBUS signal to the current process due
> to VM_FAULT_HWPOISON.
> 
> However, this trick is not always be effective, this patch set improves the
> recovery process in three specific aspects:
> 
> 1. Handle synchronous exceptions with proper si_code
> 
> ghes_handle_memory_failure() queue both synchronous and asynchronous errors with
> flag=0. Then the kernel will notify the process by sending a SIGBUS signal in
> memory_failure() with wrong si_code: BUS_MCEERR_AO to the actual user-space
> process instead of BUS_MCEERR_AR. The user-space processes rely on the si_code
> to distinguish to handle memory failure.
> 
> For example, hwpoison-aware user-space processes use the si_code:
> BUS_MCEERR_AO for 'action optional' early notifications, and BUS_MCEERR_AR
> for 'action required' synchronous/late notifications. Specifically, when a
> signal with SIGBUS_MCEERR_AR is delivered to QEMU, it will inject a vSEA to
> Guest kernel. In contrast, a signal with SIGBUS_MCEERR_AO will be ignored
> by QEMU.[1]
> 
> Fix it by seting memory failure flags as MF_ACTION_REQUIRED on synchronous events. (PATCH 1)
> 
> 2. Handle memory_failure() abnormal fails to avoid a unnecessary reboot
> 
> If process mapping fault page, but memory_failure() abnormal return before
> try_to_unmap(), for example, the fault page process mapping is KSM page.
> In this case, arm64 cannot use the page fault process to terminate the
> synchronous exception loop.[4]
> 
> This loop can potentially exceed the platform firmware threshold or even trigger
> a kernel hard lockup, leading to a system reboot. However, kernel has the
> capability to recover from this error.
> 
> Fix it by performing a force kill when memory_failure() abnormal fails or when
> other abnormal synchronous errors occur. These errors can include situations
> such as invalid PA, unexpected severity, no memory failure config support,
> invalid GUID section, OOM, etc. (PATCH 2)
> 
> 3. Handle memory_failure() in current process context which consuming poison
> 
> When synchronous errors occur, memory_failure() assume that current process
> context is exactly that consuming poison synchronous error.
> 
> For example, kill_accessing_process() holds mmap locking of current->mm, does
> pagetable walk to find the error virtual address, and sends SIGBUS to the
> current process with error info. However, the mm of kworker is not valid,
> resulting in a null-pointer dereference. I have fixed this in[3].
> 
>     commit 77677cdbc2aa mm,hwpoison: check mm when killing accessing process
> 
> Another example is that collect_procs()/kill_procs() walk the task list, only
> collect and send sigbus to task which consuming poison. But memory_failure() is
> queued and handled by a dedicated kernel thread on arm64 platform.
> 
> Fix it by queuing memory_failure() as a task work which runs in current
> execution context to synchronously send SIGBUS before ret_to_user. (PATCH 2)
> 
> ** In summary, this patch set handles synchronous errors in task work with
> proper si_code so that hwpoison-aware process can recover from errors, and
> fixes (potentially) abnormal cases. **
> 
> Lv Ying and XiuQi from Huawei also proposed to address similar problem[2][4].
> Acknowledge to discussion with them.
> 
> ## Steps to Reproduce This Problem
> 
> To reproduce this problem:
> 
> 	# STEP1: enable early kill mode
> 	#sysctl -w vm.memory_failure_early_kill=1
> 	vm.memory_failure_early_kill = 1
> 
> 	# STEP2: inject an UCE error and consume it to trigger a synchronous error
> 	#einj_mem_uc single
> 	0: single   vaddr = 0xffffb0d75400 paddr = 4092d55b400
> 	injecting ...
> 	triggering ...
> 	signal 7 code 5 addr 0xffffb0d75000
> 	page not present
> 	Test passed
> 
> The si_code (code 5) from einj_mem_uc indicates that it is BUS_MCEERR_AO error
> and it is not fact.
> 
> After this patch set:
> 
> 	# STEP1: enable early kill mode
> 	#sysctl -w vm.memory_failure_early_kill=1
> 	vm.memory_failure_early_kill = 1
> 
> 	# STEP2: inject an UCE error and consume it to trigger a synchronous error
> 	#einj_mem_uc single
> 	0: single   vaddr = 0xffffb0d75400 paddr = 4092d55b400
> 	injecting ...
> 	triggering ...
> 	signal 7 code 4 addr 0xffffb0d75000
> 	page not present
> 	Test passed
> 
> The si_code (code 4) from einj_mem_uc indicates that it is BUS_MCEERR_AR error
> as we expected.
> 
> [1] Add ARMv8 RAS virtualization support in QEMU https://patchew.org/QEMU/20200512030609.19593-1-gengdongjiu@xxxxxxxxxx/
> [2] https://lore.kernel.org/lkml/20221205115111.131568-3-lvying6@xxxxxxxxxx/
> [3] https://lkml.kernel.org/r/20220914064935.7851-1-xueshuai@xxxxxxxxxxxxxxxxx
> [4] https://lore.kernel.org/lkml/20221209095407.383211-1-lvying6@xxxxxxxxxx/
> 
> Shuai Xue (2):
>   ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on
>     synchronous events
>   ACPI: APEI: handle synchronous exceptions in task work
> 
>  arch/x86/kernel/cpu/mce/core.c |   9 +--
>  drivers/acpi/apei/ghes.c       | 113 ++++++++++++++++++++++-----------
>  include/acpi/ghes.h            |   3 -
>  include/linux/mm.h             |   1 -
>  mm/memory-failure.c            |  22 ++-----
>  5 files changed, 82 insertions(+), 66 deletions(-)
>