On 2023/3/16 PM3:21, HORIGUCHI NAOYA(堀口 直也) wrote: > On Mon, Feb 27, 2023 at 01:03:15PM +0800, Shuai Xue wrote: >> Hardware errors could be signaled by synchronous interrupt, e.g. when an >> error is detected by a background scrubber, or signaled by synchronous >> exception, e.g. when an uncorrected error is consumed. Both synchronous and >> asynchronous error are queued and handled by a dedicated kthread in >> workqueue. >> >> commit 7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure() queue for >> synchronous errors") keep track of whether memory_failure() work was >> queued, and make task_work pending to flush out the workqueue so that the >> work for synchronous error is processed before returning to user-space. >> The trick ensures that the corrupted page is unmapped and poisoned. And >> after returning to user-space, the task starts at current instruction which >> triggering a page fault in which kernel will send SIGBUS to current process >> due to VM_FAULT_HWPOISON. >> >> However, the memory failure recovery for hwpoison-aware mechanisms does not >> work as expected. For example, hwpoison-aware user-space processes like >> QEMU register their customized SIGBUS handler and enable early kill mode by >> seting PF_MCE_EARLY at initialization. Then the kernel will directy notify >> the process by sending a SIGBUS signal in memory failure with wrong >> si_code: the actual user-space process accessing the corrupt memory >> location, but its memory failure work is handled in a kthread context, so >> it will send SIGBUS with BUS_MCEERR_AO si_code to the actual user-space >> process instead of BUS_MCEERR_AR in kill_proc(). >> >> To this end, separate synchronous and asynchronous error handling into >> different paths like X86 platform does: >> >> - task work for synchronous errors. >> - and workqueue for asynchronous errors. >> >> Then for synchronous errors, the current context in memory failure is >> exactly belongs to the task consuming poison data and it will send SIBBUS >> with proper si_code. >> >> Fixes: 7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure() queue for synchronous errors") >> Signed-off-by: Shuai Xue <xueshuai@xxxxxxxxxxxxxxxxx> > ... >> >> /* >> - * Called as task_work before returning to user-space. >> - * Ensure any queued work has been done before we return to the context that >> - * triggered the notification. >> + * struct mce_task_work - for synchronous RAS event > > This seems to handle synchronous memory errors, not limited to MCE? > So naming this struct as such (more generally) might be better. Yes. How about `sync_task_work`? > >> + * >> + * @twork: callback_head for task work >> + * @pfn: page frame number of corrupted page >> + * @flags: fine tune action taken >> + * >> + * Structure to pass task work to be handled before >> + * ret_to_user via task_work_add(). >> */ > ... > >> } >> >> -static bool ghes_do_memory_failure(u64 physical_addr, int flags) >> +static void ghes_do_memory_failure(u64 physical_addr, int flags) >> { >> unsigned long pfn; >> + struct mce_task_work *twcb; >> >> if (!IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE)) >> - return false; >> + return; >> >> pfn = PHYS_PFN(physical_addr); >> if (!pfn_valid(pfn) && !arch_is_platform_page(physical_addr)) { >> pr_warn_ratelimited(FW_WARN GHES_PFX >> "Invalid address in generic error data: %#llx\n", >> physical_addr); >> - return false; >> + return; >> + } >> + >> + if (flags == MF_ACTION_REQUIRED && current->mm) { >> + twcb = kmalloc(sizeof(*twcb), GFP_ATOMIC); >> + if (!twcb) >> + return; > > When this kmalloc() fails, the error event might be silently dropped? > If so, some warning messages could be helpful. Yes, I was going to add a warning messages like: pr_err("Failed to handle memory failure due to out of memory\n"); But got a warning about patch when checked by checkpatch.pl. WARNING: Possible unnecessary 'out of memory' message I will add it back in next version :) > > Thanks, > Naoya Horiguchi Thank you for your comments. Cheer, Shuai > >> + >> + twcb->pfn = pfn; >> + twcb->flags = flags; >> + init_task_work(&twcb->twork, memory_failure_cb); >> + task_work_add(current, &twcb->twork, TWA_RESUME); >> + return; >> } >> >> memory_failure_queue(pfn, flags);