On Fri, Oct 21, 2022 at 01:01:19PM -0700, Tony Luck wrote: > If the kernel is copying a page as the result of a copy-on-write > fault and runs into an uncorrectable error, Linux will crash because > it does not have recovery code for this case where poison is consumed > by the kernel. > > It is easy to set up a test case. Just inject an error into a private > page, fork(2), and have the child process write to the page. > > I wrapped that neatly into a test at: > > git://git.kernel.org/pub/scm/linux/kernel/git/aegl/ras-tools.git > > just enable ACPI error injection and run: > > # ./einj_mem-uc -f copy-on-write > > Add a new copy_user_highpage_mc() function that uses copy_mc_to_kernel() > on architectures where that is available (currently x86 and powerpc). > When an error is detected during the page copy, return VM_FAULT_HWPOISON > to caller of wp_page_copy(). This propagates up the call stack. Both x86 > and powerpc have code in their fault handler to deal with this code by > sending a SIGBUS to the application. > > Note that this patch avoids a system crash and signals the process that > triggered the copy-on-write action. It does not take any action for the > memory error that is still in the shared page. To handle that a call to > memory_failure() is needed. But this cannot be done from wp_page_copy() > because it holds mmap_lock(). Perhaps the architecture fault handlers > can deal with this loose end in a subsequent patch? > > On Intel/x86 this loose end will often be handled automatically because > the memory controller provides an additional notification of the h/w > poison in memory, the handler for this will call memory_failure(). This > isn't a 100% solution. If there are multiple errors, not all may be > logged in this way. > > Reviewed-by: Dan Williams <dan.j.williams@xxxxxxxxx> > Signed-off-by: Tony Luck <tony.luck@xxxxxxxxx> Thank you for the update. Looks good to me. Reviewed-by: Naoya Horiguchi <naoya.horiguchi@xxxxxxx>