On Thu, Apr 20, 2023 at 11:05:12PM +0800, Kefeng Wang wrote: > > > On 2023/4/20 10:59, Kefeng Wang wrote: > > > > > > On 2023/4/20 10:03, Jane Chu wrote: > > > > > > On 4/19/2023 5:03 AM, Kefeng Wang wrote: > > > > > > > > > > > > On 2023/4/19 15:25, HORIGUCHI NAOYA(堀口 直也) wrote: > > > > > On Tue, Apr 18, 2023 at 05:45:06PM +0800, Kefeng Wang wrote: > > > > > > > > > > > > > > ... > > > > > > > > @@ -371,6 +372,14 @@ size_t > > > > > > > > _copy_mc_to_iter(const void *addr, size_t bytes, > > > > > > > > struct iov_iter *i) > > > > > > > > EXPORT_SYMBOL_GPL(_copy_mc_to_iter); > > > > > > > > #endif /* CONFIG_ARCH_HAS_COPY_MC */ > > > > > > > > +static void *memcpy_from_iter(struct iov_iter > > > > > > > > *i, void *to, const void *from, > > > > > > > > + size_t size) > > > > > > > > +{ > > > > > > > > + if (iov_iter_is_copy_mc(i)) > > > > > > > > + return (void *)copy_mc_to_kernel(to, from, size); > > > > > > > > > > > > > > Is it helpful to call memory_failure_queue() if > > > > > > > copy_mc_to_kernel() fails > > > > > > > due to a memory error? > > > > > > > > > > > > For dump_user_range(), the task is dying, if copy incomplete size, the > > > > > > coredump will fail and task will exit, also memory_failure will > > > > > > be called by kill_me_maybe(), > > > > > > > > > > > > CPU: 0 PID: 1418 Comm: test Tainted: G M > > > > > > 6.3.0-rc5 #29 > > > > > > Call Trace: > > > > > > <TASK> > > > > > > dump_stack_lvl+0x37/0x50 > > > > > > memory_failure+0x51/0x970 > > > > > > kill_me_maybe+0x5b/0xc0 > > > > > > task_work_run+0x5a/0x90 > > > > > > exit_to_user_mode_prepare+0x194/0x1a0 > > > > > > irqentry_exit_to_user_mode+0x9/0x30 > > > > > > noist_exc_machine_check+0x40/0x80 > > > > > > asm_exc_machine_check+0x33/0x40 > > > > > > > > > > Is this call trace printed out when copy_mc_to_kernel() > > > > > failed by finding > > > > > a memory error (or in some testcase using error injection)? > > > > > > > > I add dump_stack() into memory_failure() to check whether the poisoned > > > > memory is called or not, and the call trace shows it do call > > > > memory_failure(), but I get confused when do the test. > > > > > > > > > In my understanding, an MCE should not be triggered when > > > > > MC-safe copy tries > > > > > to access to a memory error. So I feel that we might be talking about > > > > > different scenarios. > > > > > > > > > > When I questioned previously, I thought about the following scenario: > > > > > > > > > > - a process terminates abnormally for any reason like > > > > > segmentation fault, > > > > > - then, kernel tries to create a coredump, > > > > > - during this, the copying routine accesses to corrupted > > > > > page to read. > > > > > > > > > Yes, we tested like your described, > > > > > > > > 1) inject memory error into a process > > > > 2) send a SIGABT/SIGBUS to process to trigger the coredump > > > > > > > > Without patch, the system panic, and with patch only process exits. > > > > > > > > > In this case the corrupted page should not be handled by > > > > > memory_failure() > > > > > yet (because otherwise properly handled hwpoisoned page > > > > > should be ignored > > > > > by coredump process). The coredump process would exit with > > > > > failure with > > > > > your patch, but then, the corrupted page is still left > > > > > unhandled and can > > > > > be reused, so any other thread can easily access to it again. > > > > > > > > As shown above, the corrupted page will be handled by > > > > memory_failure(), but what I'm wondering, > > > > 1) memory_failure() is not always called > > > > 2) look at the above call trace, it looks like from asynchronous > > > > interrupt, not from synchronous exception, right? > > > > > > > > > > > > > > You can find a few other places (like __wp_page_copy_user > > > > > and ksm_might_need_to_copy) > > > > > to call memory_failure_queue() to cope with such unhandled error pages. > > > > > So does memcpy_from_iter() do the same? > > > > > > > > I add some debug print in do_machine_check() on x86: > > > > > > > > 1) COW, > > > > m.kflags: MCE_IN_KERNEL_RECOV > > > > fixup_type: EX_TYPE_DEFAULT_MCE_SAFE > > > > > > > > CPU: 11 PID: 2038 Comm: einj_mem_uc > > > > Call Trace: > > > > <#MC> > > > > dump_stack_lvl+0x37/0x50 > > > > do_machine_check+0x7ad/0x840 > > > > exc_machine_check+0x5a/0x90 > > > > asm_exc_machine_check+0x1e/0x40 > > > > RIP: 0010:copy_mc_fragile+0x35/0x62 > > > > > > > > if (m.kflags & MCE_IN_KERNEL_RECOV) { > > > > if (!fixup_exception(regs, X86_TRAP_MC, 0, 0)) > > > > mce_panic("Failed kernel mode recovery", &m, msg); > > > > } > > > > > > > > if (m.kflags & MCE_IN_KERNEL_COPYIN) > > > > queue_task_work(&m, msg, kill_me_never); > > > > > > > > There is no memory_failure() called when > > > > EX_TYPE_DEFAULT_MCE_SAFE, also EX_TYPE_FAULT_MCE_SAFE too, > > > > so we manually add a memory_failure_queue() to handle with > > > > the poisoned page. > > > > > > > > 2) Coredump, nothing print about m.kflags and fixup_type, > > Sorry,I forget to set coredump file size :( > > The coredump do trigger the do_machine_check() with same m.kflags and > fixup_type like cow > > > > > > with above check, add a memory_failure_queue() or memory_failure() seems > > > > to be needed for memcpy_from_iter(), but it is totally different from > > > > the COW scenario > > > > > > so the memcpy_from_iter() from coredump is same as cow scenario. Okay, thank you for confirmation. > > > > > > > > > Another question, other copy_mc_to_kernel() callers, eg, > > > > nvdimm/dm-writecache/dax, there are not call memory_failure_queue(), > > > > should they need a memory_failure_queue(), if so, why not add it into > > > > do_machine_check() ? > > > > > > > What I mean is that EX_TYPE_DEFAULT_MCE_SAFE/EX_TYPE_FAULT_MCE_SAFE > > is designed to identify fixups which allow in kernel #MC recovery, > > that is, the caller of copy_mc_to_kernel() must know the source > > is a user address, so we could add a MCE_IN_KERNEL_COPYIN fro > > the MCE_SAFE type. > > And I think we need the following change for MCE_SAFE copy to set > MCE_IN_KERNEL_COPYIN. > > > > > diff --git a/arch/x86/kernel/cpu/mce/severity.c > > b/arch/x86/kernel/cpu/mce/severity.c > > index c4477162c07d..63e94484c5d6 100644 > > --- a/arch/x86/kernel/cpu/mce/severity.c > > +++ b/arch/x86/kernel/cpu/mce/severity.c > > @@ -293,12 +293,11 @@ static noinstr int error_context(struct mce *m, > > struct pt_regs *regs) > > case EX_TYPE_COPY: > > if (!copy_user) > > return IN_KERNEL; > > - m->kflags |= MCE_IN_KERNEL_COPYIN; This change seems to not related to what you try to fix. Could this break some other workloads like copying from user address? > > fallthrough; > > > > case EX_TYPE_FAULT_MCE_SAFE: > > case EX_TYPE_DEFAULT_MCE_SAFE: > > - m->kflags |= MCE_IN_KERNEL_RECOV; > > + m->kflags |= MCE_IN_KERNEL_RECOV | MCE_IN_KERNEL_COPYIN; > > return IN_KERNEL_RECOV; > > > > default: > > > > then we could drop memory_failure_queue(pfn, flags) from cow/ksm copy, > > or every Machine Check safe memory copy will need a memory_failure_xx() > > call. > > which help use to kill unneeded memory_failure_queue() call, any comments? I'm not 100% sure that we can safely use queue_task_work() instead of memory_failure_queue() (due to the difference between workqueue and task work, which should be recently discussed in thread [1]). So I prefer to keep the approach of memory_failure_queue() to keep the impact minimum. [1] https://lore.kernel.org/lkml/20230417011407.58319-1-xueshuai@xxxxxxxxxxxxxxxxx/T/#u Thanks, Naoya Horiguchi > > > > > > +Thomas,who add the two types, could you share some comments about > > this,thanks. > > > > > In the dax case, if the source address is poisoned, and we do follow > > > up with memory_failure_queue(pfn, flags), what should the value of > > > the 'flags' be ? > > > > With above diff change, we don't add a memory_failure_queue() into dax too.