On Mon, Jul 3, 2023 at 10:39 PM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote:
>
> On Mon, Jul 3, 2023 at 8:30 PM David Hildenbrand <david@xxxxxxxxxx> wrote:
> >
> > On 03.07.23 20:21, Suren Baghdasaryan wrote:
> > > A memory corruption was reported in [1] with bisection pointing to the
> > > patch [2] enabling per-VMA locks for x86.
> > > Disable per-VMA locks config to prevent this issue while the problem is
> > > being investigated. This is expected to be a temporary measure.
> > >
> > > [1] https://bugzilla.kernel.org/show_bug.cgi?id=217624
> > > [2] https://lore.kernel.org/all/20230227173632.3292573-30-surenb@xxxxxxxxxx
> > >
> > > Reported-by: Jiri Slaby <jirislaby@xxxxxxxxxx>
> > > Reported-by: Jacob Young <jacobly.alt@xxxxxxxxx>
> > > Fixes: 0bff0aaea03e ("x86/mm: try VMA lock-based page fault handling first")
> > > Signed-off-by: Suren Baghdasaryan <surenb@xxxxxxxxxx>
> > > ---
> > >  mm/Kconfig | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/mm/Kconfig b/mm/Kconfig
> > > index 09130434e30d..de94b2497600 100644
> > > --- a/mm/Kconfig
> > > +++ b/mm/Kconfig
> > > @@ -1224,7 +1224,7 @@ config ARCH_SUPPORTS_PER_VMA_LOCK
> > >  	def_bool n
> > >
> > >  config PER_VMA_LOCK
> > > -	def_bool y
> > > +	bool "Enable per-vma locking during page fault handling."
> > >  	depends on ARCH_SUPPORTS_PER_VMA_LOCK && MMU && SMP
> > >  	help
> > >  	  Allow per-vma locking during page fault handling.
> >
> > As raised at LSF/MM, I was "surprised" that we can now handle page faults
> > concurrent to fork() and was expecting something to be broken already.
> >
> > What probably happens is that we wr-protected the page in the parent process
> > and COW-shared an anon page with the child using copy_present_pte().
> >
> > But we only flush the parent MM tlb before we drop the parent MM lock in
> > dup_mmap().
> >
> > If we get a write-fault before that TLB flush in the parent, and we end up
> > replacing that anon page in the parent process in do_wp_page() [because,
> > COW-shared with the child], this might be problematic: some stale writable
> > TLB entries can target the wrong (old) page.
>
> Hi David,
> Thanks for the detailed explanation. Let me check if this is indeed
> what's happening here. If that's indeed the cause, I think we can
> write-lock the VMAs being dup'ed until the TLB is flushed and
> mmap_write_unlock(oldmm) unlocks them all and lets page faults
> proceed. If that works we at least will know the reason for the
> memory corruption.

Yep, locking the VMAs being copied inside dup_mmap() seems to fix the issue:

        for_each_vma(old_vmi, mpnt) {
                struct file *file;

+               vma_start_write(mpnt);
                if (mpnt->vm_flags & VM_DONTCOPY) {
                        vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt));
                        continue;
                }

At least the reproducer at
https://bugzilla.kernel.org/show_bug.cgi?id=217624 is working now.
But I wonder if that's the best way to fix this. It's surely simple
but locking every VMA is not free and doing that on every fork might
regress performance.

> Thanks,
> Suren.
>
> >
> > We had similar issues in the past with userfaultfd, see the comment at the
> > beginning of do_wp_page():
> >
> >
> > 	if (likely(!unshare)) {
> > 		if (userfaultfd_pte_wp(vma, *vmf->pte)) {
> > 			pte_unmap_unlock(vmf->pte, vmf->ptl);
> > 			return handle_userfault(vmf, VM_UFFD_WP);
> > 		}
> >
> > 		/*
> > 		 * Userfaultfd write-protect can defer flushes. Ensure the TLB
> > 		 * is flushed in this case before copying.
> > 		 */
> > 		if (unlikely(userfaultfd_wp(vmf->vma) &&
> > 			     mm_tlb_flush_pending(vmf->vma->vm_mm)))
> > 			flush_tlb_page(vmf->vma, vmf->address);
> > 	}

If do_wp_page() could identify that vmf->vma is being copied, we could
simply return VM_FAULT_RETRY and retry the page fault under mmap_lock,
which would block until dup_mmap() is done... Maybe we could use
mm_tlb_flush_pending() for that? WDYT?
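A very rough sketch of that idea (untested, and only if dup_mmap() were
taught to advertise its deferred flush via inc_tlb_flush_pending(), which
it does not do today): bail out of the per-VMA-lock path when a flush is
pending, so the fault retries under mmap_lock and serializes against
dup_mmap():

```c
	/*
	 * Hypothetical sketch, not a tested patch: if this fault is being
	 * handled under a per-VMA lock and the mm has a TLB flush pending
	 * (e.g. fork() is still copying page tables), give up and let the
	 * fault be retried under mmap_lock, which dup_mmap() holds for
	 * write until after it flushes the parent TLB.
	 */
	if ((vmf->flags & FAULT_FLAG_VMA_LOCK) &&
	    mm_tlb_flush_pending(vmf->vma->vm_mm)) {
		pte_unmap_unlock(vmf->pte, vmf->ptl);
		return VM_FAULT_RETRY;
	}
```

Whether mm_tlb_flush_pending() is actually a reliable signal here, or too
coarse (it would also trip on mprotect()/madvise() flushes), is exactly the
open question.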
> >
> > We really should not allow page faults concurrent to fork() without
> > further investigation.
> >
> > --
> > Cheers,
> >
> > David / dhildenb
> >