On Mon, Jul 3, 2023 at 10:39 PM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote:
>
> On Mon, Jul 3, 2023 at 8:30 PM David Hildenbrand <david@xxxxxxxxxx> wrote:
> >
> > On 03.07.23 20:21, Suren Baghdasaryan wrote:
> > > A memory corruption was reported in [1] with bisection pointing to the
> > > patch [2] enabling per-VMA locks for x86.
> > > Disable per-VMA locks config to prevent this issue while the problem is
> > > being investigated. This is expected to be a temporary measure.
> > >
> > > [1] https://bugzilla.kernel.org/show_bug.cgi?id=217624
> > > [2] https://lore.kernel.org/all/20230227173632.3292573-30-surenb@xxxxxxxxxx
> > >
> > > Reported-by: Jiri Slaby <jirislaby@xxxxxxxxxx>
> > > Reported-by: Jacob Young <jacobly.alt@xxxxxxxxx>
> > > Fixes: 0bff0aaea03e ("x86/mm: try VMA lock-based page fault handling first")
> > > Signed-off-by: Suren Baghdasaryan <surenb@xxxxxxxxxx>
> > > ---
> > >  mm/Kconfig | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/mm/Kconfig b/mm/Kconfig
> > > index 09130434e30d..de94b2497600 100644
> > > --- a/mm/Kconfig
> > > +++ b/mm/Kconfig
> > > @@ -1224,7 +1224,7 @@ config ARCH_SUPPORTS_PER_VMA_LOCK
> > >  	def_bool n
> > >
> > >  config PER_VMA_LOCK
> > > -	def_bool y
> > > +	bool "Enable per-vma locking during page fault handling."
> > >  	depends on ARCH_SUPPORTS_PER_VMA_LOCK && MMU && SMP
> > >  	help
> > >  	  Allow per-vma locking during page fault handling.
> >
> > As raised at LSF/MM, I was "surprised" that we can now handle page faults
> > concurrent to fork() and was expecting something to be broken already.
> >
> > What probably happens is that we wr-protected the page in the parent process
> > and COW-shared an anon page with the child using copy_present_pte().
> >
> > But we only flush the parent MM tlb before we drop the parent MM lock in
> > dup_mmap().
> >
> > If we get a write-fault before that TLB flush in the parent, and we end up
> > replacing that anon page in the parent process in do_wp_page() [because,
> > COW-shared with the child], this might be problematic: some stale writable
> > TLB entries can target the wrong (old) page.
>
> Hi David,
> Thanks for the detailed explanation. Let me check if this is indeed
> what's happening here. If that's indeed the cause, I think we can
> write-lock the VMAs being dup'ed until the TLB is flushed and
> mmap_write_unlock(oldmm) unlocks them all and lets page faults
> proceed. If that works we at least will know the reason for the
> memory corruption.

Yep, locking the VMAs being copied inside dup_mmap() seems to fix the issue:

        for_each_vma(old_vmi, mpnt) {
                struct file *file;

+               vma_start_write(mpnt);
                if (mpnt->vm_flags & VM_DONTCOPY) {
                        vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt));
                        continue;
                }

At least the reproducer at
https://bugzilla.kernel.org/show_bug.cgi?id=217624 is working now.
But I wonder if that's the best way to fix this. It's surely simple
but locking every VMA is not free and doing that on every fork might
regress performance.

> Thanks,
> Suren.
>
> >
> > We had similar issues in the past with userfaultfd, see the comment at the
> > beginning of do_wp_page():
> >
> >
> > 	if (likely(!unshare)) {
> > 		if (userfaultfd_pte_wp(vma, *vmf->pte)) {
> > 			pte_unmap_unlock(vmf->pte, vmf->ptl);
> > 			return handle_userfault(vmf, VM_UFFD_WP);
> > 		}
> >
> > 		/*
> > 		 * Userfaultfd write-protect can defer flushes. Ensure the TLB
> > 		 * is flushed in this case before copying.
> > 		 */
> > 		if (unlikely(userfaultfd_wp(vmf->vma) &&
> > 			     mm_tlb_flush_pending(vmf->vma->vm_mm)))
> > 			flush_tlb_page(vmf->vma, vmf->address);
> > 	}

If do_wp_page() could identify that vmf->vma is being copied, we could
simply return VM_FAULT_RETRY and retry the page fault under mmap_lock,
which would block until dup_mmap() is done... Maybe we could use
mm_tlb_flush_pending() for that? WDYT?
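A very rough sketch of that idea (untested, and only if dup_mmap() were
taught to advertise its deferred flush via inc_tlb_flush_pending(), which
it does not do today): bail out of the per-VMA-lock path when a flush is
pending, so the fault retries under mmap_lock and serializes against
dup_mmap():

```c
	/*
	 * Hypothetical sketch, not a tested patch: if this fault is being
	 * handled under a per-VMA lock and the mm has a TLB flush pending
	 * (e.g. fork() is still copying page tables), give up and let the
	 * fault be retried under mmap_lock, which dup_mmap() holds for
	 * write until after it flushes the parent TLB.
	 */
	if ((vmf->flags & FAULT_FLAG_VMA_LOCK) &&
	    mm_tlb_flush_pending(vmf->vma->vm_mm)) {
		pte_unmap_unlock(vmf->pte, vmf->ptl);
		return VM_FAULT_RETRY;
	}
```

Whether mm_tlb_flush_pending() is actually a reliable signal here, or too
coarse (it would also trip on mprotect()/madvise() flushes), is exactly the
open question.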
> >
> > We really should not allow page faults concurrent to fork() without
> > further investigation.
> >
> > --
> > Cheers,
> >
> > David / dhildenb
> >