On Tue, Jul 4, 2023 at 12:18 AM David Hildenbrand <david@xxxxxxxxxx> wrote:
>
> On 04.07.23 08:50, Suren Baghdasaryan wrote:
> > On Mon, Jul 3, 2023 at 10:39 PM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote:
> >>
> >> On Mon, Jul 3, 2023 at 8:30 PM David Hildenbrand <david@xxxxxxxxxx> wrote:
> >>>
> >>> On 03.07.23 20:21, Suren Baghdasaryan wrote:
> >>>> A memory corruption was reported in [1] with bisection pointing to the
> >>>> patch [2] enabling per-VMA locks for x86.
> >>>> Disable per-VMA locks config to prevent this issue while the problem is
> >>>> being investigated. This is expected to be a temporary measure.
> >>>>
> >>>> [1] https://bugzilla.kernel.org/show_bug.cgi?id=217624
> >>>> [2] https://lore.kernel.org/all/20230227173632.3292573-30-surenb@xxxxxxxxxx
> >>>>
> >>>> Reported-by: Jiri Slaby <jirislaby@xxxxxxxxxx>
> >>>> Reported-by: Jacob Young <jacobly.alt@xxxxxxxxx>
> >>>> Fixes: 0bff0aaea03e ("x86/mm: try VMA lock-based page fault handling first")
> >>>> Signed-off-by: Suren Baghdasaryan <surenb@xxxxxxxxxx>
> >>>> ---
> >>>>   mm/Kconfig | 2 +-
> >>>>   1 file changed, 1 insertion(+), 1 deletion(-)
> >>>>
> >>>> diff --git a/mm/Kconfig b/mm/Kconfig
> >>>> index 09130434e30d..de94b2497600 100644
> >>>> --- a/mm/Kconfig
> >>>> +++ b/mm/Kconfig
> >>>> @@ -1224,7 +1224,7 @@ config ARCH_SUPPORTS_PER_VMA_LOCK
> >>>>          def_bool n
> >>>>
> >>>>  config PER_VMA_LOCK
> >>>> -       def_bool y
> >>>> +       bool "Enable per-vma locking during page fault handling."
> >>>>          depends on ARCH_SUPPORTS_PER_VMA_LOCK && MMU && SMP
> >>>>          help
> >>>>            Allow per-vma locking during page fault handling.
> >>>
> >>> As raised at LSF/MM, I was "surprised" that we can now handle page faults
> >>> concurrently with fork() and was expecting something to be broken already.
> >>>
> >>> What probably happens is that we wr-protected the page in the parent
> >>> process and COW-shared an anon page with the child using
> >>> copy_present_pte().
> >>>
> >>> But we only flush the parent MM TLB before we drop the parent MM lock
> >>> in dup_mmap().
> >>>
> >>> If we get a write fault before that TLB flush in the parent, and we end
> >>> up replacing that anon page in the parent process in do_wp_page()
> >>> [because it is COW-shared with the child], this might be problematic:
> >>> some stale writable TLB entries can target the wrong (old) page.
> >>
> >> Hi David,
> >> Thanks for the detailed explanation. Let me check if this is indeed
> >> what's happening here. If that is indeed the cause, I think we can
> >> write-lock the VMAs being dup'ed until the TLB is flushed and
> >> mmap_write_unlock(oldmm) unlocks them all and lets page faults
> >> proceed. If that works, we will at least know the reason for the
> >> memory corruption.
> >
> > Yep, locking the VMAs being copied inside dup_mmap() seems to fix the issue:
> >
> >          for_each_vma(old_vmi, mpnt) {
> >                  struct file *file;
> >
> > +               vma_start_write(mpnt);
> >                  if (mpnt->vm_flags & VM_DONTCOPY) {
> >                          vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt));
> >                          continue;
> >                  }
> >
> > At least the reproducer at
> > https://bugzilla.kernel.org/show_bug.cgi?id=217624 is working now. But
> > I wonder if that's the best way to fix this. It's surely simple, but
> > locking every VMA is not free, and doing that on every fork might
> > regress performance.
>
> That would mean that we can possibly still get page faults concurrently
> with fork(), on the not-yet-processed part. While that fixes the issue at
> hand, I cannot reliably tell whether this messes with some other fork()
> corner case.
>
> I'd suggest write-locking all VMAs upfront, before doing any kind of
> fork-mm operation. Just like the old code did. See below.
>
> >
> >> Thanks,
> >> Suren.
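
One way the upfront-locking suggestion could look, sketched against the dup_mmap() loop quoted above. This is kernel-style pseudocode, not a posted patch; vma_start_write() and for_each_vma() come from the quoted diff, while vma_iter_set() as an iterator-rewind helper is an assumption here:

```
	/*
	 * Sketch only: write-lock every parent VMA before copying anything,
	 * so per-VMA-lock page faults block for the whole of dup_mmap()
	 * instead of racing with the not-yet-copied tail of the tree.
	 */
	for_each_vma(old_vmi, mpnt)
		vma_start_write(mpnt);

	vma_iter_set(&old_vmi, 0);	/* assumed: rewind the VMA iterator */

	for_each_vma(old_vmi, mpnt) {
		/* ... existing copy loop, unchanged ... */
	}
```

The trade-off against the per-iteration vma_start_write() in the quoted diff is that no fault can slip in between copying one VMA and locking the next, at the cost of walking the tree twice.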
> >>
> >>>
> >>> We had similar issues in the past with userfaultfd; see the comment at
> >>> the beginning of do_wp_page():
> >>>
> >>>          if (likely(!unshare)) {
> >>>                  if (userfaultfd_pte_wp(vma, *vmf->pte)) {
> >>>                          pte_unmap_unlock(vmf->pte, vmf->ptl);
> >>>                          return handle_userfault(vmf, VM_UFFD_WP);
> >>>                  }
> >>>
> >>>                  /*
> >>>                   * Userfaultfd write-protect can defer flushes. Ensure the TLB
> >>>                   * is flushed in this case before copying.
> >>>                   */
> >>>                  if (unlikely(userfaultfd_wp(vmf->vma) &&
> >>>                               mm_tlb_flush_pending(vmf->vma->vm_mm)))
> >>>                          flush_tlb_page(vmf->vma, vmf->address);
> >>>          }
> >
> > If do_wp_page() could identify that vmf->vma is being copied, we could
> > simply return VM_FAULT_RETRY and retry the page fault under mmap_lock,
> > which would block until dup_mmap() is done... Maybe we could use
> > mm_tlb_flush_pending() for that? WDYT?
>
> I'm not convinced that we should be making that code more complicated
> simply to speed up fork() with concurrent page faults.
>
> My gut feeling is that many operations that could possibly take the VMA
> lock in the future (page pinning triggering unsharing) should not run
> concurrently with fork().
>
> So IMHO, keep the old behavior of fork() -- i.e., no concurrent page
> faults -- and unlock that eventually in the future when deemed really
> required (but people should really avoid fork() in performance-sensitive
> applications unless absolutely required).

Thanks for the input, David. Yeah, that sounds reasonable. I'll test some
more tomorrow morning, and if everything looks good, I will post a patch
to lock the VMAs and another one to re-enable CONFIG_PER_VMA_LOCK.
Thanks for all the help!
Suren.

>
> --
> Cheers,
>
> David / dhildenb
>