On Tue, Jul 4, 2023 at 12:18 AM David Hildenbrand <david@xxxxxxxxxx> wrote:
>
> On 04.07.23 08:50, Suren Baghdasaryan wrote:
> > On Mon, Jul 3, 2023 at 10:39 PM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote:
> >>
> >> On Mon, Jul 3, 2023 at 8:30 PM David Hildenbrand <david@xxxxxxxxxx> wrote:
> >>>
> >>> On 03.07.23 20:21, Suren Baghdasaryan wrote:
> >>>> A memory corruption was reported in [1] with bisection pointing to the
> >>>> patch [2] enabling per-VMA locks for x86.
> >>>> Disable per-VMA locks config to prevent this issue while the problem is
> >>>> being investigated. This is expected to be a temporary measure.
> >>>>
> >>>> [1] https://bugzilla.kernel.org/show_bug.cgi?id=217624
> >>>> [2] https://lore.kernel.org/all/20230227173632.3292573-30-surenb@xxxxxxxxxx
> >>>>
> >>>> Reported-by: Jiri Slaby <jirislaby@xxxxxxxxxx>
> >>>> Reported-by: Jacob Young <jacobly.alt@xxxxxxxxx>
> >>>> Fixes: 0bff0aaea03e ("x86/mm: try VMA lock-based page fault handling first")
> >>>> Signed-off-by: Suren Baghdasaryan <surenb@xxxxxxxxxx>
> >>>> ---
> >>>>   mm/Kconfig | 2 +-
> >>>>   1 file changed, 1 insertion(+), 1 deletion(-)
> >>>>
> >>>> diff --git a/mm/Kconfig b/mm/Kconfig
> >>>> index 09130434e30d..de94b2497600 100644
> >>>> --- a/mm/Kconfig
> >>>> +++ b/mm/Kconfig
> >>>> @@ -1224,7 +1224,7 @@ config ARCH_SUPPORTS_PER_VMA_LOCK
> >>>>          def_bool n
> >>>>
> >>>>  config PER_VMA_LOCK
> >>>> -       def_bool y
> >>>> +       bool "Enable per-vma locking during page fault handling."
> >>>>          depends on ARCH_SUPPORTS_PER_VMA_LOCK && MMU && SMP
> >>>>          help
> >>>>            Allow per-vma locking during page fault handling.
> >>>
> >>> As raised at LSF/MM, I was "surprised" that we can now handle page faults
> >>> concurrently with fork() and was expecting something to be broken already.
> >>>
> >>> What probably happens is that we wr-protected the page in the parent
> >>> process and COW-shared an anon page with the child using
> >>> copy_present_pte().
> >>>
> >>> But we only flush the parent MM TLB before we drop the parent MM lock
> >>> in dup_mmap().
> >>>
> >>> If we get a write fault before that TLB flush in the parent, and we end
> >>> up replacing that anon page in the parent process in do_wp_page()
> >>> [because it is COW-shared with the child], this might be problematic:
> >>> some stale writable TLB entries can target the wrong (old) page.
> >>
> >> Hi David,
> >> Thanks for the detailed explanation. Let me check if this is indeed
> >> what's happening here. If that is indeed the cause, I think we can
> >> write-lock the VMAs being dup'ed until the TLB is flushed and
> >> mmap_write_unlock(oldmm) unlocks them all and lets page faults
> >> proceed. If that works, we will at least know the reason for the
> >> memory corruption.
> >
> > Yep, locking the VMAs being copied inside dup_mmap() seems to fix the issue:
> >
> >          for_each_vma(old_vmi, mpnt) {
> >                  struct file *file;
> >
> > +               vma_start_write(mpnt);
> >                  if (mpnt->vm_flags & VM_DONTCOPY) {
> >                          vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt));
> >                          continue;
> >                  }
> >
> > At least the reproducer at
> > https://bugzilla.kernel.org/show_bug.cgi?id=217624 is working now. But
> > I wonder if that's the best way to fix this. It's surely simple, but
> > locking every VMA is not free, and doing that on every fork might
> > regress performance.
>
> That would mean that we can possibly still get page faults concurrently
> with fork(), on the not-yet-processed part. While that fixes the issue at
> hand, I cannot reliably tell whether this messes with some other fork()
> corner case.
>
> I'd suggest write-locking all VMAs upfront, before doing any kind of
> fork-mm operation. Just like the old code did. See below.
>
> >
> >> Thanks,
> >> Suren.
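
One way the upfront-locking suggestion could look, sketched against the dup_mmap() loop quoted above. This is kernel-style pseudocode, not a posted patch; vma_start_write() and for_each_vma() come from the quoted diff, while vma_iter_set() as an iterator-rewind helper is an assumption here:

```
	/*
	 * Sketch only: write-lock every parent VMA before copying anything,
	 * so per-VMA-lock page faults block for the whole of dup_mmap()
	 * instead of racing with the not-yet-copied tail of the tree.
	 */
	for_each_vma(old_vmi, mpnt)
		vma_start_write(mpnt);

	vma_iter_set(&old_vmi, 0);	/* assumed: rewind the VMA iterator */

	for_each_vma(old_vmi, mpnt) {
		/* ... existing copy loop, unchanged ... */
	}
```

The trade-off against the per-iteration vma_start_write() in the quoted diff is that no fault can slip in between copying one VMA and locking the next, at the cost of walking the tree twice.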
> >>
> >>>
> >>> We had similar issues in the past with userfaultfd; see the comment at
> >>> the beginning of do_wp_page():
> >>>
> >>>          if (likely(!unshare)) {
> >>>                  if (userfaultfd_pte_wp(vma, *vmf->pte)) {
> >>>                          pte_unmap_unlock(vmf->pte, vmf->ptl);
> >>>                          return handle_userfault(vmf, VM_UFFD_WP);
> >>>                  }
> >>>
> >>>                  /*
> >>>                   * Userfaultfd write-protect can defer flushes. Ensure the TLB
> >>>                   * is flushed in this case before copying.
> >>>                   */
> >>>                  if (unlikely(userfaultfd_wp(vmf->vma) &&
> >>>                               mm_tlb_flush_pending(vmf->vma->vm_mm)))
> >>>                          flush_tlb_page(vmf->vma, vmf->address);
> >>>          }
> >
> > If do_wp_page() could identify that vmf->vma is being copied, we could
> > simply return VM_FAULT_RETRY and retry the page fault under mmap_lock,
> > which would block until dup_mmap() is done... Maybe we could use
> > mm_tlb_flush_pending() for that? WDYT?
>
> I'm not convinced that we should be making that code more complicated
> simply to speed up fork() with concurrent page faults.
>
> My gut feeling is that many operations that could possibly take the VMA
> lock in the future (page pinning triggering unsharing) should not run
> concurrently with fork().
>
> So IMHO, keep the old behavior of fork() -- i.e., no concurrent page
> faults -- and unlock that eventually in the future when deemed really
> required (but people should really avoid fork() in performance-sensitive
> applications unless absolutely required).

Thanks for the input, David. Yeah, that sounds reasonable. I'll test some
more tomorrow morning, and if everything looks good, I will post a patch
to lock the VMAs and another one to re-enable CONFIG_PER_VMA_LOCK.
Thanks for all the help!
Suren.

>
> --
> Cheers,
>
> David / dhildenb
>