On Wed, 31 May 2023, Jann Horn wrote: > On Mon, May 29, 2023 at 8:25 AM Hugh Dickins <hughd@xxxxxxxxxx> wrote: > > +static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff) ... > > + * Note that vma->anon_vma check is racy: it can be set after > > + * the check, but page locks (with XA_RETRY_ENTRYs in holes) > > + * prevented establishing new ptes of the page. So we are safe > > + * to remove page table below, without even checking it's empty. > > This "we are safe to remove page table below, without even checking > it's empty" assumes that the only way to create new anonymous PTEs is > to use existing file PTEs, right? What about private shmem VMAs that > are registered with userfaultfd as VM_UFFD_MISSING? I think for those, > the UFFDIO_COPY ioctl lets you directly insert anonymous PTEs without > looking at the mapping and its pages (except for checking that the > insertion point is before end-of-file), protected only by mmap_lock > (shared) and pte_offset_map_lock(). Right, from your comments and Peter's, thank you both, I can see that userfaultfd breaks the usual assumptions here: so I'm putting an if (unlikely(vma->anon_vma || userfaultfd_wp(vma))) check in once we've got the ptlock; with a comment above it to point the blame at uffd, though I gave up on describing all the detail. And deleted this earlier "we are safe" paragraph. You did suggest, in another mail, that perhaps there should be a scan checking all pte_none() when we get the ptlock. I wasn't keen on yet another debug scan for bugs and didn't add that, thinking I was going to add a patch on the end to do so in page_table_check_pte_clear_range(). But when I came to write that patch, found that I'd been misled by its name: it's about checking or adjusting some accounting, not really a suitable place to check for pte_none() at all; so just scrapped it. ... > > - collapse_and_free_pmd(mm, vma, addr, pmd); > > The old code called collapse_and_free_pmd(), which involves MMU > notifier invocation... ... > > + pml = pmd_lock(mm, pmd); > > + ptl = pte_lockptr(mm, pmd); > > + if (ptl != pml) > > + spin_lock_nested(ptl, SINGLE_DEPTH_NESTING); > > + pgt_pmd = pmdp_collapse_flush(vma, addr, pmd); > > ... while the new code only does pmdp_collapse_flush(), which clears > the pmd entry and does a TLB flush, but AFAICS doesn't use MMU > notifiers. My understanding is that that's problematic - maybe (?) it > is sort of okay with regards to classic MMU notifier users like KVM, > but it's probably wrong for IOMMUv2 users, where an IOMMU directly > consumes the normal page tables? Right, I intentionally left out the MMU notifier invocation, knowing that we have already done an MMU notifier invocation when unmapping any PTEs which were mapped: it was necessary for collapse_and_free_pmd() in the collapse_pte_mapped_thp() case, but there was no notifier in this case for many years, and I was glad to be rid of it. However, I now see that you were adding it intentionally even for this case in your f268f6cf875f; and from later comments in this thread, it looks like there is still uncertainty about whether it is needed here, but safer to assume that it is needed: I'll add it back. > > (FWIW, last I looked, there also seemed to be some other issues with > MMU notifier usage wrt IOMMUv2, see the thread > <https://lore.kernel.org/linux-mm/Yzbaf9HW1%2FreKqR8@xxxxxxxxxx/>.)