On Mon, Aug 21, 2023 at 9:51 PM Hugh Dickins <hughd@xxxxxxxxxx> wrote: > Jann Horn demonstrated how userfaultfd ioctl UFFDIO_COPY into a private > shmem mapping can add valid PTEs to page table collapse_pte_mapped_thp() > thought it had emptied: page lock on the huge page is enough to protect > against WP faults (which find the PTE has been cleared), but not enough > to protect against userfaultfd. "BUG: Bad rss-counter state" followed. > > retract_page_tables() protects against this by checking !vma->anon_vma; > but we know that MADV_COLLAPSE needs to be able to work on private shmem > mappings, even those with an anon_vma prepared for another part of the > mapping; and we know that MADV_COLLAPSE needs to work on shared shmem > mappings which are userfaultfd_armed(). Whether it needs to work on > private shmem mappings which are userfaultfd_armed(), I'm not so sure: > but assume that it does. I think we couldn't rely on anon_vma here anyway, since holding the mmap_lock in read mode doesn't prevent concurrent creation of an anon_vma? > Just for this case, take the pmd_lock() two steps earlier: not because > it gives any protection against this case itself, but because ptlock > nests inside it, and it's the dropping of ptlock which let the bug in. > In other cases, continue to minimize the pmd_lock() hold time. Special-casing userfaultfd like this makes me a bit uncomfortable; but I also can't find anything other than userfaultfd that would insert pages into regions that are khugepaged-compatible, so I guess this works? I guess an alternative would be to use a spin_trylock() instead of the current pmd_lock(), and if that fails, temporarily drop the page table lock and then restart from step 2 with both locks held - and at that point the page table scan should be fast since we expect it to usually be empty.