On Tue, Feb 7, 2023 at 3:35 PM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
>
> On Tue, Feb 07, 2023 at 03:27:07PM -0800, James Houghton wrote:
> > So page_vma_mapped_walk() might have to walk up to HPAGE_PMD_NR-ish
> > PTEs (if we find a bunch of pte_none() PTEs). Just curious, could that
> > be any slower than what we currently do (like, incrementing up to
> > HPAGE_PMD_NR-ish subpage mapcounts)? Or is it not a concern?
>
> I think it's faster. Both of these operations work on folio_nr_pages()
> entries ... but a page table is 8 bytes and a struct page is 64 bytes.
> From a CPU prefetching point of view, they're both linear scans, but
> PTEs are 8 times denser.
>
> The other factor to consider is how often we do each of these operations.
> Mapping a folio happens ~once per call to mmap() (even though it's delayed
> until page fault time). Querying folio_total_mapcount() happens ... less
> often, I think? Both are going to be quite rare since generally we map
> the entire folio at once.

Maybe this is a case where we would see a regression: doing PAGE_SIZE
UFFDIO_CONTINUEs on a THP. Worst case, go from the end of the THP to
the beginning (ending up with a PTE-mapped THP at the end). For the
i'th PTE we map / i'th UFFDIO_CONTINUE, we have to check
`folio_nr_pages() - i` PTEs (for most of the iterations anyway). Seems
like this scales with the square of the size of the folio, so this
approach would be kind of a non-starter for HugeTLB (with
high-granularity mapping), I think.

This example isn't completely contrived: if we did post-copy live
migration with userfaultfd, we might end up doing something like this.

I'm curious what you think. :)

- James
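
To make the quadratic estimate above concrete, here is a rough
back-of-the-envelope sketch. It is a hypothetical userspace model, not
kernel code: it only counts how many PTEs get scanned if the folio is
mapped back-to-front one UFFDIO_CONTINUE at a time and each call walks
the still-pte_none() entries ahead of the one being installed, as the
mail describes. The PTE counts assume 4K base pages on x86-64 (512 for
a 2M THP, 262144 for a 1G hugetlb page).

/*
 * Hypothetical cost model (userspace, not kernel code): map a folio
 * back-to-front with one UFFDIO_CONTINUE per PTE, where the i'th call
 * has to scan roughly (nr - i) PTEs before it can account the mapping.
 */
#include <stdio.h>

int main(void)
{
	/* 4K base pages on x86-64: 2M THP = 512 PTEs, 1G hugetlb = 262144. */
	unsigned long sizes[] = { 512, 262144 };

	for (int i = 0; i < 2; i++) {
		unsigned long nr = sizes[i];
		unsigned long long checks = 0;

		/* The i'th UFFDIO_CONTINUE scans about (nr - i) PTEs. */
		for (unsigned long j = 0; j < nr; j++)
			checks += nr - j;

		/* Total is nr * (nr + 1) / 2, i.e. ~nr^2 / 2. */
		printf("%lu PTEs -> ~%llu PTE checks total\n", nr, checks);
	}
	return 0;
}

Under those assumptions the 2M THP case does roughly 131 thousand PTE
checks instead of 512, and the 1G hugetlb case climbs into the tens of
billions, which is what makes the approach look like a non-starter for
high-granularity mapping.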