On Tue, Feb 7, 2023 at 3:35 PM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
>
> On Tue, Feb 07, 2023 at 03:27:07PM -0800, James Houghton wrote:
> > So page_vma_mapped_walk() might have to walk up to HPAGE_PMD_NR-ish
> > PTEs (if we find a bunch of pte_none() PTEs). Just curious, could that
> > be any slower than what we currently do (like, incrementing up to
> > HPAGE_PMD_NR-ish subpage mapcounts)? Or is it not a concern?
>
> I think it's faster. Both of these operations work on folio_nr_pages()
> entries ... but a page table is 8 bytes and a struct page is 64 bytes.
> From a CPU prefetching point of view, they're both linear scans, but
> PTEs are 8 times denser.
>
> The other factor to consider is how often we do each of these operations.
> Mapping a folio happens ~once per call to mmap() (even though it's delayed
> until page fault time). Querying folio_total_mapcount() happens ... less
> often, I think? Both are going to be quite rare since generally we map
> the entire folio at once.

Maybe this is a case where we would see a regression: doing PAGE_SIZE
UFFDIO_CONTINUEs on a THP. Worst case, go from the end of the THP to
the beginning (ending up with a PTE-mapped THP at the end). For the
i'th PTE we map / i'th UFFDIO_CONTINUE, we have to check
`folio_nr_pages() - i` PTEs (for most of the iterations anyway). Seems
like this scales with the square of the size of the folio, so this
approach would be kind of a non-starter for HugeTLB (with
high-granularity mapping), I think.

This example isn't completely contrived: if we did post-copy live
migration with userfaultfd, we might end up doing something like this.

I'm curious what you think. :)

- James
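
To make the quadratic estimate above concrete, here is a rough
back-of-the-envelope sketch. It is a hypothetical userspace model, not
kernel code: it only counts how many PTEs get scanned if the folio is
mapped back-to-front one UFFDIO_CONTINUE at a time and each call walks
the still-pte_none() entries ahead of the one being installed, as the
mail describes. The PTE counts assume 4K base pages on x86-64 (512 for
a 2M THP, 262144 for a 1G hugetlb page).

/*
 * Hypothetical cost model (userspace, not kernel code): map a folio
 * back-to-front with one UFFDIO_CONTINUE per PTE, where the i'th call
 * has to scan roughly (nr - i) PTEs before it can account the mapping.
 */
#include <stdio.h>

int main(void)
{
	/* 4K base pages on x86-64: 2M THP = 512 PTEs, 1G hugetlb = 262144. */
	unsigned long sizes[] = { 512, 262144 };

	for (int i = 0; i < 2; i++) {
		unsigned long nr = sizes[i];
		unsigned long long checks = 0;

		/* The i'th UFFDIO_CONTINUE scans about (nr - i) PTEs. */
		for (unsigned long j = 0; j < nr; j++)
			checks += nr - j;

		/* Total is nr * (nr + 1) / 2, i.e. ~nr^2 / 2. */
		printf("%lu PTEs -> ~%llu PTE checks total\n", nr, checks);
	}
	return 0;
}

Under those assumptions the 2M THP case does roughly 131 thousand PTE
checks instead of 512, and the 1G hugetlb case climbs into the tens of
billions, which is what makes the approach look like a non-starter for
high-granularity mapping.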