On 12/9/20 6:14 PM, Matthew Wilcox wrote:
> On Wed, Dec 09, 2020 at 12:24:38PM -0400, Jason Gunthorpe wrote:
>> On Wed, Dec 09, 2020 at 04:02:05PM +0000, Joao Martins wrote:
>>
>>> Today (without the series) struct pages are not represented the way they
>>> are expressed in the page tables, which is what I am hoping to fix in this
>>> series thus initializing these as compound pages of a given order. But me
>>> introducing PGMAP_COMPOUND was to conservatively keep both old (non-compound)
>>> and new (compound pages) co-exist.
>>
>> Oooh, that I didn't know.. That is kind of horrible to have a PMD
>> pointing at an order 0 page only in this one special case.
>
> Uh, yes. I'm surprised it hasn't caused more problems.
>
There were one or two problems in the KVM MMU related to zone device pages.

See commit e851265a816f ("KVM: x86/mmu: Use huge pages for DAX-backed files"),
which eventually led to commit db5432165e9b5 ("KVM: x86/mmu: Walk host page
tables to find THP mappings") to be less amenable to metadata changes.

>> Still, I think it would be easier to teach record_subpages() that a
>> PMD doesn't necessarily point to a high order page, eg do something
>> like I suggested for the SGL where it extracts the page order and
>> iterates over the contiguous range of pfns.
>
> But we also see good performance improvements from doing all reference
> counts on the head page instead of spread throughout the pages, so we
> really want compound pages.

Going further than just refcounts, and borrowing your (or someone else's?) idea,
perhaps we could also add a FOLL_HEAD gup flag that would let us work only with
head pages (or folios). That would consequently let us pin/grab bigger swathes
of memory, e.g. 1G (in 2M head pages) or 512G (in 1G head pages), with just
1 page for storing the struct pages[*]. Albeit I suspect the numbers would have
to justify it.

	Joao

[*] One page happens to be what's used for RDMA/umem and vdpa as callers of
pin_user_pages*()
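
To put numbers on the 1G/512G figures above, here is a minimal userspace
sketch of the arithmetic. It assumes 4K base pages and 8-byte struct page
pointers (as on x86-64). The helper names are made up for illustration and
are not kernel APIs, and FOLL_HEAD itself is only a proposal at this point.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of struct page pointers that fit in the array handed to
 * pin_user_pages*(). Hypothetical helper, plain arithmetic only.
 */
static uint64_t gup_array_entries(uint64_t array_bytes, uint64_t entry_bytes)
{
	assert(entry_bytes != 0);
	return array_bytes / entry_bytes;
}

/*
 * Bytes of memory covered by one such array if every entry refers to a
 * head page of head_page_bytes (the idea behind the proposed FOLL_HEAD).
 */
static uint64_t gup_array_coverage(uint64_t array_bytes, uint64_t entry_bytes,
				   uint64_t head_page_bytes)
{
	return gup_array_entries(array_bytes, entry_bytes) * head_page_bytes;
}
```

With a single 4K page of pointers, gup_array_entries(4096, 8) gives 512
entries, so gup_array_coverage(4096, 8, 2ULL << 20) is 1G and
gup_array_coverage(4096, 8, 1ULL << 30) is 512G. With order-0 entries the same
array covers only 2M per call.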