Re: [PATCH RFC 9/9] mm: Add follow_devmap_page() for devdax vmas

Jason Gunthorpe <jgg@xxxxxxxx> · Tue, 8 Dec 2020 15:57:54 -0400

On Tue, Dec 08, 2020 at 05:29:01PM +0000, Joao Martins wrote:
> Similar to follow_hugetlb_page() add a follow_devmap_page which rather
> than calling follow_page() per 4K page in a PMD/PUD it does so for the
> entire PMD, where we lock the pmd/pud, get all pages , unlock.
> 
> While doing so, we only change the refcount once when PGMAP_COMPOUND is
> passed in.
> 
> This let us improve {pin,get}_user_pages{,_longterm}() considerably:
> 
> $ gup_benchmark -f /dev/dax0.2 -m 16384 -r 10 -S [-U,-b,-L] -n 512 -w
> 
> (<test>) [before] -> [after]
> (get_user_pages 2M pages) ~150k us -> ~8.9k us
> (pin_user_pages 2M pages) ~192k us -> ~9k us
> (pin_user_pages_longterm 2M pages) ~200k us -> ~19k us
> 
> Signed-off-by: Joao Martins <joao.m.martins@xxxxxxxxxx>
> ---
> I've special-cased this to device-dax vmas given its similar page size
> guarantees as hugetlbfs, but I feel this is a bit wrong. I am
> replicating follow_hugetlb_page() as RFC ought to seek feedback whether
> this should be generalized if no fundamental issues exist. In such case,
> should I be changing follow_page_mask() to take either an array of pages
> or a function pointer and opaque arguments which would let caller pick
> its structure?

I would be extremely sad if this was the only way to do this :(

We should be trying to make things more general. The
hmm_range_fault_path() doesn't have major special cases for device, I
am struggling to understand why gup fast and slow do.

What we've talked about is changing the calling convention across all
of this to something like:

struct gup_output {
   struct page **cur;
   struct page **end;
   unsigned long vaddr;
   [..]
}

And making the manipulator like you saw for GUP common:

gup_output_single_page()
gup_output_pages()

Then putting this eveywhere. This is the pattern that we ended up with
in hmm_range_fault, and it seems to be working quite well.

fast/slow should be much more symmetric in code than they are today,
IMHO.. I think those differences mainly exist because it used to be
siloed in arch code. Some of the differences might be bugs, we've seen
that a few times at least..

Jason