On 12/8/20 7:57 PM, Jason Gunthorpe wrote:
> On Tue, Dec 08, 2020 at 05:29:01PM +0000, Joao Martins wrote:
>> Similar to follow_hugetlb_page(), add a follow_devmap_page() which,
>> rather than calling follow_page() per 4K page in a PMD/PUD, does so
>> for the entire PMD: we lock the pmd/pud, get all pages, then unlock.
>>
>> While doing so, we only change the refcount once when PGMAP_COMPOUND
>> is passed in.
>>
>> This lets us improve {pin,get}_user_pages{,_longterm}() considerably:
>>
>> $ gup_benchmark -f /dev/dax0.2 -m 16384 -r 10 -S [-U,-b,-L] -n 512 -w
>>
>> (<test>)                           [before] -> [after]
>> (get_user_pages 2M pages)          ~150k us -> ~8.9k us
>> (pin_user_pages 2M pages)          ~192k us -> ~9k us
>> (pin_user_pages_longterm 2M pages) ~200k us -> ~19k us
>>
>> Signed-off-by: Joao Martins <joao.m.martins@xxxxxxxxxx>
>> ---
>> I've special-cased this to device-dax vmas given their page size
>> guarantees, similar to hugetlbfs, but I feel this is a bit wrong. I am
>> replicating follow_hugetlb_page(); this RFC ought to seek feedback on
>> whether this should be generalized if no fundamental issues exist. In
>> such a case, should I be changing follow_page_mask() to take either an
>> array of pages or a function pointer and opaque arguments which would
>> let the caller pick its structure?
>
> I would be extremely sad if this was the only way to do this :(
>
> We should be trying to make things more general.

Yeap, indeed. Especially since a similar problem is observed for THP, at
least from the measurements I saw. It is all slow, except for hugetlbfs.

> The hmm_range_fault() path doesn't have major special cases for
> device; I am struggling to understand why gup fast and slow do.
>
> What we've talked about is changing the calling convention across all
> of this to something like:
>
> struct gup_output {
>    struct page **cur;
>    struct page **end;
>    unsigned long vaddr;
>    [..]
> }
>
> And making the manipulators, like you saw for GUP, common:
>
>   gup_output_single_page()
>   gup_output_pages()
>
> Then putting this everywhere. This is the pattern that we ended up
> with in hmm_range_fault(), and it seems to be working quite well.
>
> fast/slow should be much more symmetric in code than they are today,
> IMHO..

Thanks for the suggestions above.

> I think those differences mainly exist because it used to be siloed in
> arch code. Some of the differences might be bugs, we've seen that a
> few times at least..

Interesting, I wasn't aware of the siloing.

I'll go investigate how this all fits together with the refactoring, at
which point a future iteration of this particular patch probably needs
to move independently from this series.

	Joao