On 12/8/20 7:57 PM, Jason Gunthorpe wrote:
> On Tue, Dec 08, 2020 at 05:29:01PM +0000, Joao Martins wrote:
>> Similar to follow_hugetlb_page(), add a follow_devmap_page() which,
>> rather than calling follow_page() per 4K page in a PMD/PUD, does so
>> for the entire PMD: we lock the pmd/pud, get all pages, then unlock.
>>
>> While doing so, we only change the refcount once when PGMAP_COMPOUND
>> is passed in.
>>
>> This lets us improve {pin,get}_user_pages{,_longterm}() considerably:
>>
>> $ gup_benchmark -f /dev/dax0.2 -m 16384 -r 10 -S [-U,-b,-L] -n 512 -w
>>
>> (<test>)                           [before] -> [after]
>> (get_user_pages 2M pages)          ~150k us -> ~8.9k us
>> (pin_user_pages 2M pages)          ~192k us -> ~9k us
>> (pin_user_pages_longterm 2M pages) ~200k us -> ~19k us
>>
>> Signed-off-by: Joao Martins <joao.m.martins@xxxxxxxxxx>
>> ---
>> I've special-cased this to device-dax vmas given their page size
>> guarantees, similar to hugetlbfs, but I feel this is a bit wrong. I am
>> replicating follow_hugetlb_page(); this RFC ought to seek feedback on
>> whether this should be generalized if no fundamental issues exist. In
>> such a case, should I be changing follow_page_mask() to take either an
>> array of pages or a function pointer and opaque arguments which would
>> let the caller pick its structure?
>
> I would be extremely sad if this was the only way to do this :(
>
> We should be trying to make things more general.

Yeap, indeed. Especially since a similar problem is observed for THP, at
least from the measurements I saw. It is all slow, except for hugetlbfs.

> The hmm_range_fault() path doesn't have major special cases for
> device; I am struggling to understand why gup fast and slow do.
>
> What we've talked about is changing the calling convention across all
> of this to something like:
>
> struct gup_output {
>    struct page **cur;
>    struct page **end;
>    unsigned long vaddr;
>    [..]
> }
>
> And making the manipulators, like you saw for GUP, common:
>
>   gup_output_single_page()
>   gup_output_pages()
>
> Then putting this everywhere. This is the pattern that we ended up
> with in hmm_range_fault(), and it seems to be working quite well.
>
> fast/slow should be much more symmetric in code than they are today,
> IMHO..

Thanks for the suggestions above.

> I think those differences mainly exist because it used to be siloed in
> arch code. Some of the differences might be bugs, we've seen that a
> few times at least..

Interesting, I wasn't aware of the siloing.

I'll go investigate how this all fits together with the refactoring, at
which point a future iteration of this particular patch probably needs
to move independently from this series.

	Joao