On Fri 31-01-20 19:40:25, John Hubbard wrote:
> For huge pages (and in fact, any compound page), the
> GUP_PIN_COUNTING_BIAS scheme tends to overflow too easily, each tail
> page increments the head page->_refcount by GUP_PIN_COUNTING_BIAS
> (1024). That limits the number of huge pages that can be pinned.
>
> This patch removes that limitation, by using an exact form of pin
> counting for compound pages of order > 1. The "order > 1" is required
> because this approach uses the 3rd struct page in the compound page, and
> order 1 compound pages only have two pages, so that won't work there.
>
> A new struct page field, hpage_pinned_refcount, has been added,
> replacing a padding field in the union (so no new space is used).
>
> This enhancement also has a useful side effect: huge pages and compound
> pages (of order > 1) do not suffer from the "potential false positives"
> problem that is discussed in the page_dma_pinned() comment block. That
> is because these compound pages have extra space for tracking things, so
> they get exact pin counts instead of overloading page->_refcount.
>
> Documentation/core-api/pin_user_pages.rst is updated accordingly.
>
> Suggested-by: Jan Kara <jack@xxxxxxx>
> Signed-off-by: John Hubbard <jhubbard@xxxxxxxxxx>

The patch looks good to me. You can add:

Reviewed-by: Jan Kara <jack@xxxxxxx>

                                                                Honza

> ---
>  Documentation/core-api/pin_user_pages.rst | 40 +++++-------
>  include/linux/mm.h                        | 26 ++++++++
>  include/linux/mm_types.h                  |  7 +-
>  mm/gup.c                                  | 78 ++++++++++++++++++++---
>  mm/hugetlb.c                              |  6 ++
>  mm/page_alloc.c                           |  2 +
>  mm/rmap.c                                 |  6 ++
>  7 files changed, 133 insertions(+), 32 deletions(-)
>
> diff --git a/Documentation/core-api/pin_user_pages.rst b/Documentation/core-api/pin_user_pages.rst
> index 9829345428f8..3f72b1ea1104 100644
> --- a/Documentation/core-api/pin_user_pages.rst
> +++ b/Documentation/core-api/pin_user_pages.rst
> @@ -52,8 +52,22 @@ Which flags are set by each wrapper
>
>  For these pin_user_pages*() functions, FOLL_PIN is OR'd in with whatever gup
>  flags the caller provides. The caller is required to pass in a non-null struct
> -pages* array, and the function then pin pages by incrementing each by a special
> -value. For now, that value is +1, just like get_user_pages*().::
> +pages* array, and the function then pins pages by incrementing each by a special
> +value: GUP_PIN_COUNTING_BIAS.
> +
> +For huge pages (and in fact, any compound page of more than 2 pages), the
> +GUP_PIN_COUNTING_BIAS scheme is not used. Instead, an exact form of pin counting
> +is achieved, by using the 3rd struct page in the compound page. A new struct
> +page field, hpage_pinned_refcount, has been added in order to support this.
> +
> +This approach for compound pages avoids the counting upper limit problems that
> +are discussed below. Those limitations would have been aggravated severely by
> +huge pages, because each tail page adds a refcount to the head page. And in
> +fact, testing revealed that, without a separate hpage_pinned_refcount field,
> +page overflows were seen in some huge page stress tests.
> +
> +This also means that huge pages and compound pages (of order > 1) do not suffer
> +from the false positives problem that is mentioned below.::
>
>  Function
>  --------
> @@ -99,27 +113,6 @@ pages:
>  This also leads to limitations: there are only 31-10==21 bits available for a
>  counter that increments 10 bits at a time.
>
> -TODO: for 1GB and larger huge pages, this is cutting it close. That's because
> -when pin_user_pages() follows such pages, it increments the head page by "1"
> -(where "1" used to mean "+1" for get_user_pages(), but now means "+1024" for
> -pin_user_pages()) for each tail page. So if you have a 1GB huge page:
> -
> -* There are 256K (18 bits) worth of 4 KB tail pages.
> -* There are 21 bits available to count up via GUP_PIN_COUNTING_BIAS (that is,
> -  10 bits at a time)
> -* There are 21 - 18 == 3 bits available to count. Except that there aren't,
> -  because you need to allow for a few normal get_page() calls on the head page,
> -  as well. Fortunately, the approach of using addition, rather than "hard"
> -  bitfields, within page->_refcount, allows for sharing these bits gracefully.
> -  But we're still looking at about 8 references.
> -
> -This, however, is a missing feature more than anything else, because it's easily
> -solved by addressing an obvious inefficiency in the original get_user_pages()
> -approach of retrieving pages: stop treating all the pages as if they were
> -PAGE_SIZE. Retrieve huge pages as huge pages. The callers need to be aware of
> -this, so some work is required. Once that's in place, this limitation mostly
> -disappears from view, because there will be ample refcounting range available.
> -
>  * Callers must specifically request "dma-pinned tracking of pages". In other
>    words, just calling get_user_pages() will not suffice; a new set of functions,
>    pin_user_page() and related, must be used.
> @@ -228,5 +221,6 @@ References
>  * `Some slow progress on get_user_pages() (Apr 2, 2019) <https://lwn.net/Articles/784574/>`_
>  * `DMA and get_user_pages() (LPC: Dec 12, 2018) <https://lwn.net/Articles/774411/>`_
>  * `The trouble with get_user_pages() (Apr 30, 2018) <https://lwn.net/Articles/753027/>`_
> +* `LWN kernel index: get_user_pages() <https://lwn.net/Kernel/Index/#Memory_management-get_user_pages>`
>
>  John Hubbard, October, 2019
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index ca787c606f0e..fdcd137b9981 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -770,6 +770,24 @@ static inline unsigned int compound_order(struct page *page)
>          return page[1].compound_order;
>  }
>
> +static inline bool hpage_pincount_available(struct page *page)
> +{
> +        /*
> +         * Can the page->hpage_pinned_refcount field be used? That field is in
> +         * the 3rd page of the compound page, so the smallest (2-page) compound
> +         * pages cannot support it.
> +         */
> +        page = compound_head(page);
> +        return PageCompound(page) && compound_order(page) > 1;
> +}
> +
> +static inline int compound_pincount(struct page *page)
> +{
> +        VM_BUG_ON_PAGE(!hpage_pincount_available(page), page);
> +        page = compound_head(page);
> +        return atomic_read(compound_pincount_ptr(page));
> +}
> +
>  static inline void set_compound_order(struct page *page, unsigned int order)
>  {
>          page[1].compound_order = order;
> @@ -1084,6 +1102,11 @@ void unpin_user_pages(struct page **pages, unsigned long npages);
>   * refcounts, and b) all the callers of this routine are expected to be able to
>   * deal gracefully with a false positive.
>   *
> + * For huge pages, the result will be exactly correct. That's because we have
> + * more tracking data available: the 3rd struct page in the compound page is
> + * used to track the pincount (instead using of the GUP_PIN_COUNTING_BIAS
> + * scheme).
> + *
>   * For more information, please see Documentation/vm/pin_user_pages.rst.
>   *
>   * @page: pointer to page to be queried.
> @@ -1092,6 +1115,9 @@ void unpin_user_pages(struct page **pages, unsigned long npages);
>   */
>  static inline bool page_maybe_dma_pinned(struct page *page)
>  {
> +        if (hpage_pincount_available(page))
> +                return compound_pincount(page) > 0;
> +
>          /*
>           * page_ref_count() is signed. If that refcount overflows, then
>           * page_ref_count() returns a negative value, and callers will avoid
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index e87bb864bdb2..01e9717b8529 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -137,7 +137,7 @@ struct page {
>                  };
>                  struct {        /* Second tail page of compound page */
>                          unsigned long _compound_pad_1;  /* compound_head */
> -                        unsigned long _compound_pad_2;
> +                        atomic_t hpage_pinned_refcount;
>                          /* For both global and memcg */
>                          struct list_head deferred_list;
>                  };
> @@ -226,6 +226,11 @@ static inline atomic_t *compound_mapcount_ptr(struct page *page)
>          return &page[1].compound_mapcount;
>  }
>
> +static inline atomic_t *compound_pincount_ptr(struct page *page)
> +{
> +        return &page[2].hpage_pinned_refcount;
> +}
> +
>  /*
>   * Used for sizing the vmemmap region on some architectures
>   */
> diff --git a/mm/gup.c b/mm/gup.c
> index 6e8b773c233a..c10d0d051c5b 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -29,6 +29,22 @@ struct follow_page_context {
>          unsigned int page_mask;
>  };
>
> +static void hpage_pincount_add(struct page *page, int refs)
> +{
> +        VM_BUG_ON_PAGE(!hpage_pincount_available(page), page);
> +        VM_BUG_ON_PAGE(page != compound_head(page), page);
> +
> +        atomic_add(refs, compound_pincount_ptr(page));
> +}
> +
> +static void hpage_pincount_sub(struct page *page, int refs)
> +{
> +        VM_BUG_ON_PAGE(!hpage_pincount_available(page), page);
> +        VM_BUG_ON_PAGE(page != compound_head(page), page);
> +
> +        atomic_sub(refs, compound_pincount_ptr(page));
> +}
> +
>  /*
>   * Return the compound head page with ref appropriately incremented,
>   * or NULL if that failed.
> @@ -70,8 +86,25 @@ static __maybe_unused struct page *try_grab_compound_head(struct page *page,
>          if (flags & FOLL_GET)
>                  return try_get_compound_head(page, refs);
>          else if (flags & FOLL_PIN) {
> -                refs *= GUP_PIN_COUNTING_BIAS;
> -                return try_get_compound_head(page, refs);
> +                /*
> +                 * When pinning a compound page of order > 1 (which is what
> +                 * hpage_pincount_available() checks for), use an exact count to
> +                 * track it, via hpage_pincount_add/_sub().
> +                 *
> +                 * However, be sure to *also* increment the normal page refcount
> +                 * field at least once, so that the page really is pinned.
> +                 */
> +                if (!hpage_pincount_available(page))
> +                        refs *= GUP_PIN_COUNTING_BIAS;
> +
> +                page = try_get_compound_head(page, refs);
> +                if (!page)
> +                        return NULL;
> +
> +                if (hpage_pincount_available(page))
> +                        hpage_pincount_add(page, refs);
> +
> +                return page;
>          }
>
>          WARN_ON_ONCE(1);
> @@ -106,12 +139,25 @@ bool __must_check try_grab_page(struct page *page, unsigned int flags)
>          if (flags & FOLL_GET)
>                  return try_get_page(page);
>          else if (flags & FOLL_PIN) {
> +                int refs = 1;
> +
>                  page = compound_head(page);
>
>                  if (WARN_ON_ONCE(page_ref_count(page) <= 0))
>                          return false;
>
> -                page_ref_add(page, GUP_PIN_COUNTING_BIAS);
> +                if (hpage_pincount_available(page))
> +                        hpage_pincount_add(page, 1);
> +                else
> +                        refs = GUP_PIN_COUNTING_BIAS;
> +
> +                /*
> +                 * Similar to try_grab_compound_head(): even if using the
> +                 * hpage_pincount_add/_sub() routines, be sure to
> +                 * *also* increment the normal page refcount field at least
> +                 * once, so that the page really is pinned.
> +                 */
> +                page_ref_add(page, refs);
>          }
>
>          return true;
> @@ -120,12 +166,17 @@ bool __must_check try_grab_page(struct page *page, unsigned int flags)
>  #ifdef CONFIG_DEV_PAGEMAP_OPS
>  static bool __unpin_devmap_managed_user_page(struct page *page)
>  {
> -        int count;
> +        int count, refs = 1;
>
>          if (!page_is_devmap_managed(page))
>                  return false;
>
> -        count = page_ref_sub_return(page, GUP_PIN_COUNTING_BIAS);
> +        if (hpage_pincount_available(page))
> +                hpage_pincount_sub(page, 1);
> +        else
> +                refs = GUP_PIN_COUNTING_BIAS;
> +
> +        count = page_ref_sub_return(page, refs);
>
>          /*
>           * devmap page refcounts are 1-based, rather than 0-based: if
> @@ -157,6 +208,8 @@ static bool __unpin_devmap_managed_user_page(struct page *page)
>   */
>  void unpin_user_page(struct page *page)
>  {
> +        int refs = 1;
> +
>          page = compound_head(page);
>
>          /*
> @@ -168,7 +221,12 @@ void unpin_user_page(struct page *page)
>          if (__unpin_devmap_managed_user_page(page))
>                  return;
>
> -        if (page_ref_sub_and_test(page, GUP_PIN_COUNTING_BIAS))
> +        if (hpage_pincount_available(page))
> +                hpage_pincount_sub(page, 1);
> +        else
> +                refs = GUP_PIN_COUNTING_BIAS;
> +
> +        if (page_ref_sub_and_test(page, refs))
>                  __put_page(page);
>  }
>  EXPORT_SYMBOL(unpin_user_page);
> @@ -2200,8 +2258,12 @@ static int record_subpages(struct page *page, unsigned long addr,
>
>  static void put_compound_head(struct page *page, int refs, unsigned int flags)
>  {
> -        if (flags & FOLL_PIN)
> -                refs *= GUP_PIN_COUNTING_BIAS;
> +        if (flags & FOLL_PIN) {
> +                if (hpage_pincount_available(page))
> +                        hpage_pincount_sub(page, refs);
> +                else
> +                        refs *= GUP_PIN_COUNTING_BIAS;
> +        }
>
>          VM_BUG_ON_PAGE(page_ref_count(page) < refs, page);
>          /*
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 487e998fd38e..07059d936f7b 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1009,6 +1009,9 @@ static void destroy_compound_gigantic_page(struct page *page,
>          struct page *p = page + 1;
>
>          atomic_set(compound_mapcount_ptr(page), 0);
> +        if (hpage_pincount_available(page))
> +                atomic_set(compound_pincount_ptr(page), 0);
> +
>          for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) {
>                  clear_compound_head(p);
>                  set_page_refcounted(p);
> @@ -1287,6 +1290,9 @@ static void prep_compound_gigantic_page(struct page *page, unsigned int order)
>                  set_compound_head(p, page);
>          }
>          atomic_set(compound_mapcount_ptr(page), -1);
> +
> +        if (hpage_pincount_available(page))
> +                atomic_set(compound_pincount_ptr(page), 0);
>  }
>
>  /*
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 15e908ad933b..c205b912f108 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -689,6 +689,8 @@ void prep_compound_page(struct page *page, unsigned int order)
>                  set_compound_head(p, page);
>          }
>          atomic_set(compound_mapcount_ptr(page), -1);
> +        if (hpage_pincount_available(page))
> +                atomic_set(compound_pincount_ptr(page), 0);
>  }
>
>  #ifdef CONFIG_DEBUG_PAGEALLOC
> diff --git a/mm/rmap.c b/mm/rmap.c
> index b3e381919835..e45b9b991e2f 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1178,6 +1178,9 @@ void page_add_new_anon_rmap(struct page *page,
>                  VM_BUG_ON_PAGE(!PageTransHuge(page), page);
>                  /* increment count (starts at -1) */
>                  atomic_set(compound_mapcount_ptr(page), 0);
> +                if (hpage_pincount_available(page))
> +                        atomic_set(compound_pincount_ptr(page), 0);
> +
>                  __inc_node_page_state(page, NR_ANON_THPS);
>          } else {
>                  /* Anon THP always mapped first with PMD */
> @@ -1974,6 +1977,9 @@ void hugepage_add_new_anon_rmap(struct page *page,
>  {
>          BUG_ON(address < vma->vm_start || address >= vma->vm_end);
>          atomic_set(compound_mapcount_ptr(page), 0);
> +        if (hpage_pincount_available(page))
> +                atomic_set(compound_pincount_ptr(page), 0);
> +
>          __page_set_anon_rmap(page, vma, address, 1);
>  }
>  #endif /* CONFIG_HUGETLB_PAGE */
> --
> 2.25.0
>

--
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR
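
As a footnote for readers who want to see the two counting schemes side by
side without digging through gup.c: the stand-alone, user-space sketch below
models the bookkeeping rules described in the patch. It is not kernel code;
the struct, the constant and the helpers only mirror the kernel names
(page->_refcount, page[2].hpage_pinned_refcount, GUP_PIN_COUNTING_BIAS), and
the plain, non-atomic functions are simplified stand-ins for try_grab_page(),
unpin_user_page() and page_maybe_dma_pinned().

/*
 * Illustrative model only: plain ints instead of atomics, no devmap or
 * error paths, and no real struct page.
 */
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

#define GUP_PIN_COUNTING_BIAS 1024      /* same value the kernel uses */

struct model_page {
        int refcount;           /* stands in for page->_refcount */
        int hpage_pinned;       /* stands in for page[2].hpage_pinned_refcount */
        unsigned int order;     /* compound order; 0 == a single 4 KB page */
};

/* Mirrors hpage_pincount_available(): a 3rd struct page needs order > 1. */
static bool pincount_available(const struct model_page *page)
{
        return page->order > 1;
}

/* FOLL_PIN path of try_grab_compound_head()/try_grab_page(), simplified. */
static void pin_page(struct model_page *page, int refs)
{
        if (pincount_available(page)) {
                /* Exact pin count, plus at least one normal reference so
                 * that the page really is held. */
                page->hpage_pinned += refs;
                page->refcount += refs;
        } else {
                page->refcount += refs * GUP_PIN_COUNTING_BIAS;
        }
}

/* unpin_user_page()/put_compound_head(), simplified. */
static void unpin_page(struct model_page *page, int refs)
{
        if (pincount_available(page)) {
                page->hpage_pinned -= refs;
                page->refcount -= refs;
        } else {
                page->refcount -= refs * GUP_PIN_COUNTING_BIAS;
        }
}

/* page_maybe_dma_pinned(): exact for big compound pages, fuzzy otherwise. */
static bool maybe_dma_pinned(const struct model_page *page)
{
        if (pincount_available(page))
                return page->hpage_pinned > 0;
        return page->refcount >= GUP_PIN_COUNTING_BIAS;
}

int main(void)
{
        struct model_page small = { .refcount = 1, .order = 0 };
        struct model_page thp   = { .refcount = 1, .order = 9 }; /* 2 MB THP */

        pin_page(&small, 1);    /* refcount jumps by 1024 */
        pin_page(&thp, 512);    /* pin all 512 subpages: +512, exact count 512 */

        printf("small: refcount=%d pinned=%d\n",
               small.refcount, maybe_dma_pinned(&small));
        printf("thp:   refcount=%d hpage_pinned=%d pinned=%d\n",
               thp.refcount, thp.hpage_pinned, maybe_dma_pinned(&thp));

        unpin_page(&small, 1);
        unpin_page(&thp, 512);
        assert(!maybe_dma_pinned(&small));
        assert(!maybe_dma_pinned(&thp));
        return 0;
}

In this model, pinning the 2 MB page once costs 512 refcount units plus an
exact count of 512, instead of 512 * GUP_PIN_COUNTING_BIAS = 524288 refcount
units under the bias-only scheme, which is the overflow pressure the patch
removes for compound pages of order > 1.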