On Sun, Nov 03, 2019 at 01:18:07PM -0800, John Hubbard wrote: > Add tracking of pages that were pinned via FOLL_PIN. > > As mentioned in the FOLL_PIN documentation, callers who effectively set > FOLL_PIN are required to ultimately free such pages via put_user_page(). > The effect is similar to FOLL_GET, and may be thought of as "FOLL_GET > for DIO and/or RDMA use". > > Pages that have been pinned via FOLL_PIN are identifiable via a > new function call: > > bool page_dma_pinned(struct page *page); > > What to do in response to encountering such a page, is left to later > patchsets. There is discussion about this in [1]. > > This also changes a BUG_ON(), to a WARN_ON(), in follow_page_mask(). > > This also has a couple of trivial, non-functional change fixes to > try_get_compound_head(). That function got moved to the top of the > file. Maybe split that as a separate trivial patch. > > This includes the following fix from Ira Weiny: > > DAX requires detection of a page crossing to a ref count of 1. Fix this > for GUP pages by introducing put_devmap_managed_user_page() which > accounts for GUP_PIN_COUNTING_BIAS now used by GUP. Please do the put_devmap_managed_page() changes in a separate patch, it would be a lot easier to follow, also on that front see comments below. > > [1] https://lwn.net/Articles/784574/ "Some slow progress on > get_user_pages()" > > Suggested-by: Jan Kara <jack@xxxxxxx> > Suggested-by: Jérôme Glisse <jglisse@xxxxxxxxxx> > Signed-off-by: Ira Weiny <ira.weiny@xxxxxxxxx> > Signed-off-by: John Hubbard <jhubbard@xxxxxxxxxx> > --- > include/linux/mm.h | 80 +++++++++++---- > include/linux/mmzone.h | 2 + > include/linux/page_ref.h | 10 ++ > mm/gup.c | 213 +++++++++++++++++++++++++++++++-------- > mm/huge_memory.c | 32 +++++- > mm/hugetlb.c | 28 ++++- > mm/memremap.c | 4 +- > mm/vmstat.c | 2 + > 8 files changed, 300 insertions(+), 71 deletions(-) > > diff --git a/include/linux/mm.h b/include/linux/mm.h > index cdfb6fedb271..03b3600843b7 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -972,9 +972,10 @@ static inline bool is_zone_device_page(const struct page *page) > #endif > > #ifdef CONFIG_DEV_PAGEMAP_OPS > -void __put_devmap_managed_page(struct page *page); > +void __put_devmap_managed_page(struct page *page, int count); > DECLARE_STATIC_KEY_FALSE(devmap_managed_key); > -static inline bool put_devmap_managed_page(struct page *page) > + > +static inline bool page_is_devmap_managed(struct page *page) > { > if (!static_branch_unlikely(&devmap_managed_key)) > return false; > @@ -983,7 +984,6 @@ static inline bool put_devmap_managed_page(struct page *page) > switch (page->pgmap->type) { > case MEMORY_DEVICE_PRIVATE: > case MEMORY_DEVICE_FS_DAX: > - __put_devmap_managed_page(page); > return true; > default: > break; > @@ -991,6 +991,19 @@ static inline bool put_devmap_managed_page(struct page *page) > return false; > } > > +static inline bool put_devmap_managed_page(struct page *page) > +{ > + bool is_devmap = page_is_devmap_managed(page); > + > + if (is_devmap) { > + int count = page_ref_dec_return(page); > + > + __put_devmap_managed_page(page, count); > + } > + > + return is_devmap; > +} I think the __put_devmap_managed_page() should be rename to free_devmap_managed_page() and that the count != 1 case move to this inline function ie: static inline bool put_devmap_managed_page(struct page *page) { bool is_devmap = page_is_devmap_managed(page); if (is_devmap) { int count = page_ref_dec_return(page); /* * If refcount is 1 then page is freed and refcount is stable as nobody * holds a reference on the page. */ if (count == 1) free_devmap_managed_page(page, count); else if (!count) __put_page(page); } return is_devmap; } > + > #else /* CONFIG_DEV_PAGEMAP_OPS */ > static inline bool put_devmap_managed_page(struct page *page) > { > @@ -1038,6 +1051,8 @@ static inline __must_check bool try_get_page(struct page *page) > return true; > } > > +__must_check bool user_page_ref_inc(struct page *page); > + What about having it as an inline here as it is pretty small. > static inline void put_page(struct page *page) > { > page = compound_head(page); > @@ -1055,31 +1070,56 @@ static inline void put_page(struct page *page) > __put_page(page); > } > > -/** > - * put_user_page() - release a gup-pinned page > - * @page: pointer to page to be released > +/* > + * GUP_PIN_COUNTING_BIAS, and the associated functions that use it, overload > + * the page's refcount so that two separate items are tracked: the original page > + * reference count, and also a new count of how many get_user_pages() calls were > + * made against the page. ("gup-pinned" is another term for the latter). > + * > + * With this scheme, get_user_pages() becomes special: such pages are marked > + * as distinct from normal pages. As such, the new put_user_page() call (and > + * its variants) must be used in order to release gup-pinned pages. > + * > + * Choice of value: > * > - * Pages that were pinned via get_user_pages*() must be released via > - * either put_user_page(), or one of the put_user_pages*() routines > - * below. This is so that eventually, pages that are pinned via > - * get_user_pages*() can be separately tracked and uniquely handled. In > - * particular, interactions with RDMA and filesystems need special > - * handling. > + * By making GUP_PIN_COUNTING_BIAS a power of two, debugging of page reference > + * counts with respect to get_user_pages() and put_user_page() becomes simpler, > + * due to the fact that adding an even power of two to the page refcount has > + * the effect of using only the upper N bits, for the code that counts up using > + * the bias value. This means that the lower bits are left for the exclusive > + * use of the original code that increments and decrements by one (or at least, > + * by much smaller values than the bias value). > * > - * put_user_page() and put_page() are not interchangeable, despite this early > - * implementation that makes them look the same. put_user_page() calls must > - * be perfectly matched up with get_user_page() calls. > + * Of course, once the lower bits overflow into the upper bits (and this is > + * OK, because subtraction recovers the original values), then visual inspection > + * no longer suffices to directly view the separate counts. However, for normal > + * applications that don't have huge page reference counts, this won't be an > + * issue. > + * > + * Locking: the lockless algorithm described in page_cache_get_speculative() > + * and page_cache_gup_pin_speculative() provides safe operation for > + * get_user_pages and page_mkclean and other calls that race to set up page > + * table entries. > */ > -static inline void put_user_page(struct page *page) > -{ > - put_page(page); > -} > +#define GUP_PIN_COUNTING_BIAS (1UL << 10) > > +void put_user_page(struct page *page); > void put_user_pages_dirty_lock(struct page **pages, unsigned long npages, > bool make_dirty); > - > void put_user_pages(struct page **pages, unsigned long npages); > > +/** > + * page_dma_pinned() - report if a page is pinned by a call to pin_user_pages*() > + * or pin_longterm_pages*() > + * @page: pointer to page to be queried. > + * @Return: True, if it is likely that the page has been "dma-pinned". > + * False, if the page is definitely not dma-pinned. > + */ Maybe add a small comment about wrap around :) > +static inline bool page_dma_pinned(struct page *page) > +{ > + return (page_ref_count(compound_head(page))) >= GUP_PIN_COUNTING_BIAS; > +} > + > #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP) > #define SECTION_IN_PAGE_FLAGS > #endif [...] > diff --git a/mm/gup.c b/mm/gup.c > index 1aea48427879..c9727e65fad3 100644 > --- a/mm/gup.c > +++ b/mm/gup.c [...] > @@ -1930,12 +2028,20 @@ static int __gup_device_huge(unsigned long pfn, unsigned long addr, > > pgmap = get_dev_pagemap(pfn, pgmap); > if (unlikely(!pgmap)) { > - undo_dev_pagemap(nr, nr_start, pages); > + undo_dev_pagemap(nr, nr_start, flags, pages); > return 0; > } > SetPageReferenced(page); > pages[*nr] = page; > - get_page(page); > + > + if (flags & FOLL_PIN) { > + if (unlikely(!user_page_ref_inc(page))) { > + undo_dev_pagemap(nr, nr_start, flags, pages); > + return 0; > + } Maybe add a comment about a case that should never happens ie user_page_ref_inc() fails after the second iteration of the loop as it would be broken and a bug to call undo_dev_pagemap() after the first iteration of that loop. Also i believe that this should never happens as if first iteration succeed than __page_cache_add_speculative() will succeed for all the iterations. Note that the pgmap case above follows that too ie the call to get_dev_pagemap() can only fail on first iteration of the loop, well i assume you can never have a huge device page that span different pgmap ie different devices (which is a reasonable assumption). So maybe this code needs fixing ie : pgmap = get_dev_pagemap(pfn, pgmap); if (unlikely(!pgmap)) return 0; > + } else > + get_page(page); > + > (*nr)++; > pfn++; > } while (addr += PAGE_SIZE, addr != end); [...] > @@ -2409,7 +2540,7 @@ static int internal_get_user_pages_fast(unsigned long start, int nr_pages, > unsigned long addr, len, end; > int nr = 0, ret = 0; > > - if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM))) > + if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM | FOLL_PIN))) Maybe add a comments to explain, something like: /* * The only flags allowed here are: FOLL_WRITE, FOLL_LONGTERM, FOLL_PIN * * Note that get_user_pages_fast() imply FOLL_GET flag by default but * callers can over-ride this default to pin case by setting FOLL_PIN. */ > return -EINVAL; > > start = untagged_addr(start) & PAGE_MASK; > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > index 13cc93785006..66bf4c8b88f1 100644 > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c [...] > @@ -968,7 +973,12 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr, > if (!*pgmap) > return ERR_PTR(-EFAULT); > page = pfn_to_page(pfn); > - get_page(page); > + > + if (flags & FOLL_GET) > + get_page(page); > + else if (flags & FOLL_PIN) > + if (unlikely(!user_page_ref_inc(page))) > + page = ERR_PTR(-ENOMEM); While i agree that user_page_ref_inc() (ie page_cache_add_speculative()) should never fails here as we are holding the pmd lock and thus no one can unmap the pmd and free the page it points to. I believe you should return -EFAULT like for the pgmap and not -ENOMEM as the pgmap should not fail either for the same reason. Thus it would be better to have consistent error. Maybe also add a comments explaining that it should not fail here. > > return page; > } [...] > @@ -1100,7 +1115,7 @@ struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr, > * device mapped pages can only be returned if the > * caller will manage the page reference count. > */ > - if (!(flags & FOLL_GET)) > + if (!(flags & (FOLL_GET | FOLL_PIN))) > return ERR_PTR(-EEXIST); Maybe add a comment that FOLL_GET or FOLL_PIN must be set. > pfn += (addr & ~PUD_MASK) >> PAGE_SHIFT; > @@ -1108,7 +1123,12 @@ struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr, > if (!*pgmap) > return ERR_PTR(-EFAULT); > page = pfn_to_page(pfn); > - get_page(page); > + > + if (flags & FOLL_GET) > + get_page(page); > + else if (flags & FOLL_PIN) > + if (unlikely(!user_page_ref_inc(page))) > + page = ERR_PTR(-ENOMEM); Same as for follow_devmap_pmd() see above. > > return page; > } > @@ -1522,8 +1542,12 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma, > skip_mlock: > page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT; > VM_BUG_ON_PAGE(!PageCompound(page) && !is_zone_device_page(page), page); > + > if (flags & FOLL_GET) > get_page(page); > + else if (flags & FOLL_PIN) > + if (unlikely(!user_page_ref_inc(page))) > + page = NULL; This should not fail either as we are holding the pmd lock maybe add a comment. Dunno if we want a WARN() or something to catch this degenerate case, or dump the page. > > out: > return page; > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > index b45a95363a84..da335b1cd798 100644 > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c > @@ -4462,7 +4462,17 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma, > same_page: > if (pages) { > pages[i] = mem_map_offset(page, pfn_offset); > - get_page(pages[i]); > + > + if (flags & FOLL_GET) > + get_page(pages[i]); > + else if (flags & FOLL_PIN) > + if (unlikely(!user_page_ref_inc(pages[i]))) { > + spin_unlock(ptl); > + remainder = 0; > + err = -ENOMEM; > + WARN_ON_ONCE(1); > + break; > + } > } user_page_ref_inc() should not fail here either because we hold the ptl, so the WAR_ON_ONCE() is right but maybe add a comment. > > if (vmas) [...] > @@ -5034,8 +5050,14 @@ follow_huge_pmd(struct mm_struct *mm, unsigned long address, > pte = huge_ptep_get((pte_t *)pmd); > if (pte_present(pte)) { > page = pmd_page(*pmd) + ((address & ~PMD_MASK) >> PAGE_SHIFT); > + > if (flags & FOLL_GET) > get_page(page); > + else if (flags & FOLL_PIN) > + if (unlikely(!user_page_ref_inc(page))) { > + page = NULL; > + goto out; > + } This should not fail either (again holding pmd lock), dunno if we want a warn or something to catch this degenerate case. > } else { > if (is_hugetlb_entry_migration(pte)) { > spin_unlock(ptl); [...]