The patch titled Subject: mm/gup: track FOLL_PIN pages has been added to the -mm tree. Its filename is mm-gup-track-foll_pin-pages.patch This patch should soon appear at http://ozlabs.org/~akpm/mmots/broken-out/mm-gup-track-foll_pin-pages.patch and later at http://ozlabs.org/~akpm/mmotm/broken-out/mm-gup-track-foll_pin-pages.patch Before you just go and hit "reply", please: a) Consider who else should be cc'ed b) Prefer to cc a suitable mailing list as well c) Ideally: find the original patch on the mailing list and do a reply-to-all to that, adding suitable additional cc's *** Remember to use Documentation/process/submit-checklist.rst when testing your code *** The -mm tree is included into linux-next and is updated there every 3-4 working days ------------------------------------------------------ From: John Hubbard <jhubbard@xxxxxxxxxx> Subject: mm/gup: track FOLL_PIN pages Add tracking of pages that were pinned via FOLL_PIN. As mentioned in the FOLL_PIN documentation, callers who effectively set FOLL_PIN are required to ultimately free such pages via unpin_user_page(). The effect is similar to FOLL_GET, and may be thought of as "FOLL_GET for DIO and/or RDMA use". Pages that have been pinned via FOLL_PIN are identifiable via a new function call: bool page_dma_pinned(struct page *page); What to do in response to encountering such a page, is left to later patchsets. There is discussion about this in [1], [2], and [3]. This also changes a BUG_ON(), to a WARN_ON(), in follow_page_mask(). [1] Some slow progress on get_user_pages() (Apr 2, 2019): https://lwn.net/Articles/784574/ [2] DMA and get_user_pages() (LPC: Dec 12, 2018): https://lwn.net/Articles/774411/ [3] The trouble with get_user_pages() (Apr 30, 2018): https://lwn.net/Articles/753027/ Link: http://lkml.kernel.org/r/20191209225344.99740-25-jhubbard@xxxxxxxxxx Signed-off-by: John Hubbard <jhubbard@xxxxxxxxxx> Suggested-by: Jan Kara <jack@xxxxxxx> Suggested-by: Jérôme Glisse <jglisse@xxxxxxxxxx> Cc: Alex Williamson <alex.williamson@xxxxxxxxxx> Cc: Al Viro <viro@xxxxxxxxxxxxxxxxxx> Cc: Aneesh Kumar K.V <aneesh.kumar@xxxxxxxxxxxxx> Cc: Benjamin Herrenschmidt <benh@xxxxxxxxxxxxxxxxxxx> Cc: Björn Töpel <bjorn.topel@xxxxxxxxx> Cc: Christoph Hellwig <hch@xxxxxx> Cc: Daniel Vetter <daniel@xxxxxxxx> Cc: Daniel Vetter <daniel.vetter@xxxxxxxx> Cc: Dan Williams <dan.j.williams@xxxxxxxxx> Cc: Dave Chinner <david@xxxxxxxxxxxxx> Cc: David Airlie <airlied@xxxxxxxx> Cc: "David S . Miller" <davem@xxxxxxxxxxxxx> Cc: Hans Verkuil <hverkuil-cisco@xxxxxxxxx> Cc: Ira Weiny <ira.weiny@xxxxxxxxx> Cc: Jason Gunthorpe <jgg@xxxxxxxxxxxx> Cc: Jason Gunthorpe <jgg@xxxxxxxx> Cc: Jens Axboe <axboe@xxxxxxxxx> Cc: Jonathan Corbet <corbet@xxxxxxx> Cc: Kirill A. Shutemov <kirill.shutemov@xxxxxxxxxxxxxxx> Cc: Leon Romanovsky <leonro@xxxxxxxxxxxx> Cc: Magnus Karlsson <magnus.karlsson@xxxxxxxxx> Cc: Mauro Carvalho Chehab <mchehab@xxxxxxxxxx> Cc: Michael Ellerman <mpe@xxxxxxxxxxxxxx> Cc: Michal Hocko <mhocko@xxxxxxxx> Cc: Mike Kravetz <mike.kravetz@xxxxxxxxxx> Cc: Mike Rapoport <rppt@xxxxxxxxxxxxx> Cc: Paul Mackerras <paulus@xxxxxxxxx> Cc: Shuah Khan <shuah@xxxxxxxxxx> Cc: Vlastimil Babka <vbabka@xxxxxxx> Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> --- Documentation/core-api/pin_user_pages.rst | 2 include/linux/mm.h | 75 +++- include/linux/mmzone.h | 2 include/linux/page_ref.h | 10 mm/gup.c | 338 +++++++++++++++----- mm/huge_memory.c | 23 + mm/hugetlb.c | 15 mm/vmstat.c | 2 8 files changed, 363 insertions(+), 104 deletions(-) --- a/Documentation/core-api/pin_user_pages.rst~mm-gup-track-foll_pin-pages +++ a/Documentation/core-api/pin_user_pages.rst @@ -53,7 +53,7 @@ Which flags are set by each wrapper For these pin_user_pages*() functions, FOLL_PIN is OR'd in with whatever gup flags the caller provides. The caller is required to pass in a non-null struct pages* array, and the function then pin pages by incrementing each by a special -value. For now, that value is +1, just like get_user_pages*().:: +value: GUP_PIN_COUNTING_BIAS.:: Function -------- --- a/include/linux/mm.h~mm-gup-track-foll_pin-pages +++ a/include/linux/mm.h @@ -1016,6 +1016,8 @@ static inline void get_page(struct page page_ref_inc(page); } +void grab_page(struct page *page, unsigned int flags); + static inline __must_check bool try_get_page(struct page *page) { page = compound_head(page); @@ -1044,29 +1046,70 @@ static inline void put_page(struct page __put_page(page); } -/** - * unpin_user_page() - release a gup-pinned page - * @page: pointer to page to be released +/* + * GUP_PIN_COUNTING_BIAS, and the associated functions that use it, overload + * the page's refcount so that two separate items are tracked: the original page + * reference count, and also a new count of how many pin_user_pages() calls were + * made against the page. ("gup-pinned" is another term for the latter). + * + * With this scheme, pin_user_pages() becomes special: such pages are marked as + * distinct from normal pages. As such, the unpin_user_page() call (and its + * variants) must be used in order to release gup-pinned pages. + * + * Choice of value: + * + * By making GUP_PIN_COUNTING_BIAS a power of two, debugging of page reference + * counts with respect to pin_user_pages() and unpin_user_page() becomes + * simpler, due to the fact that adding an even power of two to the page + * refcount has the effect of using only the upper N bits, for the code that + * counts up using the bias value. This means that the lower bits are left for + * the exclusive use of the original code that increments and decrements by one + * (or at least, by much smaller values than the bias value). * - * Pages that were pinned via pin_user_pages*() must be released via either - * unpin_user_page(), or one of the unpin_user_pages*() routines. This is so - * that eventually such pages can be separately tracked and uniquely handled. In - * particular, interactions with RDMA and filesystems need special handling. - * - * unpin_user_page() and put_page() are not interchangeable, despite this early - * implementation that makes them look the same. unpin_user_page() calls must - * be perfectly matched up with pin*() calls. + * Of course, once the lower bits overflow into the upper bits (and this is + * OK, because subtraction recovers the original values), then visual inspection + * no longer suffices to directly view the separate counts. However, for normal + * applications that don't have huge page reference counts, this won't be an + * issue. + * + * Locking: the lockless algorithm described in page_cache_get_speculative() + * and page_cache_gup_pin_speculative() provides safe operation for + * get_user_pages and page_mkclean and other calls that race to set up page + * table entries. */ -static inline void unpin_user_page(struct page *page) -{ - put_page(page); -} +#define GUP_PIN_COUNTING_BIAS (1UL << 10) +void unpin_user_page(struct page *page); void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages, bool make_dirty); - void unpin_user_pages(struct page **pages, unsigned long npages); +/** + * page_dma_pinned() - report if a page is pinned for DMA. + * + * This function checks if a page has been pinned via a call to + * pin_user_pages*(). + * + * The return value is partially fuzzy: false is not fuzzy, because it means + * "definitely not pinned for DMA", but true means "probably pinned for DMA, but + * possibly a false positive due to having at least GUP_PIN_COUNTING_BIAS worth + * of normal page references". + * + * False positives are OK, because: a) it's unlikely for a page to get that many + * refcounts, and b) all the callers of this routine are expected to be able to + * deal gracefully with a false positive. + * + * For more information, please see Documentation/vm/pin_user_pages.rst. + * + * @page: pointer to page to be queried. + * @Return: True, if it is likely that the page has been "dma-pinned". + * False, if the page is definitely not dma-pinned. + */ +static inline bool page_dma_pinned(struct page *page) +{ + return (page_ref_count(compound_head(page))) >= GUP_PIN_COUNTING_BIAS; +} + #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP) #define SECTION_IN_PAGE_FLAGS #endif --- a/include/linux/mmzone.h~mm-gup-track-foll_pin-pages +++ a/include/linux/mmzone.h @@ -244,6 +244,8 @@ enum node_stat_item { NR_DIRTIED, /* page dirtyings since bootup */ NR_WRITTEN, /* page writings since bootup */ NR_KERNEL_MISC_RECLAIMABLE, /* reclaimable non-slab kernel pages */ + NR_FOLL_PIN_REQUESTED, /* via: pin_user_page(), gup flag: FOLL_PIN */ + NR_FOLL_PIN_RETURNED, /* pages returned via unpin_user_page() */ NR_VM_NODE_STAT_ITEMS }; --- a/include/linux/page_ref.h~mm-gup-track-foll_pin-pages +++ a/include/linux/page_ref.h @@ -102,6 +102,16 @@ static inline void page_ref_sub(struct p __page_ref_mod(page, -nr); } +static inline int page_ref_sub_return(struct page *page, int nr) +{ + int ret = atomic_sub_return(nr, &page->_refcount); + + if (page_ref_tracepoint_active(__tracepoint_page_ref_mod)) + __page_ref_mod(page, -nr); + + return ret; +} + static inline void page_ref_inc(struct page *page) { atomic_inc(&page->_refcount); --- a/mm/gup.c~mm-gup-track-foll_pin-pages +++ a/mm/gup.c @@ -36,6 +36,20 @@ static __always_inline long __gup_longte struct page **pages, struct vm_area_struct **vmas, unsigned int flags); + +#ifdef CONFIG_DEBUG_VM +static inline void __update_proc_vmstat(struct page *page, + enum node_stat_item item, int count) +{ + mod_node_page_state(page_pgdat(page), item, count); +} +#else +static inline void __update_proc_vmstat(struct page *page, + enum node_stat_item item, int count) +{ +} +#endif + /* * Return the compound head page with ref appropriately incremented, * or NULL if that failed. @@ -52,6 +66,138 @@ static inline struct page *try_get_compo } /** + * try_pin_compound_head() - mark a compound page as being used by + * pin_user_pages*(). + * + * This is the FOLL_PIN counterpart to try_get_compound_head(). + * + * @page: pointer to page to be marked + * @Return: the compound head page, with ref appropriately incremented, + * or NULL upon failure. + */ +__must_check struct page *try_pin_compound_head(struct page *page, int refs) +{ + struct page *head = try_get_compound_head(page, + GUP_PIN_COUNTING_BIAS * refs); + if (!head) + return NULL; + + __update_proc_vmstat(page, NR_FOLL_PIN_REQUESTED, refs); + return head; +} + +/* + * try_grab_compound_head() - attempt to elevate a page's refcount, by a + * flags-dependent amount. + * + * This has a default assumption of "use FOLL_GET behavior, if FOLL_PIN is not + * set". + * + * "grab" names in this file mean, "look at flags to decide with to use FOLL_PIN + * or FOLL_GET behavior, when incrementing the page's refcount. + */ +static struct page *try_grab_compound_head(struct page *page, int refs, + unsigned int flags) +{ + if (flags & FOLL_PIN) + return try_pin_compound_head(page, refs); + + return try_get_compound_head(page, refs); +} + +/** + * grab_page() - elevate a page's refcount by a flag-dependent amount + * + * This might not do anything at all, depending on the flags argument. + * + * "grab" names in this file mean, "look at flags to decide with to use FOLL_PIN + * or FOLL_GET behavior, when incrementing the page's refcount. + * + * @page: pointer to page to be grabbed + * @flags: gup flags: these are the FOLL_* flag values. + * + * Either FOLL_PIN or FOLL_GET (or neither) may be set, but not both at the same + * time. (That's true throughout the get_user_pages*() and pin_user_pages*() + * APIs.) Cases: + * + * FOLL_GET: page's refcount will be incremented by 1. + * FOLL_PIN: page's refcount will be incremented by GUP_PIN_COUNTING_BIAS. + */ +void grab_page(struct page *page, unsigned int flags) +{ + if (flags & FOLL_GET) + get_page(page); + else if (flags & FOLL_PIN) { + get_page(page); + WARN_ON_ONCE(flags & FOLL_GET); + /* + * Use get_page(), above, to do the refcount error + * checking. Then just add in the remaining references: + */ + page_ref_add(page, GUP_PIN_COUNTING_BIAS - 1); + __update_proc_vmstat(page, NR_FOLL_PIN_REQUESTED, 1); + } +} + +#ifdef CONFIG_DEV_PAGEMAP_OPS +static bool __unpin_devmap_managed_user_page(struct page *page) +{ + bool is_devmap = page_is_devmap_managed(page); + + if (is_devmap) { + int count = page_ref_sub_return(page, GUP_PIN_COUNTING_BIAS); + + __update_proc_vmstat(page, NR_FOLL_PIN_RETURNED, 1); + /* + * devmap page refcounts are 1-based, rather than 0-based: if + * refcount is 1, then the page is free and the refcount is + * stable because nobody holds a reference on the page. + */ + if (count == 1) + free_devmap_managed_page(page); + else if (!count) + __put_page(page); + } + + return is_devmap; +} +#else +static bool __unpin_devmap_managed_user_page(struct page *page) +{ + return false; +} +#endif /* CONFIG_DEV_PAGEMAP_OPS */ + +/** + * unpin_user_page() - release a dma-pinned page + * @page: pointer to page to be released + * + * Pages that were pinned via pin_user_pages*() must be released via either + * unpin_user_page(), or one of the unpin_user_pages*() routines. This is so + * that such pages can be separately tracked and uniquely handled. In + * particular, interactions with RDMA and filesystems need special handling. + */ +void unpin_user_page(struct page *page) +{ + page = compound_head(page); + + /* + * For devmap managed pages we need to catch refcount transition from + * GUP_PIN_COUNTING_BIAS to 1, when refcount reach one it means the + * page is free and we need to inform the device driver through + * callback. See include/linux/memremap.h and HMM for details. + */ + if (__unpin_devmap_managed_user_page(page)) + return; + + if (page_ref_sub_and_test(page, GUP_PIN_COUNTING_BIAS)) + __put_page(page); + + __update_proc_vmstat(page, NR_FOLL_PIN_RETURNED, 1); +} +EXPORT_SYMBOL(unpin_user_page); + +/** * unpin_user_pages_dirty_lock() - release and optionally dirty gup-pinned pages * @pages: array of pages to be maybe marked dirty, and definitely released. * @npages: number of pages in the @pages array. @@ -237,10 +383,11 @@ retry: } page = vm_normal_page(vma, address, pte); - if (!page && pte_devmap(pte) && (flags & FOLL_GET)) { + if (!page && pte_devmap(pte) && (flags & (FOLL_GET | FOLL_PIN))) { /* - * Only return device mapping pages in the FOLL_GET case since - * they are only valid while holding the pgmap reference. + * Only return device mapping pages in the FOLL_GET or FOLL_PIN + * case since they are only valid while holding the pgmap + * reference. */ *pgmap = get_dev_pagemap(pte_pfn(pte), *pgmap); if (*pgmap) @@ -278,11 +425,23 @@ retry: goto retry; } - if (flags & FOLL_GET) { + if (flags & (FOLL_PIN | FOLL_GET)) { + /* + * Allow try_get_page() to take care of error handling, for + * both cases: FOLL_GET or FOLL_PIN: + */ if (unlikely(!try_get_page(page))) { page = ERR_PTR(-ENOMEM); goto out; } + + if (flags & FOLL_PIN) { + WARN_ON_ONCE(flags & FOLL_GET); + + /* We got a +1 refcount from try_get_page(), above. */ + page_ref_add(page, GUP_PIN_COUNTING_BIAS - 1); + __update_proc_vmstat(page, NR_FOLL_PIN_REQUESTED, 1); + } } if (flags & FOLL_TOUCH) { if ((flags & FOLL_WRITE) && @@ -544,8 +703,8 @@ static struct page *follow_page_mask(str /* make this handle hugepd */ page = follow_huge_addr(mm, address, flags & FOLL_WRITE); if (!IS_ERR(page)) { - BUG_ON(flags & FOLL_GET); - return page; + WARN_ON_ONCE(flags & (FOLL_GET | FOLL_PIN)); + return NULL; } pgd = pgd_offset(mm, address); @@ -1131,6 +1290,36 @@ static __always_inline long __get_user_p return pages_done; } +static long __get_user_pages_remote(struct task_struct *tsk, + struct mm_struct *mm, + unsigned long start, unsigned long nr_pages, + unsigned int gup_flags, struct page **pages, + struct vm_area_struct **vmas, int *locked) +{ + /* + * Parts of FOLL_LONGTERM behavior are incompatible with + * FAULT_FLAG_ALLOW_RETRY because of the FS DAX check requirement on + * vmas. However, this only comes up if locked is set, and there are + * callers that do request FOLL_LONGTERM, but do not set locked. So, + * allow what we can. + */ + if (gup_flags & FOLL_LONGTERM) { + if (WARN_ON_ONCE(locked)) + return -EINVAL; + /* + * This will check the vmas (even if our vmas arg is NULL) + * and return -ENOTSUPP if DAX isn't allowed in this case: + */ + return __gup_longterm_locked(tsk, mm, start, nr_pages, pages, + vmas, gup_flags | FOLL_TOUCH | + FOLL_REMOTE); + } + + return __get_user_pages_locked(tsk, mm, start, nr_pages, pages, vmas, + locked, + gup_flags | FOLL_TOUCH | FOLL_REMOTE); +} + /* * get_user_pages_remote() - pin user pages in memory * @tsk: the task_struct to use for page fault accounting, or @@ -1205,28 +1394,8 @@ long get_user_pages_remote(struct task_s if (WARN_ON_ONCE(gup_flags & FOLL_PIN)) return -EINVAL; - /* - * Parts of FOLL_LONGTERM behavior are incompatible with - * FAULT_FLAG_ALLOW_RETRY because of the FS DAX check requirement on - * vmas. However, this only comes up if locked is set, and there are - * callers that do request FOLL_LONGTERM, but do not set locked. So, - * allow what we can. - */ - if (gup_flags & FOLL_LONGTERM) { - if (WARN_ON_ONCE(locked)) - return -EINVAL; - /* - * This will check the vmas (even if our vmas arg is NULL) - * and return -ENOTSUPP if DAX isn't allowed in this case: - */ - return __gup_longterm_locked(tsk, mm, start, nr_pages, pages, - vmas, gup_flags | FOLL_TOUCH | - FOLL_REMOTE); - } - - return __get_user_pages_locked(tsk, mm, start, nr_pages, pages, vmas, - locked, - gup_flags | FOLL_TOUCH | FOLL_REMOTE); + return __get_user_pages_remote(tsk, mm, start, nr_pages, gup_flags, + pages, vmas, locked); } EXPORT_SYMBOL(get_user_pages_remote); @@ -1421,10 +1590,11 @@ finish_or_fault: return i ? : -EFAULT; } -long get_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm, - unsigned long start, unsigned long nr_pages, - unsigned int gup_flags, struct page **pages, - struct vm_area_struct **vmas, int *locked) +static long __get_user_pages_remote(struct task_struct *tsk, + struct mm_struct *mm, + unsigned long start, unsigned long nr_pages, + unsigned int gup_flags, struct page **pages, + struct vm_area_struct **vmas, int *locked) { return 0; } @@ -1864,13 +2034,17 @@ static inline pte_t gup_get_pte(pte_t *p #endif /* CONFIG_GUP_GET_PTE_LOW_HIGH */ static void __maybe_unused undo_dev_pagemap(int *nr, int nr_start, + unsigned int flags, struct page **pages) { while ((*nr) - nr_start) { struct page *page = pages[--(*nr)]; ClearPageReferenced(page); - put_page(page); + if (flags & FOLL_PIN) + unpin_user_page(page); + else + put_page(page); } } @@ -1903,7 +2077,7 @@ static int gup_pte_range(pmd_t pmd, unsi pgmap = get_dev_pagemap(pte_pfn(pte), pgmap); if (unlikely(!pgmap)) { - undo_dev_pagemap(nr, nr_start, pages); + undo_dev_pagemap(nr, nr_start, flags, pages); goto pte_unmap; } } else if (pte_special(pte)) @@ -1912,7 +2086,7 @@ static int gup_pte_range(pmd_t pmd, unsi VM_BUG_ON(!pfn_valid(pte_pfn(pte))); page = pte_page(pte); - head = try_get_compound_head(page, 1); + head = try_grab_compound_head(page, 1, flags); if (!head) goto pte_unmap; @@ -1968,12 +2142,12 @@ static int __gup_device_huge(unsigned lo pgmap = get_dev_pagemap(pfn, pgmap); if (unlikely(!pgmap)) { - undo_dev_pagemap(nr, nr_start, pages); + undo_dev_pagemap(nr, nr_start, flags, pages); return 0; } SetPageReferenced(page); pages[*nr] = page; - get_page(page); + grab_page(page, flags); (*nr)++; pfn++; } while (addr += PAGE_SIZE, addr != end); @@ -1995,7 +2169,7 @@ static int __gup_device_huge_pmd(pmd_t o return 0; if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) { - undo_dev_pagemap(nr, nr_start, pages); + undo_dev_pagemap(nr, nr_start, flags, pages); return 0; } return 1; @@ -2013,7 +2187,7 @@ static int __gup_device_huge_pud(pud_t o return 0; if (unlikely(pud_val(orig) != pud_val(*pudp))) { - undo_dev_pagemap(nr, nr_start, pages); + undo_dev_pagemap(nr, nr_start, flags, pages); return 0; } return 1; @@ -2047,8 +2221,11 @@ static int record_subpages(struct page * return nr; } -static void put_compound_head(struct page *page, int refs) +static void put_compound_head(struct page *page, int refs, unsigned int flags) { + if (flags & FOLL_PIN) + refs *= GUP_PIN_COUNTING_BIAS; + /* Do a get_page() first, in case refs == page->_refcount */ get_page(page); page_ref_sub(page, refs); @@ -2088,12 +2265,12 @@ static int gup_hugepte(pte_t *ptep, unsi page = head + ((addr & (sz-1)) >> PAGE_SHIFT); refs = record_subpages(page, addr, end, pages + *nr); - head = try_get_compound_head(head, refs); + head = try_grab_compound_head(head, refs, flags); if (!head) return 0; if (unlikely(pte_val(pte) != pte_val(*ptep))) { - put_compound_head(head, refs); + put_compound_head(head, refs, flags); return 0; } @@ -2148,12 +2325,12 @@ static int gup_huge_pmd(pmd_t orig, pmd_ page = pmd_page(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT); refs = record_subpages(page, addr, end, pages + *nr); - head = try_get_compound_head(pmd_page(orig), refs); + head = try_grab_compound_head(pmd_page(orig), refs, flags); if (!head) return 0; if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) { - put_compound_head(head, refs); + put_compound_head(head, refs, flags); return 0; } @@ -2182,12 +2359,12 @@ static int gup_huge_pud(pud_t orig, pud_ page = pud_page(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT); refs = record_subpages(page, addr, end, pages + *nr); - head = try_get_compound_head(pud_page(orig), refs); + head = try_grab_compound_head(pud_page(orig), refs, flags); if (!head) return 0; if (unlikely(pud_val(orig) != pud_val(*pudp))) { - put_compound_head(head, refs); + put_compound_head(head, refs, flags); return 0; } @@ -2211,12 +2388,12 @@ static int gup_huge_pgd(pgd_t orig, pgd_ page = pgd_page(orig) + ((addr & ~PGDIR_MASK) >> PAGE_SHIFT); refs = record_subpages(page, addr, end, pages + *nr); - head = try_get_compound_head(pgd_page(orig), refs); + head = try_grab_compound_head(pgd_page(orig), refs, flags); if (!head) return 0; if (unlikely(pgd_val(orig) != pgd_val(*pgdp))) { - put_compound_head(head, refs); + put_compound_head(head, refs, flags); return 0; } @@ -2517,9 +2694,12 @@ EXPORT_SYMBOL_GPL(get_user_pages_fast); /** * pin_user_pages_fast() - pin user pages in memory without taking locks * - * For now, this is a placeholder function, until various call sites are - * converted to use the correct get_user_pages*() or pin_user_pages*() API. So, - * this is identical to get_user_pages_fast(). + * Nearly the same as get_user_pages_fast(), except that FOLL_PIN is set. See + * get_user_pages_fast() for documentation on the function arguments, because + * the arguments here are identical. + * + * FOLL_PIN means that the pages must be released via unpin_user_page(). Please + * see Documentation/vm/pin_user_pages.rst for further details. * * This is intended for Case 1 (DIO) in Documentation/vm/pin_user_pages.rst. It * is NOT intended for Case 2 (RDMA: long-term pins). @@ -2527,21 +2707,24 @@ EXPORT_SYMBOL_GPL(get_user_pages_fast); int pin_user_pages_fast(unsigned long start, int nr_pages, unsigned int gup_flags, struct page **pages) { - /* - * This is a placeholder, until the pin functionality is activated. - * Until then, just behave like the corresponding get_user_pages*() - * routine. - */ - return get_user_pages_fast(start, nr_pages, gup_flags, pages); + /* FOLL_GET and FOLL_PIN are mutually exclusive. */ + if (WARN_ON_ONCE(gup_flags & FOLL_GET)) + return -EINVAL; + + gup_flags |= FOLL_PIN; + return internal_get_user_pages_fast(start, nr_pages, gup_flags, pages); } EXPORT_SYMBOL_GPL(pin_user_pages_fast); /** * pin_user_pages_remote() - pin pages of a remote process (task != current) * - * For now, this is a placeholder function, until various call sites are - * converted to use the correct get_user_pages*() or pin_user_pages*() API. So, - * this is identical to get_user_pages_remote(). + * Nearly the same as get_user_pages_remote(), except that FOLL_PIN is set. See + * get_user_pages_remote() for documentation on the function arguments, because + * the arguments here are identical. + * + * FOLL_PIN means that the pages must be released via unpin_user_page(). Please + * see Documentation/vm/pin_user_pages.rst for details. * * This is intended for Case 1 (DIO) in Documentation/vm/pin_user_pages.rst. It * is NOT intended for Case 2 (RDMA: long-term pins). @@ -2551,22 +2734,24 @@ long pin_user_pages_remote(struct task_s unsigned int gup_flags, struct page **pages, struct vm_area_struct **vmas, int *locked) { - /* - * This is a placeholder, until the pin functionality is activated. - * Until then, just behave like the corresponding get_user_pages*() - * routine. - */ - return get_user_pages_remote(tsk, mm, start, nr_pages, gup_flags, pages, - vmas, locked); + /* FOLL_GET and FOLL_PIN are mutually exclusive. */ + if (WARN_ON_ONCE(gup_flags & FOLL_GET)) + return -EINVAL; + + gup_flags |= FOLL_PIN; + return __get_user_pages_remote(tsk, mm, start, nr_pages, gup_flags, + pages, vmas, locked); } EXPORT_SYMBOL(pin_user_pages_remote); /** * pin_user_pages() - pin user pages in memory for use by other devices * - * For now, this is a placeholder function, until various call sites are - * converted to use the correct get_user_pages*() or pin_user_pages*() API. So, - * this is identical to get_user_pages(). + * Nearly the same as get_user_pages(), except that FOLL_TOUCH is not set, and + * FOLL_PIN is set. + * + * FOLL_PIN means that the pages must be released via unpin_user_page(). Please + * see Documentation/vm/pin_user_pages.rst for details. * * This is intended for Case 1 (DIO) in Documentation/vm/pin_user_pages.rst. It * is NOT intended for Case 2 (RDMA: long-term pins). @@ -2575,11 +2760,12 @@ long pin_user_pages(unsigned long start, unsigned int gup_flags, struct page **pages, struct vm_area_struct **vmas) { - /* - * This is a placeholder, until the pin functionality is activated. - * Until then, just behave like the corresponding get_user_pages*() - * routine. - */ - return get_user_pages(start, nr_pages, gup_flags, pages, vmas); + /* FOLL_GET and FOLL_PIN are mutually exclusive. */ + if (WARN_ON_ONCE(gup_flags & FOLL_GET)) + return -EINVAL; + + gup_flags |= FOLL_PIN; + return __gup_longterm_locked(current, current->mm, start, nr_pages, + pages, vmas, gup_flags); } EXPORT_SYMBOL(pin_user_pages); --- a/mm/huge_memory.c~mm-gup-track-foll_pin-pages +++ a/mm/huge_memory.c @@ -945,6 +945,11 @@ struct page *follow_devmap_pmd(struct vm */ WARN_ONCE(flags & FOLL_COW, "mm: In follow_devmap_pmd with FOLL_COW set"); + /* FOLL_GET and FOLL_PIN are mutually exclusive. */ + if (WARN_ON_ONCE((flags & (FOLL_PIN | FOLL_GET)) == + (FOLL_PIN | FOLL_GET))) + return NULL; + if (flags & FOLL_WRITE && !pmd_write(*pmd)) return NULL; @@ -960,7 +965,7 @@ struct page *follow_devmap_pmd(struct vm * device mapped pages can only be returned if the * caller will manage the page reference count. */ - if (!(flags & FOLL_GET)) + if (!(flags & (FOLL_GET | FOLL_PIN))) return ERR_PTR(-EEXIST); pfn += (addr & ~PMD_MASK) >> PAGE_SHIFT; @@ -968,7 +973,7 @@ struct page *follow_devmap_pmd(struct vm if (!*pgmap) return ERR_PTR(-EFAULT); page = pfn_to_page(pfn); - get_page(page); + grab_page(page, flags); return page; } @@ -1088,6 +1093,11 @@ struct page *follow_devmap_pud(struct vm if (flags & FOLL_WRITE && !pud_write(*pud)) return NULL; + /* FOLL_GET and FOLL_PIN are mutually exclusive. */ + if (WARN_ON_ONCE((flags & (FOLL_PIN | FOLL_GET)) == + (FOLL_PIN | FOLL_GET))) + return NULL; + if (pud_present(*pud) && pud_devmap(*pud)) /* pass */; else @@ -1099,8 +1109,10 @@ struct page *follow_devmap_pud(struct vm /* * device mapped pages can only be returned if the * caller will manage the page reference count. + * + * At least one of FOLL_GET | FOLL_PIN must be set, so assert that here: */ - if (!(flags & FOLL_GET)) + if (!(flags & (FOLL_GET | FOLL_PIN))) return ERR_PTR(-EEXIST); pfn += (addr & ~PUD_MASK) >> PAGE_SHIFT; @@ -1108,7 +1120,7 @@ struct page *follow_devmap_pud(struct vm if (!*pgmap) return ERR_PTR(-EFAULT); page = pfn_to_page(pfn); - get_page(page); + grab_page(page, flags); return page; } @@ -1522,8 +1534,7 @@ struct page *follow_trans_huge_pmd(struc skip_mlock: page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT; VM_BUG_ON_PAGE(!PageCompound(page) && !is_zone_device_page(page), page); - if (flags & FOLL_GET) - get_page(page); + grab_page(page, flags); out: return page; --- a/mm/hugetlb.c~mm-gup-track-foll_pin-pages +++ a/mm/hugetlb.c @@ -4356,7 +4356,7 @@ long follow_hugetlb_page(struct mm_struc same_page: if (pages) { pages[i] = mem_map_offset(page, pfn_offset); - get_page(pages[i]); + grab_page(pages[i], flags); } if (vmas) @@ -4916,6 +4916,12 @@ follow_huge_pmd(struct mm_struct *mm, un struct page *page = NULL; spinlock_t *ptl; pte_t pte; + + /* FOLL_GET and FOLL_PIN are mutually exclusive. */ + if (WARN_ON_ONCE((flags & (FOLL_PIN | FOLL_GET)) == + (FOLL_PIN | FOLL_GET))) + return NULL; + retry: ptl = pmd_lockptr(mm, pmd); spin_lock(ptl); @@ -4928,8 +4934,7 @@ retry: pte = huge_ptep_get((pte_t *)pmd); if (pte_present(pte)) { page = pmd_page(*pmd) + ((address & ~PMD_MASK) >> PAGE_SHIFT); - if (flags & FOLL_GET) - get_page(page); + grab_page(page, flags); } else { if (is_hugetlb_entry_migration(pte)) { spin_unlock(ptl); @@ -4950,7 +4955,7 @@ struct page * __weak follow_huge_pud(struct mm_struct *mm, unsigned long address, pud_t *pud, int flags) { - if (flags & FOLL_GET) + if (flags & (FOLL_GET | FOLL_PIN)) return NULL; return pte_page(*(pte_t *)pud) + ((address & ~PUD_MASK) >> PAGE_SHIFT); @@ -4959,7 +4964,7 @@ follow_huge_pud(struct mm_struct *mm, un struct page * __weak follow_huge_pgd(struct mm_struct *mm, unsigned long address, pgd_t *pgd, int flags) { - if (flags & FOLL_GET) + if (flags & (FOLL_GET | FOLL_PIN)) return NULL; return pte_page(*(pte_t *)pgd) + ((address & ~PGDIR_MASK) >> PAGE_SHIFT); --- a/mm/vmstat.c~mm-gup-track-foll_pin-pages +++ a/mm/vmstat.c @@ -1168,6 +1168,8 @@ const char * const vmstat_text[] = { "nr_dirtied", "nr_written", "nr_kernel_misc_reclaimable", + "nr_foll_pin_requested", + "nr_foll_pin_returned", /* enum writeback_stat_item counters */ "nr_dirty_threshold", _ Patches currently in -mm which might be from jhubbard@xxxxxxxxxx are mm-gup-factor-out-duplicate-code-from-four-routines.patch mm-gup-move-try_get_compound_head-to-top-fix-minor-issues.patch mm-devmap-refactor-1-based-refcounting-for-zone_device-pages.patch goldish_pipe-rename-local-pin_user_pages-routine.patch mm-fix-get_user_pages_remotes-handling-of-foll_longterm.patch vfio-fix-foll_longterm-use-simplify-get_user_pages_remote-call.patch mm-gup-allow-foll_force-for-get_user_pages_fast.patch ib-umem-use-get_user_pages_fast-to-pin-dma-pages.patch mm-gup-introduce-pin_user_pages-and-foll_pin.patch goldish_pipe-convert-to-pin_user_pages-and-put_user_page.patch ib-corehwumem-set-foll_pin-via-pin_user_pages-fix-up-odp.patch mm-process_vm_access-set-foll_pin-via-pin_user_pages_remote.patch drm-via-set-foll_pin-via-pin_user_pages_fast.patch fs-io_uring-set-foll_pin-via-pin_user_pages.patch net-xdp-set-foll_pin-via-pin_user_pages.patch media-v4l2-core-set-pages-dirty-upon-releasing-dma-buffers.patch media-v4l2-core-pin_user_pages-foll_pin-and-put_user_page-conversion.patch vfio-mm-pin_user_pages-foll_pin-and-put_user_page-conversion.patch powerpc-book3s64-convert-to-pin_user_pages-and-put_user_page.patch powerpc-book3s64-convert-to-pin_user_pages-and-put_user_page-fix.patch mm-gup_benchmark-use-proper-foll_write-flags-instead-of-hard-coding-1.patch mm-tree-wide-rename-put_user_page-to-unpin_user_page.patch mm-gup-pass-flags-arg-to-__gup_device_-functions.patch mm-gup-track-foll_pin-pages.patch mm-gup_benchmark-support-pin_user_pages-and-related-calls.patch selftests-vm-run_vmtests-invoke-gup_benchmark-with-basic-foll_pin-coverage.patch