On 11/18/19 3:58 AM, Jan Kara wrote:
> On Thu 14-11-19 21:53:33, John Hubbard wrote:
>> Add tracking of pages that were pinned via FOLL_PIN.
>>
>> As mentioned in the FOLL_PIN documentation, callers who effectively set
>> FOLL_PIN are required to ultimately free such pages via put_user_page().
>> The effect is similar to FOLL_GET, and may be thought of as "FOLL_GET
>> for DIO and/or RDMA use".
>>
>> Pages that have been pinned via FOLL_PIN are identifiable via a
>> new function call:
>>
>> bool page_dma_pinned(struct page *page);
>>
>> What to do in response to encountering such a page, is left to later
>> patchsets. There is discussion about this in [1].
>                                                 ^^ missing this reference
> in the changelog...

I'll add that.

>
>> This also changes a BUG_ON(), to a WARN_ON(), in follow_page_mask().
>>
>> Suggested-by: Jan Kara <jack@xxxxxxx>
>> Suggested-by: Jérôme Glisse <jglisse@xxxxxxxxxx>
>> Signed-off-by: John Hubbard <jhubbard@xxxxxxxxxx>
>> ---
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 6588d2e02628..db872766480f 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -1054,6 +1054,8 @@ static inline __must_check bool try_get_page(struct page *page)
>>  	return true;
>>  }
>>
>> +__must_check bool user_page_ref_inc(struct page *page);
>> +
>>  static inline void put_page(struct page *page)
>>  {
>>  	page = compound_head(page);
>> @@ -1071,29 +1073,70 @@ static inline void put_page(struct page *page)
>>  		__put_page(page);
>>  }
>>
>> -/**
>> - * put_user_page() - release a gup-pinned page
>> - * @page: pointer to page to be released
>> +/*
>> + * GUP_PIN_COUNTING_BIAS, and the associated functions that use it, overload
>> + * the page's refcount so that two separate items are tracked: the original page
>> + * reference count, and also a new count of how many get_user_pages() calls were
>                                                         ^^ pin_user_pages()
>
>> + * made against the page. ("gup-pinned" is another term for the latter).
>> + *
>> + * With this scheme, get_user_pages() becomes special: such pages are marked
>                        ^^^ pin_user_pages()
>
>> + * as distinct from normal pages. As such, the put_user_page() call (and its
>> + * variants) must be used in order to release gup-pinned pages.
>> + *
>> + * Choice of value:
>>   *
>> - * Pages that were pinned via pin_user_pages*() must be released via either
>> - * put_user_page(), or one of the put_user_pages*() routines. This is so that
>> - * eventually such pages can be separately tracked and uniquely handled. In
>> - * particular, interactions with RDMA and filesystems need special handling.
>> + * By making GUP_PIN_COUNTING_BIAS a power of two, debugging of page reference
>> + * counts with respect to get_user_pages() and put_user_page() becomes simpler,
>                             ^^^ pin_user_pages()
>

Yes.

>> + * due to the fact that adding an even power of two to the page refcount has
>> + * the effect of using only the upper N bits, for the code that counts up using
>> + * the bias value. This means that the lower bits are left for the exclusive
>> + * use of the original code that increments and decrements by one (or at least,
>> + * by much smaller values than the bias value).
>>   *
>> - * put_user_page() and put_page() are not interchangeable, despite this early
>> - * implementation that makes them look the same. put_user_page() calls must
>> - * be perfectly matched up with pin*() calls.
>> + * Of course, once the lower bits overflow into the upper bits (and this is
>> + * OK, because subtraction recovers the original values), then visual inspection
>> + * no longer suffices to directly view the separate counts. However, for normal
>> + * applications that don't have huge page reference counts, this won't be an
>> + * issue.
>> + *
>> + * Locking: the lockless algorithm described in page_cache_get_speculative()
>> + * and page_cache_gup_pin_speculative() provides safe operation for
>> + * get_user_pages and page_mkclean and other calls that race to set up page
>> + * table entries.
>>   */
> ...
>> @@ -2070,9 +2191,16 @@ static int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
>>  	page = head + ((addr & (sz-1)) >> PAGE_SHIFT);
>>  	refs = __record_subpages(page, addr, end, pages + *nr);
>>
>> -	head = try_get_compound_head(head, refs);
>> -	if (!head)
>> -		return 0;
>> +	if (flags & FOLL_PIN) {
>> +		head = page;
>> +		if (unlikely(!user_page_ref_inc(head)))
>> +			return 0;
>> +		head = page;
>
> Why do you assign 'head' twice? Also the refcounting logic is repeated
> several times so perhaps you can factor it out in to a helper function or
> even move it to __record_subpages()?

OK.

>
>> +	} else {
>> +		head = try_get_compound_head(head, refs);
>> +		if (!head)
>> +			return 0;
>> +	}
>>
>>  	if (unlikely(pte_val(pte) != pte_val(*ptep))) {
>>  		put_compound_head(head, refs);
>
> So this will do the wrong thing for FOLL_PIN. We took just one "pin"
> reference there but here we'll release 'refs' normal references AFAICT.
> Also the fact that you take just one pin reference for each huge page
> substantially changes how GUP refcounting works in the huge page case.
> Currently, FOLL_GET users can be completely agnostic of huge pages. So you
> can e.g. GUP whole 2 MB page, submit it as 2 different bios and then
> drop page references from each bio completion function. With your new
> FOLL_PIN behavior you cannot do that and I believe it will be a problem for
> some users. So I think you have to maintain the behavior that you increase
> the head->_refcount by (refs * GUP_PIN_COUNTING_BIAS) here.
>

Yes, completely agreed, this was a (big) oversight. I went through the
same reasoning and reached your conclusions, in __gup_device_huge(), but
then did it wrong in these functions. Will fix.


thanks,
--
John Hubbard
NVIDIA
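
As an aside, below is a minimal userspace-only sketch of the bias-counting
arithmetic discussed above, including the refs * GUP_PIN_COUNTING_BIAS
behavior for huge pages that Jan describes. This is not kernel code:
struct fake_page, pin_pages(), unpin_pages() and dma_pinned() are invented
stand-ins, and the bias value of 1024 is assumed here purely for
illustration; only the GUP_PIN_COUNTING_BIAS idea itself comes from the
patch.

/*
 * Userspace sketch only -- not kernel code. Shows how one refcount can
 * carry both ordinary references (lower bits) and bias-sized pin counts
 * (upper bits), and why a huge page wants refs * GUP_PIN_COUNTING_BIAS
 * rather than a single bias.
 */
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

#define GUP_PIN_COUNTING_BIAS 1024	/* assumed power-of-two value */

struct fake_page {
	long refcount;			/* stand-in for page->_refcount */
};

/* Pin nr_subpages subpages: one bias per subpage. */
static void pin_pages(struct fake_page *head, long nr_subpages)
{
	head->refcount += nr_subpages * GUP_PIN_COUNTING_BIAS;
}

static void unpin_pages(struct fake_page *head, long nr_subpages)
{
	head->refcount -= nr_subpages * GUP_PIN_COUNTING_BIAS;
}

/* Roughly what page_dma_pinned() tests: is any bias-sized count present? */
static bool dma_pinned(const struct fake_page *head)
{
	return head->refcount >= GUP_PIN_COUNTING_BIAS;
}

int main(void)
{
	struct fake_page head = { .refcount = 3 };	/* 3 ordinary refs */

	pin_pages(&head, 512);		/* pin a 2 MB huge page: 512 subpages */
	assert(dma_pinned(&head));

	/* Two separate bios can each drop their half of the subpage pins... */
	unpin_pages(&head, 256);
	unpin_pages(&head, 256);

	/* ...and subtraction recovers the original, ordinary refcount. */
	assert(!dma_pinned(&head));
	printf("ordinary refcount back to %ld\n", head.refcount);
	return 0;
}

Compiled as ordinary C, this prints "ordinary refcount back to 3": the
three original references survive a full pin/unpin cycle of all 512
subpages, which is the property the refs * GUP_PIN_COUNTING_BIAS approach
preserves for callers that release subpage references independently.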