On 1/10/21 11:30 AM, Linus Torvalds wrote:
On Sat, Jan 9, 2021 at 7:51 PM Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote: Just having a bit in the page flags for "I already made this exclusive, and fork() will keep it so" is I feel the best option. In a way, "page is writable" right now _is_ that bit. By definition, if you have a writable page in an anonymous mapping, that is an exclusive user. But because "writable" has these interactions with other operations, it would be better if it was a harder bit than that "maybe_pinned()", though. It would be lovely if a regular non-pinning write-GUP just always set it, for example. "maybe_pinned()" is good enough for the fork() case, which is the one that matters for long-term pinning. But it's admittedly not perfect.
There is at least one way to improve this part of it--maybe. IMHO, a lot of the bits in page _refcount are still being wasted (even after GUP_PIN_COUNTING_BIAS overloading), because it's unlikely that there are many callers of gup/pup per page. If anyone points out that that is wrong, then the rest of this falls apart, but...if we were to make a rule that "only a very few FOLL_GET or FOLL_PIN pins are ever simultaneously allowed on a given page", then several things become possible: 1) "maybe dma pinned" could become "very likely dma-pinned", by allowing perhaps 23 bits for normal page refcounts (leaving just 8 bits for dma pins), thus all but ensuring no aliasing between normal refcounts and dma pins. The name change below to "likely" is not actually necessary, I'm just using it to make the point clear: diff --git a/include/linux/mm.h b/include/linux/mm.h index ecdf8a8cd6ae..28f7f64e888f 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1241,7 +1241,7 @@ static inline void put_page(struct page *page) * get_user_pages and page_mkclean and other calls that race to set up page * table entries. */ -#define GUP_PIN_COUNTING_BIAS (1U << 10) +#define GUP_PIN_COUNTING_BIAS (1U << 23) void unpin_user_page(struct page *page); void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages, @@ -1274,7 +1274,7 @@ void unpin_user_pages(struct page **pages, unsigned long npages); * @Return: True, if it is likely that the page has been "dma-pinned". * False, if the page is definitely not dma-pinned. */ -static inline bool page_maybe_dma_pinned(struct page *page) +static inline bool page_likely_dma_pinned(struct page *page) { if (hpage_pincount_available(page)) return compound_pincount(page) > 0; 2) Doing (1) allows, arguably, failing mprotect calls if page_likely_dma_pinned() returns true, because the level of confidence is much higher now. 3) An additional counter, for normal gup (FOLL_GET) is possible: divide up the top 8 bits into two separate counters of just 4 bits each. Only allow either FOLL_GET or FOLL_PIN (and btw, I'm now mentally equating FOLL_PIN with FOLL_LONGTERM), not both, on a given page. Like this: diff --git a/include/linux/mm.h b/include/linux/mm.h index ecdf8a8cd6ae..ce5af27e3057 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1241,7 +1241,8 @@ static inline void put_page(struct page *page) * get_user_pages and page_mkclean and other calls that race to set up page * table entries. */ -#define GUP_PIN_COUNTING_BIAS (1U << 10) +#define DMA_PIN_COUNTING_BIAS (1U << 23) +#define GUP_PIN_COUNTING_BIAS (1U << 27) void unpin_user_page(struct page *page); void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages, So: FOLL_PIN: would use DMA_PIN_COUNTING_BIAS to increment page refcount. These are long term pins for dma. FOLL_GET: would use GUP_PIN_COUNTING_BIAS to increment page refcount. These are not long term pins. Doing (3) would require yet another release call variant: unpin_user_pages() would need to take a flag to say which refcount to release: FOLL_GET or FOLL_PIN. However, I think that's relatively easy (compared to the original effort of teasing apart generic put_page() calls, into unpin_user_pages() calls). We already have all the unpin_user_pages() calls in place, and so it's "merely" a matter of adding a flag to 74 call sites. The real question is whether we can get away with supporting only a very low count of FOLL_PIN and FOLL_GET pages. Can we? thanks, -- John Hubbard NVIDIA