Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

Jason Gunthorpe <jgg@xxxxxxxxxx> · Mon, 3 Jan 2022 20:33:52 -0400

On Fri, Dec 24, 2021 at 04:53:38AM +0000, Matthew Wilcox wrote:
> On Thu, Dec 23, 2021 at 10:53:09PM -0400, Jason Gunthorpe wrote:
> > On Thu, Dec 23, 2021 at 12:21:06AM +0000, Matthew Wilcox wrote:
> > > On Wed, Dec 22, 2021 at 02:09:41PM +0100, David Hildenbrand wrote:
> > > > Right, from an API perspective we really want people to use FOLL_PIN.
> > > > 
> > > > To optimize this case in particular it would help if we would have the
> > > > FOLL flags on the unpin path. Then we could just decide internally
> > > > "well, short-term R/O FOLL_PIN can be really lightweight, we can treat
> > > > this like a FOLL_GET instead". And we would need that as well if we were
> > > > to keep different counters for R/O vs. R/W pinned.
> > > 
> > > FYI, in my current tree, there's a gup_put_folio() which replaces
> > > put_compound_head:
> > > 
> > > static void gup_put_folio(struct folio *folio, int refs, unsigned int flags)
> > > {
> > >         if (flags & FOLL_PIN) {
> > >                 node_stat_mod_folio(folio, NR_FOLL_PIN_RELEASED, refs);
> > >                 if (hpage_pincount_available(&folio->page))
> > >                         hpage_pincount_sub(&folio->page, refs);
> > >                 else
> > >                         refs *= GUP_PIN_COUNTING_BIAS;
> > >         }
> > > 
> > >         folio_put_refs(folio, refs);
> > > }
> > > 
> > > That can become non-static if it's needed.  I'm still working on that
> > > series, because I'd like to get it to a point where we return one
> > > folio pointer instead of N page pointers.  Not quite there yet.
> > 
> > I'm keen to see what that looks like, every driver I'm working on that
> > calls PUP goes through gyrations to recover contiguous pages, so this
> > is most welcomed!
> 
> I'm about to take some time off, so alas, you won't see it any time
> soon.  It'd be good to talk with some of the interested users because
> it's actually a pretty tricky problem.

Sure, it is a good idea

> We can't just return an array of the struct folios because the
> actual memory you want to access might be anywhere in that folio,
> and you don't want to have to redo the lookup just to find out which
> subpages of the folio are meant.

Yep

> So I'm currently thinking about returning a bio_vec:
> 
> struct bio_vec {
>         struct page     *bv_page;
>         unsigned int    bv_len;
>         unsigned int    bv_offset;
> };

The cases I'm looking at basically want an efficient list of physical
addresses + lengths. They don't care about pages or folios, eg often
the next step is to build a SGL and DMA map it which largely ignores
all of that.

As the memory used to hold the output of pin_user_pages() is all
temporary there is a sensitivity to allocate the memory quicky, but
also to have enough of it so that we don't have to do redundant work
in pin_user_pages() - eg traversing to the same PMD table again and
again.

> But now that each component in it is variable length, the caller can't
> know how large an array of bio_vecs to allocate.

And the array entry is now 2x the size and there is no way to scatter
the array to 4k segments?

> 1. The callee can allocate the array and let the caller free it when it's
>    finished

It is not bad, but a bit tricky, alot of the GUP code executes in an
IRQ disabled state, so it has to use a pre-allocating scheme. We also
can't scan everything twice and hope it didn't change, so exact
preallocation doesn't seem likely either.

> 2. The caller passes in a (small, fixed-size, on-stack) array of bio_vecs
>    over (potentially) multiple calls.

It is slow, because we do redundant work traversing the same locks and
page tables again and again..

> 3. The caller can overallocate and ignore that most of the array isn't
>    used.
> 
> Any preferences?  I don't like #3.

#3 is OK for my applications because we immediately turn around and
copy the output to something else and free the memory anyhow...

However, being an array means we can't reliably allocate more than 4k
and with 16 bytes per entry that isn't even able to store a full PTE
table.

What would be nice for these cases is if the caller can supply an
array of 4k pages and GUP will fill them in. In many cases we'd
probably pass in up to 2M worth of pages or something.

There is some batching balance here where we want to minimize the
temporary memory consumed by GUP's output (and the time to allocate
it!) but also minimize the redundant work inside GUP repeatedly
walking the same tables and locks. 

eg ideally GUP would stop at some natural alignment boundary if it
could tell it can't fully fill the buffer. Then the next iteration
would not redo the same locks.

I was once thinking about something like storing an array of PFNs and
using the high bits to encode that the PFN is not 4k. It would allow
efficient packing of the common fragmented cases. To make it work
you'd need to have each 4k page grow the pfn list up from the start
and the pfn sizes down from the end. A size entry is only consumed if
the pfn bits can't encode the size directly so the packing can be a
perfect 8 bytes per PFN for the common 4k and 2M aligned cases.

Jason