On Tue, Dec 8, 2020 at 9:59 PM John Hubbard <jhubbard@xxxxxxxxxx> wrote:
>
> On 12/8/20 9:28 AM, Joao Martins wrote:
> > Add a new flag for struct dev_pagemap which designates that a a pagemap
>
> a a
>
> > is described as a set of compound pages or in other words, that how
> > pages are grouped together in the page tables are reflected in how we
> > describe struct pages. This means that rather than initializing
> > individual struct pages, we also initialize these struct pages, as
>
> Let's not say "rather than x, we also do y", because it's self-contradictory.
> I think you want to just leave out the "also", like this:
>
> "This means that rather than initializing individual struct pages, we
> initialize these struct pages ..."
>
> Is that right?
>
> > compound pages (on x86: 2M or 1G compound pages)
> >
> > For certain ZONE_DEVICE users, like device-dax, which have a fixed page
> > size, this creates an opportunity to optimize GUP and GUP-fast walkers,
> > thus playing the same tricks as hugetlb pages.
> >
> > Signed-off-by: Joao Martins <joao.m.martins@xxxxxxxxxx>
> > ---
> >  include/linux/memremap.h | 2 ++
> >  mm/memremap.c            | 8 ++++++--
> >  mm/page_alloc.c          | 7 +++++++
> >  3 files changed, 15 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> > index 79c49e7f5c30..f8f26b2cc3da 100644
> > --- a/include/linux/memremap.h
> > +++ b/include/linux/memremap.h
> > @@ -90,6 +90,7 @@ struct dev_pagemap_ops {
> >  };
> >
> >  #define PGMAP_ALTMAP_VALID	(1 << 0)
> > +#define PGMAP_COMPOUND		(1 << 1)
> >
> >  /**
> >   * struct dev_pagemap - metadata for ZONE_DEVICE mappings
> > @@ -114,6 +115,7 @@ struct dev_pagemap {
> >  	struct completion done;
> >  	enum memory_type type;
> >  	unsigned int flags;
> > +	unsigned int align;
>
> This also needs an "@align" entry in the comment block above.
>
> >  	const struct dev_pagemap_ops *ops;
> >  	void *owner;
> >  	int nr_range;
> > diff --git a/mm/memremap.c b/mm/memremap.c
> > index 16b2fb482da1..287a24b7a65a 100644
> > --- a/mm/memremap.c
> > +++ b/mm/memremap.c
> > @@ -277,8 +277,12 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params,
> >  	memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
> >  				PHYS_PFN(range->start),
> >  				PHYS_PFN(range_len(range)), pgmap);
> > -	percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id)
> > -			- pfn_first(pgmap, range_id));
> > +	if (pgmap->flags & PGMAP_COMPOUND)
> > +		percpu_ref_get_many(pgmap->ref, (pfn_end(pgmap, range_id)
> > +			- pfn_first(pgmap, range_id)) / PHYS_PFN(pgmap->align));
>
> Is there some reason that we cannot use range_len(), instead of pfn_end() minus
> pfn_first()? (Yes, this is more about the pre-existing code than about your change.)
>
> And if not, then why are the nearby range_len() uses OK? I realize that range_len()
> is simpler and skips a case, but it's not clear that it's required here. But I'm
> new to this area so be warned. :)

There's a subtle distinction between the range that was passed in and
the pfns that are activated inside of it. See the offset trickery in
pfn_first().

> Also, dividing by PHYS_PFN() feels quite misleading: that function does what you
> happen to want, but is not named accordingly. Can you use or create something
> more accurately named? Like "number of pages in this large page"?

It's not the number of pages in a large page, it's converting bytes to
pages. Other places in the kernel write it as (x >> PAGE_SHIFT), but my
thought process was: if I'm going to add () I might as well use a macro
that already does this.
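For illustration, a minimal sketch of the units involved (the variable
names below are mine, not from the patch), assuming the usual
PHYS_PFN() definition from include/linux/pfn.h:

	/*
	 * PHYS_PFN() turns a byte quantity into a page count:
	 * PHYS_PFN(x) == (unsigned long)((x) >> PAGE_SHIFT)
	 */
	unsigned long pfns_per_compound = PHYS_PFN(pgmap->align);

	/* e.g. a 2M align with 4K pages: 0x200000 >> 12 == 512 */
	unsigned long refs = nr_pfns / pfns_per_compound;

...where nr_pfns stands in for pfn_end() - pfn_first().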
That said, I think this calculation is broken precisely because
pfn_first() makes the result unaligned (a rough sketch of one way to
handle that is at the end of this mail). Rather than fix the unaligned
pfn_first() problem, I would use this support as an opportunity to
revisit the option of storing pages in the vmem_altmap reserve space.
The altmap's whole reason for existence was that 1.5% of large PMEM
might completely swamp DRAM. However, if that overhead is reduced by an
order (or orders) of magnitude, the primary need for vmem_altmap
vanishes. Now, we'll still need to keep it around for the ->align ==
PAGE_SIZE case, but for the most part existing deployments that are
specifying a page map on PMEM with an align > PAGE_SIZE can instead
just transparently be upgraded to a page map on a smaller amount of
DRAM.

> > +	else
> > +		percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id)
> > +			- pfn_first(pgmap, range_id));
> >  	return 0;
> >
> >  err_add_memory:
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index eaa227a479e4..9716ecd58e29 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -6116,6 +6116,8 @@ void __ref memmap_init_zone_device(struct zone *zone,
> >  	unsigned long pfn, end_pfn = start_pfn + nr_pages;
> >  	struct pglist_data *pgdat = zone->zone_pgdat;
> >  	struct vmem_altmap *altmap = pgmap_altmap(pgmap);
> > +	bool compound = pgmap->flags & PGMAP_COMPOUND;
> > +	unsigned int align = PHYS_PFN(pgmap->align);
>
> Maybe align_pfn or pfn_align? Don't want the same name for things that are actually
> different types, in meaning anyway.

Good catch.
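As flagged above, here is a rough, hypothetical sketch of what an
alignment-aware version of that refcount calculation could look like --
a sketch of one option using ALIGN(), not a claim about how the final
patch should handle it:

	/*
	 * Hypothetical only: round the first activated pfn up to a
	 * compound-page boundary before dividing, so a misaligned
	 * pfn_first() cannot skew the number of references taken.
	 */
	unsigned long pfns_per_compound = PHYS_PFN(pgmap->align);
	unsigned long first = ALIGN(pfn_first(pgmap, range_id),
				    pfns_per_compound);

	percpu_ref_get_many(pgmap->ref,
			(pfn_end(pgmap, range_id) - first) / pfns_per_compound);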