Re: [PATCH 27/27] mm/hugetlb: enable bootmem allocation from CMA areas

Frank van der Linden <fvdl@xxxxxxxxxx> · Wed, 29 Jan 2025 14:43:45 -0800

On Tue, Jan 28, 2025 at 9:52 AM Frank van der Linden <fvdl@xxxxxxxxxx> wrote:
>
> Hi Christophe, thanks for your comments. Replies inline below.
>
> On Tue, Jan 28, 2025 at 12:55 AM Christophe Leroy
> <christophe.leroy@xxxxxxxxxx> wrote:
> >
> >
> >
> > Le 28/01/2025 à 00:22, Frank van der Linden a écrit :
> > > If hugetlb_cma_only is enabled, we know that hugetlb pages
> > > can only be allocated from CMA. Now that there is an interface
> > > to do early reservations from a CMA area (returning memblock
> > > memory), it can be used to allocate hugetlb pages from CMA.
> > >
> > > This also allows for doing pre-HVO on these pages (if enabled).
> > >
> > > Make sure to initialize the page structures and associated data
> > > correctly. Create a flag to signal that a hugetlb page has been
> > > allocated from CMA to make things a little easier.
> > >
> > > Some configurations of powerpc have a special hugetlb bootmem
> > > allocator, so introduce a boolean arch_specific_huge_bootmem_alloc
> > > that returns true if such an allocator is present. In that case,
> > > CMA bootmem allocations can't be used, so check that function
> > > before trying.
> > >
> > > Cc: Madhavan Srinivasan <maddy@xxxxxxxxxxxxx>
> > > Cc: Michael Ellerman <mpe@xxxxxxxxxxxxxx>
> > > Cc: linuxppc-dev@xxxxxxxxxxxxxxxx
> > > Signed-off-by: Frank van der Linden <fvdl@xxxxxxxxxx>
> > > ---
> > >   arch/powerpc/mm/hugetlbpage.c |   5 ++
> > >   include/linux/hugetlb.h       |   7 ++
> > >   mm/hugetlb.c                  | 135 +++++++++++++++++++++++++---------
> > >   3 files changed, 114 insertions(+), 33 deletions(-)
> > >
> > > diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
> > > index d3c1b749dcfc..e53e4b4c8ef6 100644
> > > --- a/arch/powerpc/mm/hugetlbpage.c
> > > +++ b/arch/powerpc/mm/hugetlbpage.c
> > > @@ -121,6 +121,11 @@ bool __init hugetlb_node_alloc_supported(void)
> > >   {
> > >       return false;
> > >   }
> > > +
> > > +bool __init arch_specific_huge_bootmem_alloc(struct hstate *h)
> > > +{
> > > +     return (firmware_has_feature(FW_FEATURE_LPAR) && !radix_enabled());
> > > +}
> > >   #endif
> > >
> > >
> > > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > > index 2512463bca49..bca3052fb175 100644
> > > --- a/include/linux/hugetlb.h
> > > +++ b/include/linux/hugetlb.h
> > > @@ -591,6 +591,7 @@ enum hugetlb_page_flags {
> > >       HPG_freed,
> > >       HPG_vmemmap_optimized,
> > >       HPG_raw_hwp_unreliable,
> > > +     HPG_cma,
> > >       __NR_HPAGEFLAGS,
> > >   };
> > >
> > > @@ -650,6 +651,7 @@ HPAGEFLAG(Temporary, temporary)
> > >   HPAGEFLAG(Freed, freed)
> > >   HPAGEFLAG(VmemmapOptimized, vmemmap_optimized)
> > >   HPAGEFLAG(RawHwpUnreliable, raw_hwp_unreliable)
> > > +HPAGEFLAG(Cma, cma)
> > >
> > >   #ifdef CONFIG_HUGETLB_PAGE
> > >
> > > @@ -678,14 +680,18 @@ struct hstate {
> > >       char name[HSTATE_NAME_LEN];
> > >   };
> > >
> > > +struct cma;
> > > +
> > >   struct huge_bootmem_page {
> > >       struct list_head list;
> > >       struct hstate *hstate;
> > >       unsigned long flags;
> > > +     struct cma *cma;
> > >   };
> > >
> > >   #define HUGE_BOOTMEM_HVO            0x0001
> > >   #define HUGE_BOOTMEM_ZONES_VALID    0x0002
> > > +#define HUGE_BOOTMEM_CMA             0x0004
> > >
> > >   bool hugetlb_bootmem_page_zones_valid(int nid, struct huge_bootmem_page *m);
> > >
> > > @@ -711,6 +717,7 @@ bool __init hugetlb_node_alloc_supported(void);
> > >
> > >   void __init hugetlb_add_hstate(unsigned order);
> > >   bool __init arch_hugetlb_valid_size(unsigned long size);
> > > +bool __init arch_specific_huge_bootmem_alloc(struct hstate *h);
> >
> > Why adding 'specific' in the name ? Prefixing a function name with arch_
> > is usually enough to denote an architecture specific function.
>
> True, yes. That should probably be arch_has_huge_bootmem_alloc, I will
> change that.
>
> >
> > >   struct hstate *size_to_hstate(unsigned long size);
> > >
> > >   #ifndef HUGE_MAX_HSTATE
> > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > > index 32ebde9039e2..183e8d0c2fb4 100644
> > > --- a/mm/hugetlb.c
> > > +++ b/mm/hugetlb.c
> > > @@ -61,7 +61,7 @@ static struct cma *hugetlb_cma[MAX_NUMNODES];
> > >   static unsigned long hugetlb_cma_size_in_node[MAX_NUMNODES] __initdata;
> > >   #endif
> > >   static bool hugetlb_cma_only;
> > > -static unsigned long hugetlb_cma_size __initdata;
> > > +static unsigned long hugetlb_cma_size;
> >
> > Why remove __initdata ? As far as I can see it is used only in
> > hugetlb_early_cma() and hugetlb_hstate_alloc_pages() which are __init
> > functions.
>
> hugetlb_cma_size is now used in alloc_gigantic_folio(), which is not
> an __init function. However, you got me thinking: since
> hugetlb_cma_only is only effective when hugetlb_cma_size != 0, I can
> just reset hugetlb_cma_only to false if hugetlb_cma_size == 0 after
> parsing the commandline arguments. This will revert hugetlb_cma_size
> to __initdata, and simplifies things a bit. I'll make that change in
> v2.
> >
> > >
> > >   __initdata struct list_head huge_boot_pages[MAX_NUMNODES];
> > >   __initdata unsigned long hstate_boot_nrinvalid[HUGE_MAX_HSTATE];
> > > @@ -132,8 +132,10 @@ static void hugetlb_free_folio(struct folio *folio)
> > >   #ifdef CONFIG_CMA
> > >       int nid = folio_nid(folio);
> > >
> > > -     if (cma_free_folio(hugetlb_cma[nid], folio))
> > > +     if (folio_test_hugetlb_cma(folio)) {
> > > +             WARN_ON(!cma_free_folio(hugetlb_cma[nid], folio));
> >
> > Is that WARN_ON() needed ? See
> > https://docs.kernel.org/process/coding-style.html#do-not-crash-the-kernel
>
> Not strictly, I suppose, but if there is a CMA-allocated hugetlb
> folio, and cma_free fails, that would be a condition worthy of a
> warning, as the flag somehow got corrupted or there is an internal CMA
> error.  How about WARN_ON_ONCE?
>
> >
> >
> > >               return;
> > > +     }
> > >   #endif
> > >       folio_put(folio);
> > >   }
> > > @@ -1509,6 +1511,9 @@ static struct folio *alloc_gigantic_folio(struct hstate *h, gfp_t gfp_mask,
> > >                                       break;
> > >                       }
> > >               }
> > > +
> > > +             if (folio)
> > > +                     folio_set_hugetlb_cma(folio);
> > >       }
> > >   #endif
> > >       if (!folio) {
> > > @@ -3175,6 +3180,63 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
> > >       return ERR_PTR(-ENOSPC);
> > >   }
> > >
> > > +/*
> > > + * Some architectures do their own bootmem allocation, so they can't use
> > > + * early CMA allocation. So, allow for this function to be redefined.
> > > + */
> > > +bool __init __attribute((weak))
> >
> > Can't you use __weak ?
> >
> > By the way, do we really need a weak function here ? Can't it be a
> > static inline helper that gets waived by a macro defined by the arch,
> > something like:
> >
> > #ifndef arch_huge_bootmem_alloc
> > static inline arch_huge_bootmem_alloc(struct hstate *h)
> > {
> >         return false;
> > }
> > #endif
> >
> > Then powerpc does:
> >
> > #define arch_huge_bootmem_alloc arch_huge_bootmem_alloc
> > static inline arch_huge_bootmem_alloc(struct hstate *h)
> > {
> >         return (firmware_has_feature(FW_FEATURE_LPAR) && !radix_enabled());
> > }
> >
>
> Fair enough, yeah. I used a weak symbol because that was already used
> for powerpc with alloc_bootmem_huge_page(), but I can change this one.
>
> >
> > But why is struct hstate *h parameter needed ? Seems like noone uses it.
>
> Correct - I merely extrapolated a bit and thought "well, architectures
> might have bootmem hugetlb allocators that only deal with certain
> sizes". But then again, like you say, there is currently no need for
> it. I'll remove the argument.
>
> >
> > > +arch_specific_huge_bootmem_alloc(struct hstate *h)
> > > +{
> > > +     return false;
> > > +}
> > > +
> > > +static bool __init hugetlb_early_cma(struct hstate *h)
> > > +{
> > > +     if (arch_specific_huge_bootmem_alloc(h))
> > > +             return false;
> > > +
> > > +     return (hstate_is_gigantic(h) && hugetlb_cma_size && hugetlb_cma_only);
> > > +}
> > > +
> > > +static __init void *alloc_bootmem(struct hstate *h, int nid)
> > > +{
> > > +     struct huge_bootmem_page *m;
> > > +     unsigned long flags;
> > > +     struct cma *cma;
> > > +
> > > +#ifdef CONFIG_CMA
> >
> >   #ifdefs in C files should be avoided, see
> > https://docs.kernel.org/process/coding-style.html#conditional-compilation
> >
> > > +     if (hugetlb_early_cma(h)) {
> > > +             flags = HUGE_BOOTMEM_CMA;
> > > +             cma = hugetlb_cma[nid];
> > > +             m = cma_reserve_early(cma, huge_page_size(h));
> > > +     } else
> > > +#endif
> >
> > This kind of if/else construct in uggly, should be avoided.
> >
>
> I found this ifdef hard to avoid, sadly, I tried various ways to avoid
> it (If (IS_ENABLED(CONFIG_CMA), etc), but came up short. I'll have
> another look for v2, but short of trying to split off all CMA-related
> code in to a different file, which would definitely be out of scope
> here, it might not end up being better.

I ended up moving the hugetlb_cma code to its own file anyway, which
is a new patch at the end of the v2 series I just sent.

- Frank