Re: [PATCH] mm/page_alloc: skip THP-sized PCP list when allocating non-CMA THP-sized page

Barry Song <21cnbao@xxxxxxxxx> · Wed, 19 Jun 2024 20:20:40 +1200

On Wed, Jun 19, 2024 at 5:35 PM Ge Yang <yangge1116@xxxxxxx> wrote:
>
>
>
> 在 2024/6/18 15:51, yangge1116 写道:
> >
> >
> > 在 2024/6/18 下午2:58, Barry Song 写道:
> >> On Tue, Jun 18, 2024 at 6:56 PM yangge1116 <yangge1116@xxxxxxx> wrote:
> >>>
> >>>
> >>>
> >>> 在 2024/6/18 下午12:10, Barry Song 写道:
> >>>> On Tue, Jun 18, 2024 at 3:32 PM yangge1116 <yangge1116@xxxxxxx> wrote:
> >>>>>
> >>>>>
> >>>>>
> >>>>> 在 2024/6/18 上午9:55, Barry Song 写道:
> >>>>>> On Tue, Jun 18, 2024 at 9:36 AM yangge1116 <yangge1116@xxxxxxx>
> >>>>>> wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> 在 2024/6/17 下午8:47, yangge1116 写道:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> 在 2024/6/17 下午6:26, Barry Song 写道:
> >>>>>>>>> On Tue, Jun 4, 2024 at 9:15 PM <yangge1116@xxxxxxx> wrote:
> >>>>>>>>>>
> >>>>>>>>>> From: yangge <yangge1116@xxxxxxx>
> >>>>>>>>>>
> >>>>>>>>>> Since commit 5d0a661d808f ("mm/page_alloc: use only one PCP
> >>>>>>>>>> list for
> >>>>>>>>>> THP-sized allocations") no longer differentiates the migration
> >>>>>>>>>> type
> >>>>>>>>>> of pages in THP-sized PCP list, it's possible to get a CMA
> >>>>>>>>>> page from
> >>>>>>>>>> the list, in some cases, it's not acceptable, for example,
> >>>>>>>>>> allocating
> >>>>>>>>>> a non-CMA page with PF_MEMALLOC_PIN flag returns a CMA page.
> >>>>>>>>>>
> >>>>>>>>>> The patch forbids allocating non-CMA THP-sized page from
> >>>>>>>>>> THP-sized
> >>>>>>>>>> PCP list to avoid the issue above.
> >>>>>>>>>
> >>>>>>>>> Could you please describe the impact on users in the commit log?
> >>>>>>>>
> >>>>>>>> If a large number of CMA memory are configured in the system (for
> >>>>>>>> example, the CMA memory accounts for 50% of the system memory),
> >>>>>>>> starting
> >>>>>>>> virtual machine with device passthrough will get stuck.
> >>>>>>>>
> >>>>>>>> During starting virtual machine, it will call
> >>>>>>>> pin_user_pages_remote(...,
> >>>>>>>> FOLL_LONGTERM, ...) to pin memory. If a page is in CMA area,
> >>>>>>>> pin_user_pages_remote() will migrate the page from CMA area to
> >>>>>>>> non-CMA
> >>>>>>>> area because of FOLL_LONGTERM flag. If non-movable allocation
> >>>>>>>> requests
> >>>>>>>> return CMA memory, pin_user_pages_remote() will enter endless
> >>>>>>>> loops.
> >>>>>>>>
> >>>>>>>> backtrace:
> >>>>>>>> pin_user_pages_remote
> >>>>>>>> ----__gup_longterm_locked //cause endless loops in this function
> >>>>>>>> --------__get_user_pages_locked
> >>>>>>>> --------check_and_migrate_movable_pages //always check fail and
> >>>>>>>> continue
> >>>>>>>> to migrate
> >>>>>>>> ------------migrate_longterm_unpinnable_pages
> >>>>>>>> ----------------alloc_migration_target // non-movable allocation
> >>>>>>>>
> >>>>>>>>> Is it possible that some CMA memory might be used by non-movable
> >>>>>>>>> allocation requests?
> >>>>>>>>
> >>>>>>>> Yes.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>> If so, will CMA somehow become unable to migrate, causing
> >>>>>>>>> cma_alloc()
> >>>>>>>>> to fail?
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> No, it will cause endless loops in __gup_longterm_locked(). If
> >>>>>>>> non-movable allocation requests return CMA memory,
> >>>>>>>> migrate_longterm_unpinnable_pages() will migrate a CMA page to
> >>>>>>>> another
> >>>>>>>> CMA page, which is useless and cause endless loops in
> >>>>>>>> __gup_longterm_locked().
> >>>>>>
> >>>>>> This is only one perspective. We also need to consider the impact on
> >>>>>> CMA itself. For example,
> >>>>>> when CMA is borrowed by THP, and we need to reclaim it through
> >>>>>> cma_alloc() or dma_alloc_coherent(),
> >>>>>> we must move those pages out to ensure CMA's users can retrieve that
> >>>>>> contiguous memory.
> >>>>>>
> >>>>>> Currently, CMA's memory is occupied by non-movable pages, meaning we
> >>>>>> can't relocate them.
> >>>>>> As a result, cma_alloc() is more likely to fail.
> >>>>>>
> >>>>>>>>
> >>>>>>>> backtrace:
> >>>>>>>> pin_user_pages_remote
> >>>>>>>> ----__gup_longterm_locked //cause endless loops in this function
> >>>>>>>> --------__get_user_pages_locked
> >>>>>>>> --------check_and_migrate_movable_pages //always check fail and
> >>>>>>>> continue
> >>>>>>>> to migrate
> >>>>>>>> ------------migrate_longterm_unpinnable_pages
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Fixes: 5d0a661d808f ("mm/page_alloc: use only one PCP list for
> >>>>>>>>>> THP-sized allocations")
> >>>>>>>>>> Signed-off-by: yangge <yangge1116@xxxxxxx>
> >>>>>>>>>> ---
> >>>>>>>>>>      mm/page_alloc.c | 10 ++++++++++
> >>>>>>>>>>      1 file changed, 10 insertions(+)
> >>>>>>>>>>
> >>>>>>>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >>>>>>>>>> index 2e22ce5..0bdf471 100644
> >>>>>>>>>> --- a/mm/page_alloc.c
> >>>>>>>>>> +++ b/mm/page_alloc.c
> >>>>>>>>>> @@ -2987,10 +2987,20 @@ struct page *rmqueue(struct zone
> >>>>>>>>>> *preferred_zone,
> >>>>>>>>>>             WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order
> >>>>>>>>>> > 1));
> >>>>>>>>>>
> >>>>>>>>>>             if (likely(pcp_allowed_order(order))) {
> >>>>>>>>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >>>>>>>>>> +               if (!IS_ENABLED(CONFIG_CMA) || alloc_flags &
> >>>>>>>>>> ALLOC_CMA ||
> >>>>>>>>>> +                                               order !=
> >>>>>>>>>> HPAGE_PMD_ORDER) {
> >>>>>>>>>> +                       page = rmqueue_pcplist(preferred_zone,
> >>>>>>>>>> zone,
> >>>>>>>>>> order,
> >>>>>>>>>> +                                               migratetype,
> >>>>>>>>>> alloc_flags);
> >>>>>>>>>> +                       if (likely(page))
> >>>>>>>>>> +                               goto out;
> >>>>>>>>>> +               }
> >>>>>>>>>
> >>>>>>>>> This seems not ideal, because non-CMA THP gets no chance to use
> >>>>>>>>> PCP.
> >>>>>>>>> But it
> >>>>>>>>> still seems better than causing the failure of CMA allocation.
> >>>>>>>>>
> >>>>>>>>> Is there a possible approach to avoiding adding CMA THP into
> >>>>>>>>> pcp from
> >>>>>>>>> the first
> >>>>>>>>> beginning? Otherwise, we might need a separate PCP for CMA.
> >>>>>>>>>
> >>>>>>>
> >>>>>>> The vast majority of THP-sized allocations are GFP_MOVABLE, avoiding
> >>>>>>> adding CMA THP into pcp may incur a slight performance penalty.
> >>>>>>>
> >>>>>>
> >>>>>> But the majority of movable pages aren't CMA, right?
> >>>>>
> >>>>>> Do we have an estimate for
> >>>>>> adding back a CMA THP PCP? Will per_cpu_pages introduce a new
> >>>>>> cacheline, which
> >>>>>> the original intention for THP was to avoid by having only one
> >>>>>> PCP[1]?
> >>>>>>
> >>>>>> [1]
> >>>>>> https://patchwork.kernel.org/project/linux-mm/patch/20220624125423.6126-3-mgorman@xxxxxxxxxxxxxxxxxxx/
> >>>>>>
> >>>>>
> >>>>> The size of struct per_cpu_pages is 256 bytes in current code
> >>>>> containing
> >>>>> commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for
> >>>>> THP-sized
> >>>>> allocations").
> >>>>> crash> struct per_cpu_pages
> >>>>> struct per_cpu_pages {
> >>>>>        spinlock_t lock;
> >>>>>        int count;
> >>>>>        int high;
> >>>>>        int high_min;
> >>>>>        int high_max;
> >>>>>        int batch;
> >>>>>        u8 flags;
> >>>>>        u8 alloc_factor;
> >>>>>        u8 expire;
> >>>>>        short free_count;
> >>>>>        struct list_head lists[13];
> >>>>> }
> >>>>> SIZE: 256
> >>>>>
> >>>>> After revert commit 5d0a661d808f ("mm/page_alloc: use only one PCP
> >>>>> list
> >>>>> for THP-sized allocations"), the size of struct per_cpu_pages is
> >>>>> 272 bytes.
> >>>>> crash> struct per_cpu_pages
> >>>>> struct per_cpu_pages {
> >>>>>        spinlock_t lock;
> >>>>>        int count;
> >>>>>        int high;
> >>>>>        int high_min;
> >>>>>        int high_max;
> >>>>>        int batch;
> >>>>>        u8 flags;
> >>>>>        u8 alloc_factor;
> >>>>>        u8 expire;
> >>>>>        short free_count;
> >>>>>        struct list_head lists[15];
> >>>>> }
> >>>>> SIZE: 272
> >>>>>
> >>>>> Seems commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for
> >>>>> THP-sized allocations") decrease one cacheline.
> >>>>
> >>>> the proposal is not reverting the patch but adding one CMA pcp.
> >>>> so it is "struct list_head lists[14]"; in this case, the size is still
> >>>> 256?
> >>>>
> >>>
> >>> Yes, the size is still 256. If add one PCP list, we will have 2 PCP
> >>> lists for THP. One PCP list is used by MIGRATE_UNMOVABLE, and the other
> >>> PCP list is used by MIGRATE_MOVABLE and MIGRATE_RECLAIMABLE. Is that
> >>> right?
> >>
> >> i am not quite sure about MIGRATE_RECLAIMABLE as we want to
> >> CMA is only used by movable.
> >> So it might be:
> >> MOVABLE and NON-MOVABLE.
> >
> > One PCP list is used by UNMOVABLE pages, and the other PCP list is used
> > by MOVABLE pages, seems it is feasible. UNMOVABLE PCP list contains
> > MIGRATE_UNMOVABLE pages and MIGRATE_RECLAIMABLE pages, and MOVABLE PCP
> > list contains MIGRATE_MOVABLE pages.
> >
>
> Is the following modification feasiable?
>
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> -#define NR_PCP_THP 1
> +#define NR_PCP_THP 2
> +#define PCP_THP_MOVABLE 0
> +#define PCP_THP_UNMOVABLE 1
>   #else
>   #define NR_PCP_THP 0
>   #endif
>
>   static inline unsigned int order_to_pindex(int migratetype, int order)
>   {
> +       int pcp_type = migratetype;
> +
>   #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>          if (order > PAGE_ALLOC_COSTLY_ORDER) {
>                  VM_BUG_ON(order != HPAGE_PMD_ORDER);
> -               return NR_LOWORDER_PCP_LISTS;
> +
> +               if (migratetype != MIGRATE_MOVABLE)
> +                       pcp_type = PCP_THP_UNMOVABLE;
> +               else
> +                       pcp_type = PCP_THP_MOVABLE;
> +
> +               return NR_LOWORDER_PCP_LISTS + pcp_type;
>          }
>   #else
>          VM_BUG_ON(order > PAGE_ALLOC_COSTLY_ORDER);
>   #endif
>
> -       return (MIGRATE_PCPTYPES * order) + migratetype;
> +       return (MIGRATE_PCPTYPES * order) + pcp_type;
>   }
>

a minimum change might be, then you can drop most new code.

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 120a317d0938..cfe1e0625e38 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -588,6 +588,7 @@ static void bad_page(struct page *page, const char *reason)

 static inline unsigned int order_to_pindex(int migratetype, int order)
 {
+       bool __maybe_unused movable;
 #ifdef CONFIG_CMA
        /*
         * We shouldn't get here for MIGRATE_CMA if those pages don't
@@ -600,7 +601,8 @@ static inline unsigned int order_to_pindex(int
migratetype, int order)
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
        if (order > PAGE_ALLOC_COSTLY_ORDER) {
                VM_BUG_ON(order != pageblock_order);
-               return NR_LOWORDER_PCP_LISTS;
+               movable = migratetype == MIGRATE_MOVABLE;
+               return NR_LOWORDER_PCP_LISTS + movable;
        }
 #else
        VM_BUG_ON(order > PAGE_ALLOC_COSTLY_ORDER);


>
>
> @@ -521,7 +529,7 @@ static inline int pindex_to_order(unsigned int pindex)
>          int order = pindex / MIGRATE_PCPTYPES;
>
>   #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> -       if (pindex == NR_LOWORDER_PCP_LISTS)
> +       if (order > PAGE_ALLOC_COSTLY_ORDER)
>                  order = HPAGE_PMD_ORDER;
>   #else
>          VM_BUG_ON(order > PAGE_ALLOC_COSTLY_ORDER);
>
>
>
> >>
> >>>
> >>>>
> >>>>>
> >>>>>>
> >>>>>>> Commit 1d91df85f399 takes a similar approach to filter, and I mainly
> >>>>>>> refer to it.
> >>>>>>>
> >>>>>>>
> >>>>>>>>>> +#else
> >>>>>>>>>>                     page = rmqueue_pcplist(preferred_zone,
> >>>>>>>>>> zone, order,
> >>>>>>>>>>                                            migratetype,
> >>>>>>>>>> alloc_flags);
> >>>>>>>>>>                     if (likely(page))
> >>>>>>>>>>                             goto out;
> >>>>>>>>>> +#endif
> >>>>>>>>>>             }
> >>>>>>>>>>
> >>>>>>>>>>             page = rmqueue_buddy(preferred_zone, zone, order,
> >>>>>>>>>> alloc_flags,
> >>>>>>>>>> --
> >>>>>>>>>> 2.7.4
> >>>>>>>>>
> >>>>>>>>> Thanks
> >>>>>>>>> Barry
> >>>>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>
> >>>
>