On Tue, Jun 18, 2024 at 3:32 PM yangge1116 <yangge1116@xxxxxxx> wrote:
>
> On 2024/6/18 9:55 AM, Barry Song wrote:
> > On Tue, Jun 18, 2024 at 9:36 AM yangge1116 <yangge1116@xxxxxxx> wrote:
> >>
> >> On 2024/6/17 8:47 PM, yangge1116 wrote:
> >>>
> >>> On 2024/6/17 6:26 PM, Barry Song wrote:
> >>>> On Tue, Jun 4, 2024 at 9:15 PM <yangge1116@xxxxxxx> wrote:
> >>>>>
> >>>>> From: yangge <yangge1116@xxxxxxx>
> >>>>>
> >>>>> Since commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for
> >>>>> THP-sized allocations") no longer differentiates the migration type
> >>>>> of pages in the THP-sized PCP list, it is possible to get a CMA page
> >>>>> from the list. In some cases this is not acceptable; for example,
> >>>>> allocating a non-CMA page with the PF_MEMALLOC_PIN flag can return a
> >>>>> CMA page.
> >>>>>
> >>>>> This patch forbids allocating a non-CMA THP-sized page from the
> >>>>> THP-sized PCP list to avoid the issue above.
> >>>>
> >>>> Could you please describe the impact on users in the commit log?
> >>>
> >>> If a large amount of CMA memory is configured in the system (for
> >>> example, CMA memory accounts for 50% of system memory), starting a
> >>> virtual machine with device passthrough will get stuck.
> >>>
> >>> While starting the virtual machine, pin_user_pages_remote(...,
> >>> FOLL_LONGTERM, ...) is called to pin memory. If a page is in a CMA
> >>> area, pin_user_pages_remote() migrates the page out of the CMA area
> >>> because of the FOLL_LONGTERM flag. If a non-movable allocation request
> >>> returns CMA memory, pin_user_pages_remote() enters an endless loop.
> >>>
> >>> backtrace:
> >>> pin_user_pages_remote
> >>> ----__gup_longterm_locked                 // endless loop here
> >>> --------__get_user_pages_locked
> >>> --------check_and_migrate_movable_pages   // check always fails; keeps
> >>>                                           // trying to migrate
> >>> ------------migrate_longterm_unpinnable_pages
> >>> ----------------alloc_migration_target    // non-movable allocation
> >>>
> >>>> Is it possible that some CMA memory might be used by non-movable
> >>>> allocation requests?
> >>>
> >>> Yes.
> >>>
> >>>> If so, will CMA somehow become unable to migrate, causing cma_alloc()
> >>>> to fail?
> >>>
> >>> No, it will cause endless loops in __gup_longterm_locked(). If a
> >>> non-movable allocation request returns CMA memory,
> >>> migrate_longterm_unpinnable_pages() will migrate a CMA page to another
> >>> CMA page, which is useless and causes an endless loop in
> >>> __gup_longterm_locked().
> >
> > This is only one perspective. We also need to consider the impact on
> > CMA itself. For example, when CMA is borrowed by THP, and we need to
> > reclaim it through cma_alloc() or dma_alloc_coherent(), we must move
> > those pages out to ensure CMA's users can retrieve that contiguous
> > memory.
> >
> > Currently, CMA's memory is occupied by non-movable pages, meaning we
> > can't relocate them. As a result, cma_alloc() is more likely to fail.
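As a rough illustration of the loop described above, here is a minimal
user-space model (illustration only; the *_model functions are
hypothetical stand-ins, not the kernel's real GUP helpers or their
signatures). If the "migration target" allocator can hand back a CMA
page, the movability check never succeeds and the retry never ends:

#include <stdbool.h>
#include <stdio.h>

enum page_area { AREA_NORMAL, AREA_CMA };

/* Stand-in for alloc_migration_target(): "buggy" models a PCP list that
 * can return a CMA page even for a non-movable (pinned) request. */
static enum page_area alloc_target_model(bool buggy)
{
        return buggy ? AREA_CMA : AREA_NORMAL;
}

/* Stand-in for check_and_migrate_movable_pages(): returns true once the
 * page has left the CMA area, false if another pass is needed. */
static bool check_and_migrate_model(enum page_area *page, bool buggy)
{
        if (*page != AREA_CMA)
                return true;                    /* longterm pin is safe */
        *page = alloc_target_model(buggy);      /* "migrate" the page */
        return false;                           /* caller must re-check */
}

int main(void)
{
        enum page_area page = AREA_CMA;
        int passes = 0;

        /* With buggy = true this would spin forever, as in
         * __gup_longterm_locked(); cap it so the model terminates. */
        while (!check_and_migrate_model(&page, true) && ++passes < 5)
                ;
        printf("after %d passes, page still in CMA: %s\n",
               passes, page == AREA_CMA ? "yes" : "no");
        return 0;
}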
> >
> >>>>> Fixes: 5d0a661d808f ("mm/page_alloc: use only one PCP list for
> >>>>> THP-sized allocations")
> >>>>> Signed-off-by: yangge <yangge1116@xxxxxxx>
> >>>>> ---
> >>>>>  mm/page_alloc.c | 10 ++++++++++
> >>>>>  1 file changed, 10 insertions(+)
> >>>>>
> >>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >>>>> index 2e22ce5..0bdf471 100644
> >>>>> --- a/mm/page_alloc.c
> >>>>> +++ b/mm/page_alloc.c
> >>>>> @@ -2987,10 +2987,20 @@ struct page *rmqueue(struct zone *preferred_zone,
> >>>>>         WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
> >>>>>
> >>>>>         if (likely(pcp_allowed_order(order))) {
> >>>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >>>>> +               if (!IS_ENABLED(CONFIG_CMA) || alloc_flags & ALLOC_CMA ||
> >>>>> +                               order != HPAGE_PMD_ORDER) {
> >>>>> +                       page = rmqueue_pcplist(preferred_zone, zone,
> >>>>> +                                       order, migratetype, alloc_flags);
> >>>>> +                       if (likely(page))
> >>>>> +                               goto out;
> >>>>> +               }
> >>>>
> >>>> This seems not ideal, because non-CMA THP gets no chance to use the
> >>>> PCP. But it still seems better than causing CMA allocations to fail.
> >>>>
> >>>> Is there a possible approach to avoid adding CMA THP to the PCP in
> >>>> the first place? Otherwise, we might need a separate PCP for CMA.
> >>
> >> The vast majority of THP-sized allocations are GFP_MOVABLE, so avoiding
> >> adding CMA THP to the PCP may incur a slight performance penalty.
> >
> > But the majority of movable pages aren't CMA, right? Do we have an
> > estimate for adding back a CMA THP PCP? Will per_cpu_pages introduce a
> > new cacheline, which the original THP change intended to avoid by
> > having only one PCP list [1]?
> >
> > [1] https://patchwork.kernel.org/project/linux-mm/patch/20220624125423.6126-3-mgorman@xxxxxxxxxxxxxxxxxxx/
>
> The size of struct per_cpu_pages is 256 bytes in the current code, which
> contains commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for
> THP-sized allocations"):
>
> crash> struct per_cpu_pages
> struct per_cpu_pages {
>     spinlock_t lock;
>     int count;
>     int high;
>     int high_min;
>     int high_max;
>     int batch;
>     u8 flags;
>     u8 alloc_factor;
>     u8 expire;
>     short free_count;
>     struct list_head lists[13];
> }
> SIZE: 256
>
> After reverting commit 5d0a661d808f ("mm/page_alloc: use only one PCP
> list for THP-sized allocations"), the size of struct per_cpu_pages is
> 272 bytes:
>
> crash> struct per_cpu_pages
> struct per_cpu_pages {
>     spinlock_t lock;
>     int count;
>     int high;
>     int high_min;
>     int high_max;
>     int batch;
>     u8 flags;
>     u8 alloc_factor;
>     u8 expire;
>     short free_count;
>     struct list_head lists[15];
> }
> SIZE: 272
>
> It seems commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for
> THP-sized allocations") saved one cacheline.

The proposal is not to revert the patch but to add one CMA PCP list, so
it would be "struct list_head lists[14]"; in that case, is the size
still 256? (See the sizing sketch after the quoted patch below.)

>
> >> Commit 1d91df85f399 takes a similar filtering approach, and I mainly
> >> referred to it.
> >>
> >>>>> +#else
> >>>>>                 page = rmqueue_pcplist(preferred_zone, zone, order,
> >>>>>                                        migratetype, alloc_flags);
> >>>>>                 if (likely(page))
> >>>>>                         goto out;
> >>>>> +#endif
> >>>>>         }
> >>>>>
> >>>>>         page = rmqueue_buddy(preferred_zone, zone, order, alloc_flags,
> >>>>> --
> >>>>> 2.7.4
> >>>>
> >>>> Thanks
> >>>> Barry
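To estimate the answer to the sizing question above, here is a quick
user-space mock of the per_cpu_pages field layout (an editorial sketch,
not kernel code: spinlock_t and list_head are replaced by same-size
stand-ins for a 64-bit build without lock debugging, and the kernel's
alignment attributes are not modeled, so the sizes crash reports may
differ by trailing padding):

#include <stdio.h>

struct mock_list_head { void *next, *prev; };   /* 16 bytes on 64-bit */

#define DEFINE_MOCK_PCP(name, nlists)                              \
struct name {                                                      \
        int lock;             /* spinlock_t without lock debug */  \
        int count;                                                 \
        int high;                                                  \
        int high_min;                                              \
        int high_max;                                              \
        int batch;                                                 \
        unsigned char flags;                                       \
        unsigned char alloc_factor;                                \
        unsigned char expire; /* present with CONFIG_NUMA */       \
        short free_count;                                          \
        struct mock_list_head lists[nlists];                       \
}

DEFINE_MOCK_PCP(pcp_13, 13);  /* current code: one shared THP list */
DEFINE_MOCK_PCP(pcp_14, 14);  /* proposal: add one CMA THP list    */
DEFINE_MOCK_PCP(pcp_15, 15);  /* full revert of 5d0a661d808f       */

int main(void)
{
        /* Raw layout is 32 + 16 * nlists bytes (lists start at offset 32). */
        printf("13 lists: %3zu bytes (%zu cachelines)\n",
               sizeof(struct pcp_13), (sizeof(struct pcp_13) + 63) / 64);
        printf("14 lists: %3zu bytes (%zu cachelines)\n",
               sizeof(struct pcp_14), (sizeof(struct pcp_14) + 63) / 64);
        printf("15 lists: %3zu bytes (%zu cachelines)\n",
               sizeof(struct pcp_15), (sizeof(struct pcp_15) + 63) / 64);
        return 0;
}

On this model, 13 and 14 lists both fit in four 64-byte cachelines (240
and 256 bytes of fields), while 15 lists (272 bytes) need a fifth. So a
single extra CMA THP list would not, by itself, grow the structure past
the 256 bytes reported above.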