Re: [PATCH] mm/page_alloc: skip THP-sized PCP list when allocating non-CMA THP-sized page

Barry Song <21cnbao@xxxxxxxxx> · Tue, 18 Jun 2024 18:58:47 +1200

On Tue, Jun 18, 2024 at 6:56 PM yangge1116 <yangge1116@xxxxxxx> wrote:
>
>
>
> 在 2024/6/18 下午12:10, Barry Song 写道:
> > On Tue, Jun 18, 2024 at 3:32 PM yangge1116 <yangge1116@xxxxxxx> wrote:
> >>
> >>
> >>
> >> 在 2024/6/18 上午9:55, Barry Song 写道:
> >>> On Tue, Jun 18, 2024 at 9:36 AM yangge1116 <yangge1116@xxxxxxx> wrote:
> >>>>
> >>>>
> >>>>
> >>>> 在 2024/6/17 下午8:47, yangge1116 写道:
> >>>>>
> >>>>>
> >>>>> 在 2024/6/17 下午6:26, Barry Song 写道:
> >>>>>> On Tue, Jun 4, 2024 at 9:15 PM <yangge1116@xxxxxxx> wrote:
> >>>>>>>
> >>>>>>> From: yangge <yangge1116@xxxxxxx>
> >>>>>>>
> >>>>>>> Since commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for
> >>>>>>> THP-sized allocations") no longer differentiates the migration type
> >>>>>>> of pages in THP-sized PCP list, it's possible to get a CMA page from
> >>>>>>> the list, in some cases, it's not acceptable, for example, allocating
> >>>>>>> a non-CMA page with PF_MEMALLOC_PIN flag returns a CMA page.
> >>>>>>>
> >>>>>>> The patch forbids allocating non-CMA THP-sized page from THP-sized
> >>>>>>> PCP list to avoid the issue above.
> >>>>>>
> >>>>>> Could you please describe the impact on users in the commit log?
> >>>>>
> >>>>> If a large number of CMA memory are configured in the system (for
> >>>>> example, the CMA memory accounts for 50% of the system memory), starting
> >>>>> virtual machine with device passthrough will get stuck.
> >>>>>
> >>>>> During starting virtual machine, it will call pin_user_pages_remote(...,
> >>>>> FOLL_LONGTERM, ...) to pin memory. If a page is in CMA area,
> >>>>> pin_user_pages_remote() will migrate the page from CMA area to non-CMA
> >>>>> area because of FOLL_LONGTERM flag. If non-movable allocation requests
> >>>>> return CMA memory, pin_user_pages_remote() will enter endless loops.
> >>>>>
> >>>>> backtrace:
> >>>>> pin_user_pages_remote
> >>>>> ----__gup_longterm_locked //cause endless loops in this function
> >>>>> --------__get_user_pages_locked
> >>>>> --------check_and_migrate_movable_pages //always check fail and continue
> >>>>> to migrate
> >>>>> ------------migrate_longterm_unpinnable_pages
> >>>>> ----------------alloc_migration_target // non-movable allocation
> >>>>>
> >>>>>> Is it possible that some CMA memory might be used by non-movable
> >>>>>> allocation requests?
> >>>>>
> >>>>> Yes.
> >>>>>
> >>>>>
> >>>>>> If so, will CMA somehow become unable to migrate, causing cma_alloc()
> >>>>>> to fail?
> >>>>>
> >>>>>
> >>>>> No, it will cause endless loops in __gup_longterm_locked(). If
> >>>>> non-movable allocation requests return CMA memory,
> >>>>> migrate_longterm_unpinnable_pages() will migrate a CMA page to another
> >>>>> CMA page, which is useless and cause endless loops in
> >>>>> __gup_longterm_locked().
> >>>
> >>> This is only one perspective. We also need to consider the impact on
> >>> CMA itself. For example,
> >>> when CMA is borrowed by THP, and we need to reclaim it through
> >>> cma_alloc() or dma_alloc_coherent(),
> >>> we must move those pages out to ensure CMA's users can retrieve that
> >>> contiguous memory.
> >>>
> >>> Currently, CMA's memory is occupied by non-movable pages, meaning we
> >>> can't relocate them.
> >>> As a result, cma_alloc() is more likely to fail.
> >>>
> >>>>>
> >>>>> backtrace:
> >>>>> pin_user_pages_remote
> >>>>> ----__gup_longterm_locked //cause endless loops in this function
> >>>>> --------__get_user_pages_locked
> >>>>> --------check_and_migrate_movable_pages //always check fail and continue
> >>>>> to migrate
> >>>>> ------------migrate_longterm_unpinnable_pages
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>>>
> >>>>>>> Fixes: 5d0a661d808f ("mm/page_alloc: use only one PCP list for
> >>>>>>> THP-sized allocations")
> >>>>>>> Signed-off-by: yangge <yangge1116@xxxxxxx>
> >>>>>>> ---
> >>>>>>>     mm/page_alloc.c | 10 ++++++++++
> >>>>>>>     1 file changed, 10 insertions(+)
> >>>>>>>
> >>>>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >>>>>>> index 2e22ce5..0bdf471 100644
> >>>>>>> --- a/mm/page_alloc.c
> >>>>>>> +++ b/mm/page_alloc.c
> >>>>>>> @@ -2987,10 +2987,20 @@ struct page *rmqueue(struct zone
> >>>>>>> *preferred_zone,
> >>>>>>>            WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
> >>>>>>>
> >>>>>>>            if (likely(pcp_allowed_order(order))) {
> >>>>>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >>>>>>> +               if (!IS_ENABLED(CONFIG_CMA) || alloc_flags &
> >>>>>>> ALLOC_CMA ||
> >>>>>>> +                                               order !=
> >>>>>>> HPAGE_PMD_ORDER) {
> >>>>>>> +                       page = rmqueue_pcplist(preferred_zone, zone,
> >>>>>>> order,
> >>>>>>> +                                               migratetype,
> >>>>>>> alloc_flags);
> >>>>>>> +                       if (likely(page))
> >>>>>>> +                               goto out;
> >>>>>>> +               }
> >>>>>>
> >>>>>> This seems not ideal, because non-CMA THP gets no chance to use PCP.
> >>>>>> But it
> >>>>>> still seems better than causing the failure of CMA allocation.
> >>>>>>
> >>>>>> Is there a possible approach to avoiding adding CMA THP into pcp from
> >>>>>> the first
> >>>>>> beginning? Otherwise, we might need a separate PCP for CMA.
> >>>>>>
> >>>>
> >>>> The vast majority of THP-sized allocations are GFP_MOVABLE, avoiding
> >>>> adding CMA THP into pcp may incur a slight performance penalty.
> >>>>
> >>>
> >>> But the majority of movable pages aren't CMA, right?
> >>
> >>> Do we have an estimate for
> >>> adding back a CMA THP PCP? Will per_cpu_pages introduce a new cacheline, which
> >>> the original intention for THP was to avoid by having only one PCP[1]?
> >>>
> >>> [1] https://patchwork.kernel.org/project/linux-mm/patch/20220624125423.6126-3-mgorman@xxxxxxxxxxxxxxxxxxx/
> >>>
> >>
> >> The size of struct per_cpu_pages is 256 bytes in current code containing
> >> commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for THP-sized
> >> allocations").
> >> crash> struct per_cpu_pages
> >> struct per_cpu_pages {
> >>       spinlock_t lock;
> >>       int count;
> >>       int high;
> >>       int high_min;
> >>       int high_max;
> >>       int batch;
> >>       u8 flags;
> >>       u8 alloc_factor;
> >>       u8 expire;
> >>       short free_count;
> >>       struct list_head lists[13];
> >> }
> >> SIZE: 256
> >>
> >> After revert commit 5d0a661d808f ("mm/page_alloc: use only one PCP list
> >> for THP-sized allocations"), the size of struct per_cpu_pages is 272 bytes.
> >> crash> struct per_cpu_pages
> >> struct per_cpu_pages {
> >>       spinlock_t lock;
> >>       int count;
> >>       int high;
> >>       int high_min;
> >>       int high_max;
> >>       int batch;
> >>       u8 flags;
> >>       u8 alloc_factor;
> >>       u8 expire;
> >>       short free_count;
> >>       struct list_head lists[15];
> >> }
> >> SIZE: 272
> >>
> >> Seems commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for
> >> THP-sized allocations") decrease one cacheline.
> >
> > the proposal is not reverting the patch but adding one CMA pcp.
> > so it is "struct list_head lists[14]"; in this case, the size is still
> > 256?
> >
>
> Yes, the size is still 256. If add one PCP list, we will have 2 PCP
> lists for THP. One PCP list is used by MIGRATE_UNMOVABLE, and the other
> PCP list is used by MIGRATE_MOVABLE and MIGRATE_RECLAIMABLE. Is that right?

i am not quite sure about MIGRATE_RECLAIMABLE as we want to
CMA is only used by movable.
So it might be:
MOVABLE and NON-MOVABLE.

>
> >
> >>
> >>>
> >>>> Commit 1d91df85f399 takes a similar approach to filter, and I mainly
> >>>> refer to it.
> >>>>
> >>>>
> >>>>>>> +#else
> >>>>>>>                    page = rmqueue_pcplist(preferred_zone, zone, order,
> >>>>>>>                                           migratetype, alloc_flags);
> >>>>>>>                    if (likely(page))
> >>>>>>>                            goto out;
> >>>>>>> +#endif
> >>>>>>>            }
> >>>>>>>
> >>>>>>>            page = rmqueue_buddy(preferred_zone, zone, order, alloc_flags,
> >>>>>>> --
> >>>>>>> 2.7.4
> >>>>>>
> >>>>>> Thanks
> >>>>>> Barry
> >>>>>>
> >>>>
> >>>>
> >>
>