On Tue, Jun 18, 2024 at 6:56 PM yangge1116 <yangge1116@xxxxxxx> wrote: > > > > 在 2024/6/18 下午12:10, Barry Song 写道: > > On Tue, Jun 18, 2024 at 3:32 PM yangge1116 <yangge1116@xxxxxxx> wrote: > >> > >> > >> > >> 在 2024/6/18 上午9:55, Barry Song 写道: > >>> On Tue, Jun 18, 2024 at 9:36 AM yangge1116 <yangge1116@xxxxxxx> wrote: > >>>> > >>>> > >>>> > >>>> 在 2024/6/17 下午8:47, yangge1116 写道: > >>>>> > >>>>> > >>>>> 在 2024/6/17 下午6:26, Barry Song 写道: > >>>>>> On Tue, Jun 4, 2024 at 9:15 PM <yangge1116@xxxxxxx> wrote: > >>>>>>> > >>>>>>> From: yangge <yangge1116@xxxxxxx> > >>>>>>> > >>>>>>> Since commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for > >>>>>>> THP-sized allocations") no longer differentiates the migration type > >>>>>>> of pages in THP-sized PCP list, it's possible to get a CMA page from > >>>>>>> the list, in some cases, it's not acceptable, for example, allocating > >>>>>>> a non-CMA page with PF_MEMALLOC_PIN flag returns a CMA page. > >>>>>>> > >>>>>>> The patch forbids allocating non-CMA THP-sized page from THP-sized > >>>>>>> PCP list to avoid the issue above. > >>>>>> > >>>>>> Could you please describe the impact on users in the commit log? > >>>>> > >>>>> If a large number of CMA memory are configured in the system (for > >>>>> example, the CMA memory accounts for 50% of the system memory), starting > >>>>> virtual machine with device passthrough will get stuck. > >>>>> > >>>>> During starting virtual machine, it will call pin_user_pages_remote(..., > >>>>> FOLL_LONGTERM, ...) to pin memory. If a page is in CMA area, > >>>>> pin_user_pages_remote() will migrate the page from CMA area to non-CMA > >>>>> area because of FOLL_LONGTERM flag. If non-movable allocation requests > >>>>> return CMA memory, pin_user_pages_remote() will enter endless loops. > >>>>> > >>>>> backtrace: > >>>>> pin_user_pages_remote > >>>>> ----__gup_longterm_locked //cause endless loops in this function > >>>>> --------__get_user_pages_locked > >>>>> --------check_and_migrate_movable_pages //always check fail and continue > >>>>> to migrate > >>>>> ------------migrate_longterm_unpinnable_pages > >>>>> ----------------alloc_migration_target // non-movable allocation > >>>>> > >>>>>> Is it possible that some CMA memory might be used by non-movable > >>>>>> allocation requests? > >>>>> > >>>>> Yes. > >>>>> > >>>>> > >>>>>> If so, will CMA somehow become unable to migrate, causing cma_alloc() > >>>>>> to fail? > >>>>> > >>>>> > >>>>> No, it will cause endless loops in __gup_longterm_locked(). If > >>>>> non-movable allocation requests return CMA memory, > >>>>> migrate_longterm_unpinnable_pages() will migrate a CMA page to another > >>>>> CMA page, which is useless and cause endless loops in > >>>>> __gup_longterm_locked(). > >>> > >>> This is only one perspective. We also need to consider the impact on > >>> CMA itself. For example, > >>> when CMA is borrowed by THP, and we need to reclaim it through > >>> cma_alloc() or dma_alloc_coherent(), > >>> we must move those pages out to ensure CMA's users can retrieve that > >>> contiguous memory. > >>> > >>> Currently, CMA's memory is occupied by non-movable pages, meaning we > >>> can't relocate them. > >>> As a result, cma_alloc() is more likely to fail. > >>> > >>>>> > >>>>> backtrace: > >>>>> pin_user_pages_remote > >>>>> ----__gup_longterm_locked //cause endless loops in this function > >>>>> --------__get_user_pages_locked > >>>>> --------check_and_migrate_movable_pages //always check fail and continue > >>>>> to migrate > >>>>> ------------migrate_longterm_unpinnable_pages > >>>>> > >>>>> > >>>>> > >>>>> > >>>>> > >>>>>>> > >>>>>>> Fixes: 5d0a661d808f ("mm/page_alloc: use only one PCP list for > >>>>>>> THP-sized allocations") > >>>>>>> Signed-off-by: yangge <yangge1116@xxxxxxx> > >>>>>>> --- > >>>>>>> mm/page_alloc.c | 10 ++++++++++ > >>>>>>> 1 file changed, 10 insertions(+) > >>>>>>> > >>>>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c > >>>>>>> index 2e22ce5..0bdf471 100644 > >>>>>>> --- a/mm/page_alloc.c > >>>>>>> +++ b/mm/page_alloc.c > >>>>>>> @@ -2987,10 +2987,20 @@ struct page *rmqueue(struct zone > >>>>>>> *preferred_zone, > >>>>>>> WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1)); > >>>>>>> > >>>>>>> if (likely(pcp_allowed_order(order))) { > >>>>>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE > >>>>>>> + if (!IS_ENABLED(CONFIG_CMA) || alloc_flags & > >>>>>>> ALLOC_CMA || > >>>>>>> + order != > >>>>>>> HPAGE_PMD_ORDER) { > >>>>>>> + page = rmqueue_pcplist(preferred_zone, zone, > >>>>>>> order, > >>>>>>> + migratetype, > >>>>>>> alloc_flags); > >>>>>>> + if (likely(page)) > >>>>>>> + goto out; > >>>>>>> + } > >>>>>> > >>>>>> This seems not ideal, because non-CMA THP gets no chance to use PCP. > >>>>>> But it > >>>>>> still seems better than causing the failure of CMA allocation. > >>>>>> > >>>>>> Is there a possible approach to avoiding adding CMA THP into pcp from > >>>>>> the first > >>>>>> beginning? Otherwise, we might need a separate PCP for CMA. > >>>>>> > >>>> > >>>> The vast majority of THP-sized allocations are GFP_MOVABLE, avoiding > >>>> adding CMA THP into pcp may incur a slight performance penalty. > >>>> > >>> > >>> But the majority of movable pages aren't CMA, right? > >> > >>> Do we have an estimate for > >>> adding back a CMA THP PCP? Will per_cpu_pages introduce a new cacheline, which > >>> the original intention for THP was to avoid by having only one PCP[1]? > >>> > >>> [1] https://patchwork.kernel.org/project/linux-mm/patch/20220624125423.6126-3-mgorman@xxxxxxxxxxxxxxxxxxx/ > >>> > >> > >> The size of struct per_cpu_pages is 256 bytes in current code containing > >> commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for THP-sized > >> allocations"). > >> crash> struct per_cpu_pages > >> struct per_cpu_pages { > >> spinlock_t lock; > >> int count; > >> int high; > >> int high_min; > >> int high_max; > >> int batch; > >> u8 flags; > >> u8 alloc_factor; > >> u8 expire; > >> short free_count; > >> struct list_head lists[13]; > >> } > >> SIZE: 256 > >> > >> After revert commit 5d0a661d808f ("mm/page_alloc: use only one PCP list > >> for THP-sized allocations"), the size of struct per_cpu_pages is 272 bytes. > >> crash> struct per_cpu_pages > >> struct per_cpu_pages { > >> spinlock_t lock; > >> int count; > >> int high; > >> int high_min; > >> int high_max; > >> int batch; > >> u8 flags; > >> u8 alloc_factor; > >> u8 expire; > >> short free_count; > >> struct list_head lists[15]; > >> } > >> SIZE: 272 > >> > >> Seems commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for > >> THP-sized allocations") decrease one cacheline. > > > > the proposal is not reverting the patch but adding one CMA pcp. > > so it is "struct list_head lists[14]"; in this case, the size is still > > 256? > > > > Yes, the size is still 256. If add one PCP list, we will have 2 PCP > lists for THP. One PCP list is used by MIGRATE_UNMOVABLE, and the other > PCP list is used by MIGRATE_MOVABLE and MIGRATE_RECLAIMABLE. Is that right? i am not quite sure about MIGRATE_RECLAIMABLE as we want to CMA is only used by movable. So it might be: MOVABLE and NON-MOVABLE. > > > > >> > >>> > >>>> Commit 1d91df85f399 takes a similar approach to filter, and I mainly > >>>> refer to it. > >>>> > >>>> > >>>>>>> +#else > >>>>>>> page = rmqueue_pcplist(preferred_zone, zone, order, > >>>>>>> migratetype, alloc_flags); > >>>>>>> if (likely(page)) > >>>>>>> goto out; > >>>>>>> +#endif > >>>>>>> } > >>>>>>> > >>>>>>> page = rmqueue_buddy(preferred_zone, zone, order, alloc_flags, > >>>>>>> -- > >>>>>>> 2.7.4 > >>>>>> > >>>>>> Thanks > >>>>>> Barry > >>>>>> > >>>> > >>>> > >> >