On Tue, Jun 18, 2024 at 3:32 PM yangge1116 <yangge1116@xxxxxxx> wrote:
>
> On 2024/6/18 9:55 AM, Barry Song wrote:
> > On Tue, Jun 18, 2024 at 9:36 AM yangge1116 <yangge1116@xxxxxxx> wrote:
> >>
> >> On 2024/6/17 8:47 PM, yangge1116 wrote:
> >>>
> >>> On 2024/6/17 6:26 PM, Barry Song wrote:
> >>>> On Tue, Jun 4, 2024 at 9:15 PM <yangge1116@xxxxxxx> wrote:
> >>>>>
> >>>>> From: yangge <yangge1116@xxxxxxx>
> >>>>>
> >>>>> Since commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for
> >>>>> THP-sized allocations") no longer differentiates the migration type
> >>>>> of pages in the THP-sized PCP list, it is possible to get a CMA page
> >>>>> from the list. In some cases this is not acceptable; for example,
> >>>>> allocating a non-CMA page with the PF_MEMALLOC_PIN flag can return a
> >>>>> CMA page.
> >>>>>
> >>>>> This patch forbids allocating a non-CMA THP-sized page from the
> >>>>> THP-sized PCP list to avoid the issue above.
> >>>>
> >>>> Could you please describe the impact on users in the commit log?
> >>>
> >>> If a large amount of CMA memory is configured in the system (for
> >>> example, CMA memory accounts for 50% of system memory), starting a
> >>> virtual machine with device passthrough will get stuck.
> >>>
> >>> While starting the virtual machine, pin_user_pages_remote(...,
> >>> FOLL_LONGTERM, ...) is called to pin memory. If a page is in a CMA
> >>> area, pin_user_pages_remote() migrates the page out of the CMA area
> >>> because of the FOLL_LONGTERM flag. If a non-movable allocation request
> >>> returns CMA memory, pin_user_pages_remote() enters an endless loop.
> >>>
> >>> backtrace:
> >>> pin_user_pages_remote
> >>> ----__gup_longterm_locked                 // endless loop here
> >>> --------__get_user_pages_locked
> >>> --------check_and_migrate_movable_pages   // check always fails; keeps
> >>>                                           // trying to migrate
> >>> ------------migrate_longterm_unpinnable_pages
> >>> ----------------alloc_migration_target    // non-movable allocation
> >>>
> >>>> Is it possible that some CMA memory might be used by non-movable
> >>>> allocation requests?
> >>>
> >>> Yes.
> >>>
> >>>> If so, will CMA somehow become unable to migrate, causing cma_alloc()
> >>>> to fail?
> >>>
> >>> No, it will cause endless loops in __gup_longterm_locked(). If a
> >>> non-movable allocation request returns CMA memory,
> >>> migrate_longterm_unpinnable_pages() will migrate a CMA page to another
> >>> CMA page, which is useless and causes an endless loop in
> >>> __gup_longterm_locked().
> >
> > This is only one perspective. We also need to consider the impact on
> > CMA itself. For example, when CMA is borrowed by THP, and we need to
> > reclaim it through cma_alloc() or dma_alloc_coherent(), we must move
> > those pages out to ensure CMA's users can retrieve that contiguous
> > memory.
> >
> > Currently, CMA's memory is occupied by non-movable pages, meaning we
> > can't relocate them. As a result, cma_alloc() is more likely to fail.
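As a rough illustration of the loop described above, here is a minimal
user-space model (illustration only; the *_model functions are
hypothetical stand-ins, not the kernel's real GUP helpers or their
signatures). If the "migration target" allocator can hand back a CMA
page, the movability check never succeeds and the retry never ends:

#include <stdbool.h>
#include <stdio.h>

enum page_area { AREA_NORMAL, AREA_CMA };

/* Stand-in for alloc_migration_target(): "buggy" models a PCP list that
 * can return a CMA page even for a non-movable (pinned) request. */
static enum page_area alloc_target_model(bool buggy)
{
        return buggy ? AREA_CMA : AREA_NORMAL;
}

/* Stand-in for check_and_migrate_movable_pages(): returns true once the
 * page has left the CMA area, false if another pass is needed. */
static bool check_and_migrate_model(enum page_area *page, bool buggy)
{
        if (*page != AREA_CMA)
                return true;                    /* longterm pin is safe */
        *page = alloc_target_model(buggy);      /* "migrate" the page */
        return false;                           /* caller must re-check */
}

int main(void)
{
        enum page_area page = AREA_CMA;
        int passes = 0;

        /* With buggy = true this would spin forever, as in
         * __gup_longterm_locked(); cap it so the model terminates. */
        while (!check_and_migrate_model(&page, true) && ++passes < 5)
                ;
        printf("after %d passes, page still in CMA: %s\n",
               passes, page == AREA_CMA ? "yes" : "no");
        return 0;
}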
> >
> >>>>> Fixes: 5d0a661d808f ("mm/page_alloc: use only one PCP list for
> >>>>> THP-sized allocations")
> >>>>> Signed-off-by: yangge <yangge1116@xxxxxxx>
> >>>>> ---
> >>>>>  mm/page_alloc.c | 10 ++++++++++
> >>>>>  1 file changed, 10 insertions(+)
> >>>>>
> >>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >>>>> index 2e22ce5..0bdf471 100644
> >>>>> --- a/mm/page_alloc.c
> >>>>> +++ b/mm/page_alloc.c
> >>>>> @@ -2987,10 +2987,20 @@ struct page *rmqueue(struct zone *preferred_zone,
> >>>>>         WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
> >>>>>
> >>>>>         if (likely(pcp_allowed_order(order))) {
> >>>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >>>>> +               if (!IS_ENABLED(CONFIG_CMA) || alloc_flags & ALLOC_CMA ||
> >>>>> +                               order != HPAGE_PMD_ORDER) {
> >>>>> +                       page = rmqueue_pcplist(preferred_zone, zone,
> >>>>> +                                       order, migratetype, alloc_flags);
> >>>>> +                       if (likely(page))
> >>>>> +                               goto out;
> >>>>> +               }
> >>>>
> >>>> This seems not ideal, because non-CMA THP gets no chance to use the
> >>>> PCP. But it still seems better than causing CMA allocations to fail.
> >>>>
> >>>> Is there a possible approach to avoid adding CMA THP to the PCP in
> >>>> the first place? Otherwise, we might need a separate PCP for CMA.
> >>
> >> The vast majority of THP-sized allocations are GFP_MOVABLE, so avoiding
> >> adding CMA THP to the PCP may incur a slight performance penalty.
> >
> > But the majority of movable pages aren't CMA, right? Do we have an
> > estimate for adding back a CMA THP PCP? Will per_cpu_pages introduce a
> > new cacheline, which the original THP change intended to avoid by
> > having only one PCP list [1]?
> >
> > [1] https://patchwork.kernel.org/project/linux-mm/patch/20220624125423.6126-3-mgorman@xxxxxxxxxxxxxxxxxxx/
>
> The size of struct per_cpu_pages is 256 bytes in the current code, which
> contains commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for
> THP-sized allocations"):
>
> crash> struct per_cpu_pages
> struct per_cpu_pages {
>     spinlock_t lock;
>     int count;
>     int high;
>     int high_min;
>     int high_max;
>     int batch;
>     u8 flags;
>     u8 alloc_factor;
>     u8 expire;
>     short free_count;
>     struct list_head lists[13];
> }
> SIZE: 256
>
> After reverting commit 5d0a661d808f ("mm/page_alloc: use only one PCP
> list for THP-sized allocations"), the size of struct per_cpu_pages is
> 272 bytes:
>
> crash> struct per_cpu_pages
> struct per_cpu_pages {
>     spinlock_t lock;
>     int count;
>     int high;
>     int high_min;
>     int high_max;
>     int batch;
>     u8 flags;
>     u8 alloc_factor;
>     u8 expire;
>     short free_count;
>     struct list_head lists[15];
> }
> SIZE: 272
>
> It seems commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for
> THP-sized allocations") saved one cacheline.

The proposal is not to revert the patch but to add one CMA PCP list, so
it would be "struct list_head lists[14]"; in that case, is the size
still 256? (See the sizing sketch after the quoted patch below.)

>
> >> Commit 1d91df85f399 takes a similar filtering approach, and I mainly
> >> referred to it.
> >>
> >>>>> +#else
> >>>>>                 page = rmqueue_pcplist(preferred_zone, zone, order,
> >>>>>                                        migratetype, alloc_flags);
> >>>>>                 if (likely(page))
> >>>>>                         goto out;
> >>>>> +#endif
> >>>>>         }
> >>>>>
> >>>>>         page = rmqueue_buddy(preferred_zone, zone, order, alloc_flags,
> >>>>> --
> >>>>> 2.7.4
> >>>>
> >>>> Thanks
> >>>> Barry
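To estimate the answer to the sizing question above, here is a quick
user-space mock of the per_cpu_pages field layout (an editorial sketch,
not kernel code: spinlock_t and list_head are replaced by same-size
stand-ins for a 64-bit build without lock debugging, and the kernel's
alignment attributes are not modeled, so the sizes crash reports may
differ by trailing padding):

#include <stdio.h>

struct mock_list_head { void *next, *prev; };   /* 16 bytes on 64-bit */

#define DEFINE_MOCK_PCP(name, nlists)                              \
struct name {                                                      \
        int lock;             /* spinlock_t without lock debug */  \
        int count;                                                 \
        int high;                                                  \
        int high_min;                                              \
        int high_max;                                              \
        int batch;                                                 \
        unsigned char flags;                                       \
        unsigned char alloc_factor;                                \
        unsigned char expire; /* present with CONFIG_NUMA */       \
        short free_count;                                          \
        struct mock_list_head lists[nlists];                       \
}

DEFINE_MOCK_PCP(pcp_13, 13);  /* current code: one shared THP list */
DEFINE_MOCK_PCP(pcp_14, 14);  /* proposal: add one CMA THP list    */
DEFINE_MOCK_PCP(pcp_15, 15);  /* full revert of 5d0a661d808f       */

int main(void)
{
        /* Raw layout is 32 + 16 * nlists bytes (lists start at offset 32). */
        printf("13 lists: %3zu bytes (%zu cachelines)\n",
               sizeof(struct pcp_13), (sizeof(struct pcp_13) + 63) / 64);
        printf("14 lists: %3zu bytes (%zu cachelines)\n",
               sizeof(struct pcp_14), (sizeof(struct pcp_14) + 63) / 64);
        printf("15 lists: %3zu bytes (%zu cachelines)\n",
               sizeof(struct pcp_15), (sizeof(struct pcp_15) + 63) / 64);
        return 0;
}

On this model, 13 and 14 lists both fit in four 64-byte cachelines (240
and 256 bytes of fields), while 15 lists (272 bytes) need a fifth. So a
single extra CMA THP list would not, by itself, grow the structure past
the 256 bytes reported above.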