On Wed, Jun 19, 2024 at 5:35 PM Ge Yang <yangge1116@xxxxxxx> wrote: > > > > 在 2024/6/18 15:51, yangge1116 写道: > > > > > > 在 2024/6/18 下午2:58, Barry Song 写道: > >> On Tue, Jun 18, 2024 at 6:56 PM yangge1116 <yangge1116@xxxxxxx> wrote: > >>> > >>> > >>> > >>> 在 2024/6/18 下午12:10, Barry Song 写道: > >>>> On Tue, Jun 18, 2024 at 3:32 PM yangge1116 <yangge1116@xxxxxxx> wrote: > >>>>> > >>>>> > >>>>> > >>>>> 在 2024/6/18 上午9:55, Barry Song 写道: > >>>>>> On Tue, Jun 18, 2024 at 9:36 AM yangge1116 <yangge1116@xxxxxxx> > >>>>>> wrote: > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> 在 2024/6/17 下午8:47, yangge1116 写道: > >>>>>>>> > >>>>>>>> > >>>>>>>> 在 2024/6/17 下午6:26, Barry Song 写道: > >>>>>>>>> On Tue, Jun 4, 2024 at 9:15 PM <yangge1116@xxxxxxx> wrote: > >>>>>>>>>> > >>>>>>>>>> From: yangge <yangge1116@xxxxxxx> > >>>>>>>>>> > >>>>>>>>>> Since commit 5d0a661d808f ("mm/page_alloc: use only one PCP > >>>>>>>>>> list for > >>>>>>>>>> THP-sized allocations") no longer differentiates the migration > >>>>>>>>>> type > >>>>>>>>>> of pages in THP-sized PCP list, it's possible to get a CMA > >>>>>>>>>> page from > >>>>>>>>>> the list, in some cases, it's not acceptable, for example, > >>>>>>>>>> allocating > >>>>>>>>>> a non-CMA page with PF_MEMALLOC_PIN flag returns a CMA page. > >>>>>>>>>> > >>>>>>>>>> The patch forbids allocating non-CMA THP-sized page from > >>>>>>>>>> THP-sized > >>>>>>>>>> PCP list to avoid the issue above. > >>>>>>>>> > >>>>>>>>> Could you please describe the impact on users in the commit log? > >>>>>>>> > >>>>>>>> If a large number of CMA memory are configured in the system (for > >>>>>>>> example, the CMA memory accounts for 50% of the system memory), > >>>>>>>> starting > >>>>>>>> virtual machine with device passthrough will get stuck. > >>>>>>>> > >>>>>>>> During starting virtual machine, it will call > >>>>>>>> pin_user_pages_remote(..., > >>>>>>>> FOLL_LONGTERM, ...) to pin memory. If a page is in CMA area, > >>>>>>>> pin_user_pages_remote() will migrate the page from CMA area to > >>>>>>>> non-CMA > >>>>>>>> area because of FOLL_LONGTERM flag. If non-movable allocation > >>>>>>>> requests > >>>>>>>> return CMA memory, pin_user_pages_remote() will enter endless > >>>>>>>> loops. > >>>>>>>> > >>>>>>>> backtrace: > >>>>>>>> pin_user_pages_remote > >>>>>>>> ----__gup_longterm_locked //cause endless loops in this function > >>>>>>>> --------__get_user_pages_locked > >>>>>>>> --------check_and_migrate_movable_pages //always check fail and > >>>>>>>> continue > >>>>>>>> to migrate > >>>>>>>> ------------migrate_longterm_unpinnable_pages > >>>>>>>> ----------------alloc_migration_target // non-movable allocation > >>>>>>>> > >>>>>>>>> Is it possible that some CMA memory might be used by non-movable > >>>>>>>>> allocation requests? > >>>>>>>> > >>>>>>>> Yes. > >>>>>>>> > >>>>>>>> > >>>>>>>>> If so, will CMA somehow become unable to migrate, causing > >>>>>>>>> cma_alloc() > >>>>>>>>> to fail? > >>>>>>>> > >>>>>>>> > >>>>>>>> No, it will cause endless loops in __gup_longterm_locked(). If > >>>>>>>> non-movable allocation requests return CMA memory, > >>>>>>>> migrate_longterm_unpinnable_pages() will migrate a CMA page to > >>>>>>>> another > >>>>>>>> CMA page, which is useless and cause endless loops in > >>>>>>>> __gup_longterm_locked(). > >>>>>> > >>>>>> This is only one perspective. We also need to consider the impact on > >>>>>> CMA itself. For example, > >>>>>> when CMA is borrowed by THP, and we need to reclaim it through > >>>>>> cma_alloc() or dma_alloc_coherent(), > >>>>>> we must move those pages out to ensure CMA's users can retrieve that > >>>>>> contiguous memory. > >>>>>> > >>>>>> Currently, CMA's memory is occupied by non-movable pages, meaning we > >>>>>> can't relocate them. > >>>>>> As a result, cma_alloc() is more likely to fail. > >>>>>> > >>>>>>>> > >>>>>>>> backtrace: > >>>>>>>> pin_user_pages_remote > >>>>>>>> ----__gup_longterm_locked //cause endless loops in this function > >>>>>>>> --------__get_user_pages_locked > >>>>>>>> --------check_and_migrate_movable_pages //always check fail and > >>>>>>>> continue > >>>>>>>> to migrate > >>>>>>>> ------------migrate_longterm_unpinnable_pages > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>>>> > >>>>>>>>>> Fixes: 5d0a661d808f ("mm/page_alloc: use only one PCP list for > >>>>>>>>>> THP-sized allocations") > >>>>>>>>>> Signed-off-by: yangge <yangge1116@xxxxxxx> > >>>>>>>>>> --- > >>>>>>>>>> mm/page_alloc.c | 10 ++++++++++ > >>>>>>>>>> 1 file changed, 10 insertions(+) > >>>>>>>>>> > >>>>>>>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c > >>>>>>>>>> index 2e22ce5..0bdf471 100644 > >>>>>>>>>> --- a/mm/page_alloc.c > >>>>>>>>>> +++ b/mm/page_alloc.c > >>>>>>>>>> @@ -2987,10 +2987,20 @@ struct page *rmqueue(struct zone > >>>>>>>>>> *preferred_zone, > >>>>>>>>>> WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > >>>>>>>>>> > 1)); > >>>>>>>>>> > >>>>>>>>>> if (likely(pcp_allowed_order(order))) { > >>>>>>>>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE > >>>>>>>>>> + if (!IS_ENABLED(CONFIG_CMA) || alloc_flags & > >>>>>>>>>> ALLOC_CMA || > >>>>>>>>>> + order != > >>>>>>>>>> HPAGE_PMD_ORDER) { > >>>>>>>>>> + page = rmqueue_pcplist(preferred_zone, > >>>>>>>>>> zone, > >>>>>>>>>> order, > >>>>>>>>>> + migratetype, > >>>>>>>>>> alloc_flags); > >>>>>>>>>> + if (likely(page)) > >>>>>>>>>> + goto out; > >>>>>>>>>> + } > >>>>>>>>> > >>>>>>>>> This seems not ideal, because non-CMA THP gets no chance to use > >>>>>>>>> PCP. > >>>>>>>>> But it > >>>>>>>>> still seems better than causing the failure of CMA allocation. > >>>>>>>>> > >>>>>>>>> Is there a possible approach to avoiding adding CMA THP into > >>>>>>>>> pcp from > >>>>>>>>> the first > >>>>>>>>> beginning? Otherwise, we might need a separate PCP for CMA. > >>>>>>>>> > >>>>>>> > >>>>>>> The vast majority of THP-sized allocations are GFP_MOVABLE, avoiding > >>>>>>> adding CMA THP into pcp may incur a slight performance penalty. > >>>>>>> > >>>>>> > >>>>>> But the majority of movable pages aren't CMA, right? > >>>>> > >>>>>> Do we have an estimate for > >>>>>> adding back a CMA THP PCP? Will per_cpu_pages introduce a new > >>>>>> cacheline, which > >>>>>> the original intention for THP was to avoid by having only one > >>>>>> PCP[1]? > >>>>>> > >>>>>> [1] > >>>>>> https://patchwork.kernel.org/project/linux-mm/patch/20220624125423.6126-3-mgorman@xxxxxxxxxxxxxxxxxxx/ > >>>>>> > >>>>> > >>>>> The size of struct per_cpu_pages is 256 bytes in current code > >>>>> containing > >>>>> commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for > >>>>> THP-sized > >>>>> allocations"). > >>>>> crash> struct per_cpu_pages > >>>>> struct per_cpu_pages { > >>>>> spinlock_t lock; > >>>>> int count; > >>>>> int high; > >>>>> int high_min; > >>>>> int high_max; > >>>>> int batch; > >>>>> u8 flags; > >>>>> u8 alloc_factor; > >>>>> u8 expire; > >>>>> short free_count; > >>>>> struct list_head lists[13]; > >>>>> } > >>>>> SIZE: 256 > >>>>> > >>>>> After revert commit 5d0a661d808f ("mm/page_alloc: use only one PCP > >>>>> list > >>>>> for THP-sized allocations"), the size of struct per_cpu_pages is > >>>>> 272 bytes. > >>>>> crash> struct per_cpu_pages > >>>>> struct per_cpu_pages { > >>>>> spinlock_t lock; > >>>>> int count; > >>>>> int high; > >>>>> int high_min; > >>>>> int high_max; > >>>>> int batch; > >>>>> u8 flags; > >>>>> u8 alloc_factor; > >>>>> u8 expire; > >>>>> short free_count; > >>>>> struct list_head lists[15]; > >>>>> } > >>>>> SIZE: 272 > >>>>> > >>>>> Seems commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for > >>>>> THP-sized allocations") decrease one cacheline. > >>>> > >>>> the proposal is not reverting the patch but adding one CMA pcp. > >>>> so it is "struct list_head lists[14]"; in this case, the size is still > >>>> 256? > >>>> > >>> > >>> Yes, the size is still 256. If add one PCP list, we will have 2 PCP > >>> lists for THP. One PCP list is used by MIGRATE_UNMOVABLE, and the other > >>> PCP list is used by MIGRATE_MOVABLE and MIGRATE_RECLAIMABLE. Is that > >>> right? > >> > >> i am not quite sure about MIGRATE_RECLAIMABLE as we want to > >> CMA is only used by movable. > >> So it might be: > >> MOVABLE and NON-MOVABLE. > > > > One PCP list is used by UNMOVABLE pages, and the other PCP list is used > > by MOVABLE pages, seems it is feasible. UNMOVABLE PCP list contains > > MIGRATE_UNMOVABLE pages and MIGRATE_RECLAIMABLE pages, and MOVABLE PCP > > list contains MIGRATE_MOVABLE pages. > > > > Is the following modification feasiable? > > #ifdef CONFIG_TRANSPARENT_HUGEPAGE > -#define NR_PCP_THP 1 > +#define NR_PCP_THP 2 > +#define PCP_THP_MOVABLE 0 > +#define PCP_THP_UNMOVABLE 1 > #else > #define NR_PCP_THP 0 > #endif > > static inline unsigned int order_to_pindex(int migratetype, int order) > { > + int pcp_type = migratetype; > + > #ifdef CONFIG_TRANSPARENT_HUGEPAGE > if (order > PAGE_ALLOC_COSTLY_ORDER) { > VM_BUG_ON(order != HPAGE_PMD_ORDER); > - return NR_LOWORDER_PCP_LISTS; > + > + if (migratetype != MIGRATE_MOVABLE) > + pcp_type = PCP_THP_UNMOVABLE; > + else > + pcp_type = PCP_THP_MOVABLE; > + > + return NR_LOWORDER_PCP_LISTS + pcp_type; > } > #else > VM_BUG_ON(order > PAGE_ALLOC_COSTLY_ORDER); > #endif > > - return (MIGRATE_PCPTYPES * order) + migratetype; > + return (MIGRATE_PCPTYPES * order) + pcp_type; > } > a minimum change might be, then you can drop most new code. diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 120a317d0938..cfe1e0625e38 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -588,6 +588,7 @@ static void bad_page(struct page *page, const char *reason) static inline unsigned int order_to_pindex(int migratetype, int order) { + bool __maybe_unused movable; #ifdef CONFIG_CMA /* * We shouldn't get here for MIGRATE_CMA if those pages don't @@ -600,7 +601,8 @@ static inline unsigned int order_to_pindex(int migratetype, int order) #ifdef CONFIG_TRANSPARENT_HUGEPAGE if (order > PAGE_ALLOC_COSTLY_ORDER) { VM_BUG_ON(order != pageblock_order); - return NR_LOWORDER_PCP_LISTS; + movable = migratetype == MIGRATE_MOVABLE; + return NR_LOWORDER_PCP_LISTS + movable; } #else VM_BUG_ON(order > PAGE_ALLOC_COSTLY_ORDER); > > > @@ -521,7 +529,7 @@ static inline int pindex_to_order(unsigned int pindex) > int order = pindex / MIGRATE_PCPTYPES; > > #ifdef CONFIG_TRANSPARENT_HUGEPAGE > - if (pindex == NR_LOWORDER_PCP_LISTS) > + if (order > PAGE_ALLOC_COSTLY_ORDER) > order = HPAGE_PMD_ORDER; > #else > VM_BUG_ON(order > PAGE_ALLOC_COSTLY_ORDER); > > > > >> > >>> > >>>> > >>>>> > >>>>>> > >>>>>>> Commit 1d91df85f399 takes a similar approach to filter, and I mainly > >>>>>>> refer to it. > >>>>>>> > >>>>>>> > >>>>>>>>>> +#else > >>>>>>>>>> page = rmqueue_pcplist(preferred_zone, > >>>>>>>>>> zone, order, > >>>>>>>>>> migratetype, > >>>>>>>>>> alloc_flags); > >>>>>>>>>> if (likely(page)) > >>>>>>>>>> goto out; > >>>>>>>>>> +#endif > >>>>>>>>>> } > >>>>>>>>>> > >>>>>>>>>> page = rmqueue_buddy(preferred_zone, zone, order, > >>>>>>>>>> alloc_flags, > >>>>>>>>>> -- > >>>>>>>>>> 2.7.4 > >>>>>>>>> > >>>>>>>>> Thanks > >>>>>>>>> Barry > >>>>>>>>> > >>>>>>> > >>>>>>> > >>>>> > >>> >