On 2024/6/18 12:10 PM, Barry Song wrote:
On Tue, Jun 18, 2024 at 3:32 PM yangge1116 <yangge1116@xxxxxxx> wrote:
On 2024/6/18 9:55 AM, Barry Song wrote:
On Tue, Jun 18, 2024 at 9:36 AM yangge1116 <yangge1116@xxxxxxx> wrote:
On 2024/6/17 8:47 PM, yangge1116 wrote:
On 2024/6/17 6:26 PM, Barry Song wrote:
On Tue, Jun 4, 2024 at 9:15 PM <yangge1116@xxxxxxx> wrote:
From: yangge <yangge1116@xxxxxxx>
Since commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for
THP-sized allocations") no longer differentiates the migration type
of pages in the THP-sized PCP list, it is possible to get a CMA page
from the list. In some cases this is not acceptable; for example, a
request for a non-CMA page made with the PF_MEMALLOC_PIN flag can still
return a CMA page.
This patch forbids allocating non-CMA THP-sized pages from the THP-sized
PCP list to avoid the issue above.
Could you please describe the impact on users in the commit log?
If a large amount of CMA memory is configured in the system (for
example, CMA memory accounts for 50% of system memory), starting a
virtual machine with device passthrough will get stuck.
While starting the virtual machine, it calls pin_user_pages_remote(...,
FOLL_LONGTERM, ...) to pin memory. If a page is in a CMA area,
pin_user_pages_remote() will migrate the page from the CMA area to a
non-CMA area because of the FOLL_LONGTERM flag. If a non-movable
allocation request returns CMA memory, pin_user_pages_remote() will
enter an endless loop.
backtrace:
pin_user_pages_remote
----__gup_longterm_locked // causes endless loops in this function
--------__get_user_pages_locked
--------check_and_migrate_movable_pages // the check always fails, so it keeps migrating
------------migrate_longterm_unpinnable_pages
----------------alloc_migration_target // non-movable allocation
Is it possible that some CMA memory might be used by non-movable
allocation requests?
Yes.
If so, will CMA somehow become unable to migrate, causing cma_alloc()
to fail?
No, it will cause endless loops in __gup_longterm_locked(). If a
non-movable allocation request returns CMA memory,
migrate_longterm_unpinnable_pages() will migrate the CMA page to another
CMA page, which is useless and causes endless loops in
__gup_longterm_locked().
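Below is a minimal, self-contained userspace sketch of the loop described
above. Every name in it is a toy stand-in (the retry loop stands in for
__gup_longterm_locked(), check_and_migrate() for
check_and_migrate_movable_pages(), and alloc_target_is_cma() for the buggy
behavior of the migration target allocation); it is not the real kernel
code, and only shows why the retry never converges when the migration
target can itself be a CMA page again:

#include <stdbool.h>
#include <stdio.h>

#define TOY_EAGAIN	11
#define MAX_RETRIES	10	/* the real loop has no such cap; it only
				   exists here so the demo terminates */

/* Stand-in for the buggy allocation: a "non-movable" migration target
 * request can hand back a CMA page again. */
static bool alloc_target_is_cma(void)
{
	return true;
}

/* One pass of "check the pinned page, migrate it if it is in CMA". */
static int check_and_migrate(bool *page_is_cma)
{
	if (!*page_is_cma)
		return 0;			/* nothing left to migrate */
	*page_is_cma = alloc_target_is_cma();	/* "migrate" to a new page */
	return TOY_EAGAIN;			/* caller must check again */
}

int main(void)
{
	bool page_is_cma = true;	/* the pinned page starts out in CMA */
	int tries = 0;

	/* Retry loop standing in for __gup_longterm_locked(). */
	while (check_and_migrate(&page_is_cma) == TOY_EAGAIN) {
		if (++tries >= MAX_RETRIES) {
			printf("still looping after %d tries: the migration "
			       "target is always a CMA page again\n", tries);
			return 1;
		}
	}
	printf("converged after %d tries\n", tries);
	return 0;
}

Compiled and run, the toy hits the artificial retry cap instead of
converging; the real loop has no such cap, hence the hang.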
This is only one perspective. We also need to consider the impact on
CMA itself. For example, when CMA is borrowed by THP and we need to
reclaim it through cma_alloc() or dma_alloc_coherent(), we must move
those pages out to ensure CMA's users can retrieve that contiguous
memory. With the current behavior, CMA's memory can be occupied by
non-movable pages, meaning we can't relocate them. As a result,
cma_alloc() is more likely to fail.
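As a rough illustration of that point, here is a tiny userspace model
(all names are made-up stand-ins, not kernel APIs): vacating a contiguous
range for the CMA owner succeeds only if every borrowed page in it is free
or can be migrated away, so a single non-movable page blocks the whole
range:

#include <stdbool.h>
#include <stdio.h>

enum page_state { PAGE_FREE, PAGE_MOVABLE, PAGE_UNMOVABLE };

/* Vacating succeeds only if every page is free or migratable. */
static bool try_vacate_range(const enum page_state *pages, int n)
{
	for (int i = 0; i < n; i++)
		if (pages[i] == PAGE_UNMOVABLE)
			return false;
	return true;
}

int main(void)
{
	enum page_state clean[] = { PAGE_FREE, PAGE_MOVABLE, PAGE_MOVABLE, PAGE_FREE };
	enum page_state dirty[] = { PAGE_FREE, PAGE_MOVABLE, PAGE_UNMOVABLE, PAGE_FREE };

	printf("only movable/free pages: vacate %s\n",
	       try_vacate_range(clean, 4) ? "succeeds" : "fails");
	printf("one unmovable page:      vacate %s\n",
	       try_vacate_range(dirty, 4) ? "succeeds" : "fails");
	return 0;
}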
Fixes: 5d0a661d808f ("mm/page_alloc: use only one PCP list for THP-sized allocations")
Signed-off-by: yangge <yangge1116@xxxxxxx>
---
mm/page_alloc.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2e22ce5..0bdf471 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2987,10 +2987,20 @@ struct page *rmqueue(struct zone *preferred_zone,
        WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));

        if (likely(pcp_allowed_order(order))) {
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+               if (!IS_ENABLED(CONFIG_CMA) || alloc_flags & ALLOC_CMA ||
+                   order != HPAGE_PMD_ORDER) {
+                       page = rmqueue_pcplist(preferred_zone, zone, order,
+                                       migratetype, alloc_flags);
+                       if (likely(page))
+                               goto out;
+               }
This seems not ideal, because non-CMA THP gets no chance to use the PCP.
But it still seems better than causing CMA allocation failures.
Is there a possible approach to avoid adding CMA THP into the PCP from
the very beginning? Otherwise, we might need a separate PCP for CMA.
The vast majority of THP-sized allocations are GFP_MOVABLE, so avoiding
adding CMA THP into the PCP may incur a slight performance penalty.
But the majority of movable pages aren't CMA, right? Do we have an
estimate of the cost of adding back a CMA THP PCP list? Would
per_cpu_pages grow into a new cacheline, which the original THP change
intended to avoid by having only one PCP list [1]?
[1] https://patchwork.kernel.org/project/linux-mm/patch/20220624125423.6126-3-mgorman@xxxxxxxxxxxxxxxxxxx/
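For reference, the arithmetic behind the cacheline question is just the
size of struct list_head (two pointers, 16 bytes on 64-bit); a quick
userspace check of how much each extra PCP list adds:

#include <stddef.h>
#include <stdio.h>

struct list_head { struct list_head *next, *prev; };	/* 16 bytes on 64-bit */

int main(void)
{
	size_t per_list = sizeof(struct list_head);

	printf("one struct list_head: %zu bytes\n", per_list);
	printf("13 -> 14 lists (one extra THP list): +%zu bytes\n", 1 * per_list);
	printf("13 -> 15 lists (full revert):        +%zu bytes\n", 2 * per_list);
	return 0;
}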
The size of struct per_cpu_pages is 256 bytes in the current code, which
contains commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for
THP-sized allocations").
crash> struct per_cpu_pages
struct per_cpu_pages {
    spinlock_t lock;
    int count;
    int high;
    int high_min;
    int high_max;
    int batch;
    u8 flags;
    u8 alloc_factor;
    u8 expire;
    short free_count;
    struct list_head lists[13];
}
SIZE: 256
After reverting commit 5d0a661d808f ("mm/page_alloc: use only one PCP list
for THP-sized allocations"), the size of struct per_cpu_pages is 272 bytes.
crash> struct per_cpu_pages
struct per_cpu_pages {
    spinlock_t lock;
    int count;
    int high;
    int high_min;
    int high_max;
    int batch;
    u8 flags;
    u8 alloc_factor;
    u8 expire;
    short free_count;
    struct list_head lists[15];
}
SIZE: 272
It seems commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for
THP-sized allocations") saves one cacheline.
The proposal is not to revert the patch but to add one CMA PCP list,
so it would be "struct list_head lists[14]"; in this case, is the size
still 256?
Yes, the size is still 256 bytes. If we add one PCP list, we will have 2
PCP lists for THP: one used by MIGRATE_UNMOVABLE, and the other used by
MIGRATE_MOVABLE and MIGRATE_RECLAIMABLE. Is that right?
Commit 1d91df85f399 takes a similar filtering approach, and I mainly
referred to it.
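A userspace sketch of the 14-list idea being discussed, with made-up names
(toy_order_to_pindex() is not the kernel's order_to_pindex()): orders 0..3
keep three lists each, and THP-sized pages are split across two lists. The
split shown is CMA vs everything else, as in Barry's proposal; the
UNMOVABLE vs MOVABLE|RECLAIMABLE split mentioned above would only change
the predicate:

#include <stdbool.h>
#include <stdio.h>

#define MIGRATE_PCPTYPES	3	/* UNMOVABLE, MOVABLE, RECLAIMABLE */
#define PAGE_ALLOC_COSTLY_ORDER	3
#define HPAGE_PMD_ORDER		9	/* x86-64 with 4K pages */

enum {
	THP_PCP_LIST = MIGRATE_PCPTYPES * (PAGE_ALLOC_COSTLY_ORDER + 1), /* 12 */
	THP_CMA_PCP_LIST,						  /* 13 */
	NR_TOY_PCP_LISTS						  /* 14 */
};

/* Map (order, migratetype, is_cma) to a PCP list index. */
static int toy_order_to_pindex(int order, int migratetype, bool is_cma)
{
	if (order == HPAGE_PMD_ORDER)
		return is_cma ? THP_CMA_PCP_LIST : THP_PCP_LIST;
	return MIGRATE_PCPTYPES * order + migratetype;
}

int main(void)
{
	printf("order-0 movable -> list %d\n", toy_order_to_pindex(0, 1, false));
	printf("THP, non-CMA    -> list %d\n",
	       toy_order_to_pindex(HPAGE_PMD_ORDER, 1, false));
	printf("THP, CMA        -> list %d\n",
	       toy_order_to_pindex(HPAGE_PMD_ORDER, 1, true));
	printf("total lists     -> %d\n", NR_TOY_PCP_LISTS);
	return 0;
}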
+#else
                page = rmqueue_pcplist(preferred_zone, zone, order,
                                migratetype, alloc_flags);
                if (likely(page))
                        goto out;
+#endif
}
page = rmqueue_buddy(preferred_zone, zone, order, alloc_flags,
--
2.7.4
Thanks
Barry