Re: [PATCH] mm/page_alloc: skip THP-sized PCP list when allocating non-CMA THP-sized page

On 2024/6/19 16:20, Barry Song wrote:
On Wed, Jun 19, 2024 at 5:35 PM Ge Yang <yangge1116@xxxxxxx> wrote:



On 2024/6/18 15:51, yangge1116 wrote:


On 2024/6/18 2:58 PM, Barry Song wrote:
On Tue, Jun 18, 2024 at 6:56 PM yangge1116 <yangge1116@xxxxxxx> wrote:



On 2024/6/18 12:10 PM, Barry Song wrote:
On Tue, Jun 18, 2024 at 3:32 PM yangge1116 <yangge1116@xxxxxxx> wrote:



On 2024/6/18 9:55 AM, Barry Song wrote:
On Tue, Jun 18, 2024 at 9:36 AM yangge1116 <yangge1116@xxxxxxx> wrote:



On 2024/6/17 8:47 PM, yangge1116 wrote:


On 2024/6/17 6:26 PM, Barry Song wrote:
On Tue, Jun 4, 2024 at 9:15 PM <yangge1116@xxxxxxx> wrote:

From: yangge <yangge1116@xxxxxxx>

Since commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for
THP-sized allocations") no longer differentiates the migration type of
pages in the THP-sized PCP list, it is possible to get a CMA page from
the list. In some cases this is not acceptable; for example, allocating
a non-CMA page with the PF_MEMALLOC_PIN flag may return a CMA page.

This patch forbids allocating a non-CMA THP-sized page from the
THP-sized PCP list to avoid the issue above.
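
For context, after that commit order_to_pindex() maps every THP-sized
allocation to the single NR_LOWORDER_PCP_LISTS slot regardless of
migratetype, so CMA and non-CMA THPs share one PCP list. A simplified
sketch of the mainline helper (matching the '-' lines in the diffs
below):

static inline unsigned int order_to_pindex(int migratetype, int order)
{
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
	if (order > PAGE_ALLOC_COSTLY_ORDER) {
		VM_BUG_ON(order != HPAGE_PMD_ORDER);
		/* All THP-sized pages land in one shared PCP list, so
		 * the migratetype (including MIGRATE_CMA) is lost here. */
		return NR_LOWORDER_PCP_LISTS;
	}
#else
	VM_BUG_ON(order > PAGE_ALLOC_COSTLY_ORDER);
#endif

	/* Low orders keep one PCP list per migratetype. */
	return (MIGRATE_PCPTYPES * order) + migratetype;
}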

Could you please describe the impact on users in the commit log?

If a large amount of CMA memory is configured in the system (for
example, CMA memory accounts for 50% of the system memory), starting a
virtual machine with device passthrough will get stuck.

While starting the virtual machine, it calls pin_user_pages_remote(...,
FOLL_LONGTERM, ...) to pin memory. If a page is in a CMA area,
pin_user_pages_remote() will migrate the page from the CMA area to a
non-CMA area because of the FOLL_LONGTERM flag. If non-movable
allocation requests return CMA memory, pin_user_pages_remote() will
enter an endless loop.

backtrace:
pin_user_pages_remote
----__gup_longterm_locked // causes endless loops in this function
--------__get_user_pages_locked
--------check_and_migrate_movable_pages // the check always fails and migration is retried
------------migrate_longterm_unpinnable_pages
----------------alloc_migration_target // non-movable allocation
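
For reference, a simplified sketch of the retry loop behind that
backtrace (roughly the mainline mm/gup.c structure; error handling and
unrelated details elided):

static long __gup_longterm_locked(struct mm_struct *mm, unsigned long start,
				  unsigned long nr_pages, struct page **pages,
				  int *locked, unsigned int gup_flags)
{
	unsigned int flags;
	long rc, nr_pinned_pages;

	if (!(gup_flags & FOLL_LONGTERM))
		return __get_user_pages_locked(mm, start, nr_pages, pages,
					       locked, gup_flags);

	flags = memalloc_pin_save();	/* sets PF_MEMALLOC_PIN */
	do {
		nr_pinned_pages = __get_user_pages_locked(mm, start, nr_pages,
							  pages, locked,
							  gup_flags);
		if (nr_pinned_pages <= 0) {
			rc = nr_pinned_pages;
			break;
		}
		/* Returns -EAGAIN while any pinned page is still not
		 * long-term pinnable (e.g. a CMA page). If the migration
		 * target allocated under PF_MEMALLOC_PIN is itself a CMA
		 * page, this loop never terminates. */
		rc = check_and_migrate_movable_pages(nr_pinned_pages, pages);
	} while (rc == -EAGAIN);
	memalloc_pin_restore(flags);

	return rc ? rc : nr_pinned_pages;
}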

Is it possible that some CMA memory might be used by non-movable allocation requests?

Yes.


If so, will CMA somehow become unable to migrate, causing cma_alloc() to fail?


No, it will cause endless loops in __gup_longterm_locked(). If
non-movable allocation requests return CMA memory,
migrate_longterm_unpinnable_pages() will migrate a CMA page to another
CMA page, which is useless and causes endless loops in
__gup_longterm_locked().
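
To make the failure mode concrete (assuming the mainline
folio_is_longterm_pinnable() semantics):

/* check_and_migrate_movable_pages() treats a pinned folio as unpinnable
 * when folio_is_longterm_pinnable() is false, which covers MIGRATE_CMA
 * pageblocks. If alloc_migration_target() hands back another CMA page,
 * taken from the shared THP-sized PCP list, the same folio fails the
 * check again on the next pass and the pin is retried forever. */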

This is only one perspective. We also need to consider the impact on
CMA itself. For example, when CMA is borrowed by THP, and we need to
reclaim it through cma_alloc() or dma_alloc_coherent(), we must move
those pages out to ensure CMA's users can retrieve that contiguous
memory.

Currently, CMA's memory is occupied by non-movable pages, meaning we
can't relocate them. As a result, cma_alloc() is more likely to fail.








Fixes: 5d0a661d808f ("mm/page_alloc: use only one PCP list for THP-sized allocations")
Signed-off-by: yangge <yangge1116@xxxxxxx>
---
 mm/page_alloc.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2e22ce5..0bdf471 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2987,10 +2987,20 @@ struct page *rmqueue(struct zone *preferred_zone,
	WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));

	if (likely(pcp_allowed_order(order))) {
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+		if (!IS_ENABLED(CONFIG_CMA) || alloc_flags & ALLOC_CMA ||
+						order != HPAGE_PMD_ORDER) {
+			page = rmqueue_pcplist(preferred_zone, zone, order,
+						migratetype, alloc_flags);
+			if (likely(page))
+				goto out;
+		}

This seems not ideal, because non-CMA THP gets no chance to use the
PCP. But it still seems better than causing CMA allocation failures.

Is there a possible approach to avoid adding CMA THP into the PCP from
the very beginning? Otherwise, we might need a separate PCP list for CMA.


The vast majority of THP-sized allocations are __GFP_MOVABLE, so
avoiding adding CMA THP into the PCP may incur a slight performance
penalty.


But the majority of movable pages aren't CMA, right?

Do we have an estimate for adding back a CMA THP PCP list? Will
per_cpu_pages grow into a new cacheline, which is what the original THP
change intended to avoid by having only one PCP list [1]?

[1] https://patchwork.kernel.org/project/linux-mm/patch/20220624125423.6126-3-mgorman@xxxxxxxxxxxxxxxxxxx/


The size of struct per_cpu_pages is 256 bytes in the current code,
which contains commit 5d0a661d808f ("mm/page_alloc: use only one PCP
list for THP-sized allocations").
crash> struct per_cpu_pages
struct per_cpu_pages {
        spinlock_t lock;
        int count;
        int high;
        int high_min;
        int high_max;
        int batch;
        u8 flags;
        u8 alloc_factor;
        u8 expire;
        short free_count;
        struct list_head lists[13];
}
SIZE: 256

After reverting commit 5d0a661d808f ("mm/page_alloc: use only one PCP
list for THP-sized allocations"), the size of struct per_cpu_pages is
272 bytes.
crash> struct per_cpu_pages
struct per_cpu_pages {
        spinlock_t lock;
        int count;
        int high;
        int high_min;
        int high_max;
        int batch;
        u8 flags;
        u8 alloc_factor;
        u8 expire;
        short free_count;
        struct list_head lists[15];
}
SIZE: 272

It seems commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for
THP-sized allocations") saves one cacheline.

The proposal is not to revert the patch but to add one CMA PCP list,
making it "struct list_head lists[14]"; in that case, is the size still
256?
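
For reference, the list-count arithmetic behind those sizes (macro
relationships as I read them from include/linux/mmzone.h; the mapping
to byte sizes is taken from the crash output above):

#define NR_LOWORDER_PCP_LISTS (MIGRATE_PCPTYPES * (PAGE_ALLOC_COSTLY_ORDER + 1))
#define NR_PCP_LISTS (NR_LOWORDER_PCP_LISTS + NR_PCP_THP)

/* MIGRATE_PCPTYPES == 3 and PAGE_ALLOC_COSTLY_ORDER == 3, so
 * NR_LOWORDER_PCP_LISTS == 12:
 *   NR_PCP_THP == 1 -> lists[13] (current code, 256 bytes above)
 *   NR_PCP_THP == 2 -> lists[14] (one extra THP list)
 *   full revert     -> lists[15] (272 bytes above)
 */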


Yes, the size is still 256. If we add one PCP list, we will have 2 PCP
lists for THP: one PCP list used by MIGRATE_UNMOVABLE, and the other
used by MIGRATE_MOVABLE and MIGRATE_RECLAIMABLE. Is that right?

I am not quite sure about MIGRATE_RECLAIMABLE, as we want CMA to be
used only by movable allocations. So it might be:
MOVABLE and NON-MOVABLE.

One PCP list used by UNMOVABLE pages and the other used by MOVABLE
pages seems feasible. The UNMOVABLE PCP list would contain
MIGRATE_UNMOVABLE and MIGRATE_RECLAIMABLE pages, and the MOVABLE PCP
list would contain MIGRATE_MOVABLE pages.


Is the following modification feasible?

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-#define NR_PCP_THP 1
+#define NR_PCP_THP 2
+#define PCP_THP_MOVABLE 0
+#define PCP_THP_UNMOVABLE 1
 #else
 #define NR_PCP_THP 0
 #endif

 static inline unsigned int order_to_pindex(int migratetype, int order)
 {
+	int pcp_type = migratetype;
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
	if (order > PAGE_ALLOC_COSTLY_ORDER) {
		VM_BUG_ON(order != HPAGE_PMD_ORDER);
-		return NR_LOWORDER_PCP_LISTS;
+
+		if (migratetype != MIGRATE_MOVABLE)
+			pcp_type = PCP_THP_UNMOVABLE;
+		else
+			pcp_type = PCP_THP_MOVABLE;
+
+		return NR_LOWORDER_PCP_LISTS + pcp_type;
	}
 #else
	VM_BUG_ON(order > PAGE_ALLOC_COSTLY_ORDER);
 #endif

-	return (MIGRATE_PCPTYPES * order) + migratetype;
+	return (MIGRATE_PCPTYPES * order) + pcp_type;
 }
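
If I read that right, the resulting pindex mapping would be (a sketch
of the intent, assuming NR_LOWORDER_PCP_LISTS == 12):

/* order <= PAGE_ALLOC_COSTLY_ORDER:
 *   pindex = MIGRATE_PCPTYPES * order + migratetype     (0..11)
 * order == HPAGE_PMD_ORDER:
 *   MIGRATE_MOVABLE                -> 12 + PCP_THP_MOVABLE   == 12
 *   anything else (e.g. UNMOVABLE,
 *   RECLAIMABLE)                   -> 12 + PCP_THP_UNMOVABLE == 13
 */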


A minimal change might be the following; then you can drop most of the new code.


Thanks, I will prepare the V2.


diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 120a317d0938..cfe1e0625e38 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -588,6 +588,7 @@ static void bad_page(struct page *page, const char *reason)

 static inline unsigned int order_to_pindex(int migratetype, int order)
 {
+	bool __maybe_unused movable;
 #ifdef CONFIG_CMA
	/*
	 * We shouldn't get here for MIGRATE_CMA if those pages don't
@@ -600,7 +601,8 @@ static inline unsigned int order_to_pindex(int migratetype, int order)
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
	if (order > PAGE_ALLOC_COSTLY_ORDER) {
		VM_BUG_ON(order != pageblock_order);
-		return NR_LOWORDER_PCP_LISTS;
+		movable = migratetype == MIGRATE_MOVABLE;
+		return NR_LOWORDER_PCP_LISTS + movable;
	}
 #else
	VM_BUG_ON(order > PAGE_ALLOC_COSTLY_ORDER);




@@ -521,7 +529,7 @@ static inline int pindex_to_order(unsigned int pindex)
	int order = pindex / MIGRATE_PCPTYPES;

 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	if (pindex == NR_LOWORDER_PCP_LISTS)
+	if (order > PAGE_ALLOC_COSTLY_ORDER)
		order = HPAGE_PMD_ORDER;
 #else
	VM_BUG_ON(order > PAGE_ALLOC_COSTLY_ORDER);
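
A quick sanity check of that pindex_to_order() change (my arithmetic,
assuming the two THP slots sit at pindex 12 and 13):

/* pindex 12: 12 / MIGRATE_PCPTYPES == 4 > PAGE_ALLOC_COSTLY_ORDER -> HPAGE_PMD_ORDER
 * pindex 13: 13 / MIGRATE_PCPTYPES == 4 > PAGE_ALLOC_COSTLY_ORDER -> HPAGE_PMD_ORDER
 * pindexes 0..11 map back to orders 0..3 as before.
 */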








Commit 1d91df85f399 takes a similar filtering approach, which I mainly
referred to.


+#else
		page = rmqueue_pcplist(preferred_zone, zone, order,
					migratetype, alloc_flags);
		if (likely(page))
			goto out;
+#endif
	}

	page = rmqueue_buddy(preferred_zone, zone, order, alloc_flags,
--
2.7.4

Thanks
Barry










