On 2025/2/10 16:34, David Hildenbrand wrote:
On 10.02.25 02:56, yangge1116@xxxxxxx wrote:
From: yangge <yangge1116@xxxxxxx>
For different CMAs, concurrent allocation of CMA memory ideally should not
require synchronization using locks. Currently, a global cma_mutex lock is
employed to synchronize all CMA allocations, which can impact the
performance of concurrent allocations across different CMAs.
To test the performance impact, follow these steps:
1. Boot the kernel with the command line argument hugetlb_cma=30G to
   allocate a 30GB CMA area specifically for huge page allocations.
   (Note: on my machine, which has 3 nodes, each node is initialized
   with 10G of CMA.)
2. Use the dd command with parameters if=/dev/zero of=/dev/shm/file bs=1G
   count=30 to fully utilize the CMA area by writing zeroes to a file in
   /dev/shm.
3. Open three terminals and execute the following commands
   simultaneously. (Note: each of these commands attempts to allocate
   10GB [2621440 * 4KB pages] of CMA memory.)
   On Terminal 1: time echo 2621440 > /sys/kernel/debug/cma/hugetlb1/alloc
   On Terminal 2: time echo 2621440 > /sys/kernel/debug/cma/hugetlb2/alloc
   On Terminal 3: time echo 2621440 > /sys/kernel/debug/cma/hugetlb3/alloc
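The three terminals above can equivalently be driven from a single shell by
backgrounding the writes. A minimal sketch (the alloc_all helper is invented
here; the hugetlb1..hugetlb3 debugfs paths are the ones from the steps above):

```shell
# Sketch: issue the three 10GB allocations concurrently from one shell
# instead of three terminals. The debugfs layout matches the steps above.
alloc_all() {
    base=$1    # e.g. /sys/kernel/debug/cma
    for area in hugetlb1 hugetlb2 hugetlb3; do
        ( time echo 2621440 > "$base/$area/alloc" ) &
    done
    wait    # all three allocations run in parallel
}
# On a machine booted with hugetlb_cma=30G:
# alloc_all /sys/kernel/debug/cma
```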
Hi,
I'm curious, what is the real workload you are trying to optimize for? I
assume this example here is just to have some measurement.
Some of our drivers require this optimization, but they have not been
merged into the mainline yet, so for now we exercise the change through
the CMA debug code.
Is concurrency within a single CMA area also a problem for your use case?
Yes. We will first optimize concurrent allocation performance across
multiple CMAs, and then optimize concurrency within a single CMA.
We attempt to allocate pages through the CMA debug interface and use the
time command to measure the duration of each allocation.
Performance comparison:
             Without this patch   With this patch
Terminal 1   ~7s                  ~7s
Terminal 2   ~14s                 ~8s
Terminal 3   ~21s                 ~7s
To solve the problem above, we use per-CMA locks to improve concurrent
allocation performance. This allows each CMA to be managed independently,
removing the need for a global lock and thus improving scalability and
performance.
Signed-off-by: yangge <yangge1116@xxxxxxx>
---
V2:
- update code and message suggested by Barry.
mm/cma.c | 7 ++++---
mm/cma.h | 1 +
2 files changed, 5 insertions(+), 3 deletions(-)
diff --git a/mm/cma.c b/mm/cma.c
index 34a4df2..a0d4d2f 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -34,7 +34,6 @@
struct cma cma_areas[MAX_CMA_AREAS];
unsigned int cma_area_count;
-static DEFINE_MUTEX(cma_mutex);
static int __init __cma_declare_contiguous_nid(phys_addr_t base,
phys_addr_t size, phys_addr_t limit,
@@ -175,6 +174,8 @@ static void __init cma_activate_area(struct cma *cma)
spin_lock_init(&cma->lock);
+ mutex_init(&cma->alloc_mutex);
+
#ifdef CONFIG_CMA_DEBUGFS
INIT_HLIST_HEAD(&cma->mem_head);
spin_lock_init(&cma->mem_head_lock);
@@ -813,9 +814,9 @@ static int cma_range_alloc(struct cma *cma, struct cma_memrange *cmr,
spin_unlock_irq(&cma->lock);
pfn = cmr->base_pfn + (bitmap_no << cma->order_per_bit);
- mutex_lock(&cma_mutex);
+ mutex_lock(&cma->alloc_mutex);
ret = alloc_contig_range(pfn, pfn + count, MIGRATE_CMA, gfp);
- mutex_unlock(&cma_mutex);
+ mutex_unlock(&cma->alloc_mutex);
As raised, a better approach might be to return -EAGAIN in case we hit
an isolated pageblock and deal with that more gracefully here (e.g., try
another block, or retry this one if there are no others left, ...).
In any case, this change here looks like an improvement.
Acked-by: David Hildenbrand <david@xxxxxxxxxx>