Hi Mike Rapoport, Andrew Morton,

I have recently been studying the mm subsystem of the Linux kernel and came across memblock_add_range(), whose implementation piqued my interest. I analyzed it and identified some room for optimization, starting with this part of the code:

	if (type->cnt * 2 + 1 <= type->max)
		insert = true;

The idea here is good, but it has a flaw. The condition is rather restrictive and cannot take effect for the initial insertions. Moreover, it only helps when the array has room for at least type->cnt * 2 + 1 regions, which corresponds to the worst case where the new range overlaps every existing region. Even when there is actually enough free space for the insertion at hand, if that bound is not met the insertion loop still has to run twice (once to count, once to insert).

So I came up with a solution: delayed allocation. Since the region array grows exponentially (it doubles on each expansion), around four expansions should be enough to cover the memory operations needed early in boot, before the buddy system takes over. Therefore, assuming memory is adequate at the start, insertion can happen right away. If the array runs out of slots during the insertion, we record the operation instead of performing it (the insertion turns into a record operation): we log the start address of the remaining uninserted space and the number of insertions still needed (this usually happens while resolving overlaps). After recording, the array is grown, and the insertion is attempted again. The benefit of this approach is that it significantly reduces the time cost, and because the start address of the remaining work is recorded, the next attempt does not have to start from scratch, somewhat like resuming a checkpointed transfer.

I optimized memblock_add_range() along these lines. I then tested it in a qemu-arm environment, and it worked properly. I also ran the test suite in linux/tools/testing/memblock, and it passed all the test cases.

I used perf for performance profiling; here are my measurements (for reference, my CPU is a 13th Gen Intel(R) Core(TM) i7-13700).

The performance of memblock_add_range before the modification:

	Samples: 3K of event 'cycles', Event count (approx.): 3853609007
	  Children      Self  Command  Shared Object  Symbol
	     1.32%     1.32%  main     main           [.] memblock_add_range.isra.0

After the modification:

	Samples: 3K of event 'cycles', Event count (approx.): 3839056584
	  Children      Self  Command  Shared Object  Symbol
	     0.67%     0.67%  main     main           [.] memblock_add_range.isra.0

In the best observed run it drops to 0.38%:

	Samples: 3K of event 'cycles', Event count (approx.): 3839056584
	  Children      Self  Command  Shared Object  Symbol
	     0.38%     0.38%  main     main           [.] memblock_add_range.isra.0

To measure the best-case and average overhead, I wrote a shell script that runs the two versions of the code (before and after the modification), executes each one 100 times, and parses the perf reports to compute the average, minimum, and maximum share of cycles spent in memblock_add_range. Here are the results from the script on my i7-13700:

- Before the patch:
  - Average: 1.22%
  - Max: 1.63%, Min: 0.93%
- After the patch:
  - Average: 0.69%
  - Max: 0.94%, Min: 0.50%

The patch lowers the average CPU share of this function from 1.22% to 0.69%, a reduction of 0.53 percentage points (roughly 43% of the time previously spent there), and the worst case after the patch (0.94%) is now close to the best case before it (0.93%).
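Before getting to the script and the patch, here is a rough sketch of the tail of the insertion path to make the "record and resume" idea above more concrete. It is illustrative only: nr_missing and resume_base are made-up names for this sketch (standing for the recorded number of missing regions and the checkpointed base address), and the real change is in the patch at the end of this mail; memblock_double_array() is the existing helper that grows the region array.

	/*
	 * Sketch only -- not the actual diff.  The insertion loop above is
	 * assumed to insert optimistically and, whenever the region array
	 * has no free slot, to turn the failed insert into a record:
	 * nr_missing counts the regions that could not be inserted and
	 * resume_base remembers the base address where it stopped.
	 */
	if (nr_missing) {
		/* grow the array until the recorded leftovers fit */
		while (type->cnt + nr_missing > type->max)
			if (memblock_double_array(type, obase, size) < 0)
				return -ENOMEM;

		base = resume_base;	/* resume from the checkpoint, not from obase */
		nr_missing = 0;
		goto repeat;
	}

	return 0;

Compared with the current code, when the type->cnt * 2 + 1 check does not hold the overlap walk no longer has to run twice unconditionally; only the part that could not be completed is repeated after the array has grown.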
Here is my test script (it should be run from the linux/tools/testing/memblock directory, where the ./main test binary it profiles is built):

#!/bin/bash
# Run the memblock test binary under perf TOTAL_RUNS times and report the
# average/min/max share of cycles spent in memblock_add_range().

PERF_DATA="perf.data"
TOTAL_RUNS=100

CHILDREN_PERCENTAGE=0
SELF_PERCENTAGE=0
CHILDREN_AVERAGE=0
SELF_AVERAGE=0
CHILDRENS=()
SELFS=()
MIN_CHILDREN=0
MAX_CHILDREN=0
MIN_SELF=0
MAX_SELF=0

# Print a message and append it to perf_test.log
function log() {
	echo -e $*
	echo -e $* >> perf_test.log
}

if [ -f "./perf_test.log" ]; then
	rm "./perf_test.log"
fi
touch perf_test.log

for i in $(seq 1 $TOTAL_RUNS)
do
	sudo perf record -e cycles -g ./main > /dev/null 2>&1

	# Take the Children/Self columns from the second line matching the
	# symbol, falling back to the first line if there is no second match.
	read CHILDREN SELF <<< $(sudo perf report | grep "memblock_add_range.isra.0" | awk 'NR==2 {print $1, $2}' | sed 's/%//g')
	if [ -z "$CHILDREN" ]; then
		read CHILDREN SELF <<< $(sudo perf report | grep "memblock_add_range.isra.0" | awk 'NR==1 {print $1, $2}' | sed 's/%//g')
	fi

	# Initialize the minimums on the first run, then track min/max.
	if [ "$MIN_CHILDREN" == 0 ]; then
		MIN_CHILDREN=$CHILDREN
	fi
	if [ "$MIN_SELF" == 0 ]; then
		MIN_SELF=$SELF
	fi
	if (( $(echo "$CHILDREN > $MAX_CHILDREN" | bc -l) )); then
		MAX_CHILDREN=$CHILDREN
	elif (( $(echo "$CHILDREN < $MIN_CHILDREN" | bc -l) )); then
		MIN_CHILDREN=$CHILDREN
	fi
	if (( $(echo "$SELF > $MAX_SELF" | bc -l) )); then
		MAX_SELF=$SELF
	elif (( $(echo "$SELF < $MIN_SELF" | bc -l) )); then
		MIN_SELF=$SELF
	fi

	log "($i) memblock_add_range.isra.0: Children <$CHILDREN>, Self <$SELF>"
	CHILDRENS+=($CHILDREN)
	SELFS+=($SELF)
	sudo rm -f $PERF_DATA
done

# Sum the per-run percentages and compute the averages.
for PERCENT in "${CHILDRENS[@]}"
do
	CHILDREN_PERCENTAGE=$(echo "$CHILDREN_PERCENTAGE + $PERCENT" | bc)
done
for PERCENT in "${SELFS[@]}"
do
	SELF_PERCENTAGE=$(echo "$SELF_PERCENTAGE + $PERCENT" | bc)
done

CHILDREN_AVERAGE=$(echo "scale=2; $CHILDREN_PERCENTAGE / $TOTAL_RUNS" | bc)
CHILDREN_AVERAGE=$(printf "%.2f" $CHILDREN_AVERAGE)
SELF_AVERAGE=$(echo "scale=2; $SELF_PERCENTAGE / $TOTAL_RUNS" | bc)
SELF_AVERAGE=$(printf "%.2f" $SELF_AVERAGE)

log ""
log "Result report:"
log "memblock_add_range.isra.0 Average: Children (Ave<$CHILDREN_AVERAGE%>, Min<$MIN_CHILDREN%>, Max<$MAX_CHILDREN%>) Self (Ave<$SELF_AVERAGE%>, Min<$MIN_SELF%>, Max<$MAX_SELF%>)"

Here is my patch: