A larger page allocation/freeing batch number causes the code holding
zone->lock to run longer. If zone->lock is heavily contended at the same
time, latency spikes can occur even for casual page allocation/freeing.
Although reducing the batch number cannot make the zone->lock contention
itself any lighter, it effectively reduces the latency spikes.

To demonstrate this, I wrote a Python script:

import mmap

size = 6 * 1024**3
while True:
    mm = mmap.mmap(-1, size)
    mm[:] = b'\xff' * size
    mm.close()

Run this script 10 times in parallel (one way to launch the parallel
instances is sketched after the AMD results below) and measure the
allocation latency by tracing the duration of rmqueue_bulk() with the BCC
tool funclatency[0]:

  funclatency -T -i 600 rmqueue_bulk

Here are the results for both AMD and Intel CPUs.

AMD EPYC 7W83 64-Core Processor, single NUMA node, KVM virtual server
=====================================================================

- Default value of 5

     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 12       |                                        |
      1024 -> 2047       : 9116     |                                        |
      2048 -> 4095       : 2004     |                                        |
      4096 -> 8191       : 2497     |                                        |
      8192 -> 16383      : 2127     |                                        |
     16384 -> 32767      : 2483     |                                        |
     32768 -> 65535      : 10102    |                                        |
     65536 -> 131071     : 212730   |*******************                     |
    131072 -> 262143     : 314692   |*****************************           |
    262144 -> 524287     : 430058   |****************************************|
    524288 -> 1048575    : 224032   |********************                    |
   1048576 -> 2097151    : 73567    |******                                  |
   2097152 -> 4194303    : 17079    |*                                       |
   4194304 -> 8388607    : 3900     |                                        |
   8388608 -> 16777215   : 750      |                                        |
  16777216 -> 33554431   : 88       |                                        |
  33554432 -> 67108863   : 2        |                                        |

avg = 449775 nsecs, total: 587066511229 nsecs, count: 1305242

The average allocation latency is around 449us, and the maximum latency
exceeds 30ms.

- Value set to 0

     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 92       |                                        |
      1024 -> 2047       : 8594     |                                        |
      2048 -> 4095       : 2042818  |******                                  |
      4096 -> 8191       : 8737624  |**************************              |
      8192 -> 16383      : 13147872 |****************************************|
     16384 -> 32767      : 8799951  |**************************              |
     32768 -> 65535      : 2879715  |********                                |
     65536 -> 131071     : 659600   |**                                      |
    131072 -> 262143     : 204004   |                                        |
    262144 -> 524287     : 78246    |                                        |
    524288 -> 1048575    : 30800    |                                        |
   1048576 -> 2097151    : 12251    |                                        |
   2097152 -> 4194303    : 2950     |                                        |
   4194304 -> 8388607    : 78       |                                        |

avg = 19359 nsecs, total: 708638369918 nsecs, count: 36604636

The average latency is reduced significantly to around 19us, and the
maximum latency is reduced to less than 8ms.

- Conclusion

On this AMD CPU, reducing vm.pcp_batch_scale_max significantly helps
reduce latency. Latency-sensitive applications will benefit from this
tuning. However, I don't have access to other AMD CPU models, so I was
unable to test it on them.
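
As a reference for the "10 times in parallel" setup mentioned above, here
is a minimal sketch of one way to launch the parallel instances with
Python's multiprocessing module. The worker loop is the same
allocation/free loop as the test script; the wrapper itself (the worker()
helper, NPROC, and the use of multiprocessing) is only illustrative and
was not part of the original measurement tooling.

import mmap
from multiprocessing import Process

SIZE = 6 * 1024**3   # 6 GiB per worker, same as the test script above
NPROC = 10           # 10 parallel instances, matching the test setup

def worker():
    # Same loop as the test script: map, dirty and free a 6 GiB
    # anonymous region forever, exercising bulk page alloc/free.
    while True:
        mm = mmap.mmap(-1, SIZE)
        mm[:] = b'\xff' * SIZE
        mm.close()

if __name__ == '__main__':
    procs = [Process(target=worker) for _ in range(NPROC)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()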

Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz, two NUMA nodes
============================================================

- Default value of 5

     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 2419     |                                        |
      1024 -> 2047       : 34499    |*                                       |
      2048 -> 4095       : 4272     |                                        |
      4096 -> 8191       : 9035     |                                        |
      8192 -> 16383      : 4374     |                                        |
     16384 -> 32767      : 2963     |                                        |
     32768 -> 65535      : 6407     |                                        |
     65536 -> 131071     : 884806   |****************************************|
    131072 -> 262143     : 145931   |******                                  |
    262144 -> 524287     : 13406    |                                        |
    524288 -> 1048575    : 1874     |                                        |
   1048576 -> 2097151    : 249      |                                        |
   2097152 -> 4194303    : 28       |                                        |

avg = 96173 nsecs, total: 106778157925 nsecs, count: 1110263

- Conclusion

This Intel CPU works fine with the default setting.

Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz, single NUMA node
==============================================================

Using the cpuset cgroup, we can restrict the test script to run on NUMA
node 0 only.

- Default value of 5

     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 46       |                                        |
       512 -> 1023       : 695      |                                        |
      1024 -> 2047       : 19950    |*                                       |
      2048 -> 4095       : 1788     |                                        |
      4096 -> 8191       : 3392     |                                        |
      8192 -> 16383      : 2569     |                                        |
     16384 -> 32767      : 2619     |                                        |
     32768 -> 65535      : 3809     |                                        |
     65536 -> 131071     : 616182   |****************************************|
    131072 -> 262143     : 295587   |*******************                     |
    262144 -> 524287     : 75357    |****                                    |
    524288 -> 1048575    : 15471    |*                                       |
   1048576 -> 2097151    : 2939     |                                        |
   2097152 -> 4194303    : 243      |                                        |
   4194304 -> 8388607    : 3        |                                        |

avg = 144410 nsecs, total: 150281196195 nsecs, count: 1040651

The zone->lock contention becomes severe when there is only a single NUMA
node. The average latency is approximately 144us, with the maximum latency
exceeding 4ms.

- Value set to 0

     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 24       |                                        |
       512 -> 1023       : 2686     |                                        |
      1024 -> 2047       : 10246    |                                        |
      2048 -> 4095       : 4061529  |*********                               |
      4096 -> 8191       : 16894971 |****************************************|
      8192 -> 16383      : 6279310  |**************                          |
     16384 -> 32767      : 1658240  |***                                     |
     32768 -> 65535      : 445760   |*                                       |
     65536 -> 131071     : 110817   |                                        |
    131072 -> 262143     : 20279    |                                        |
    262144 -> 524287     : 4176     |                                        |
    524288 -> 1048575    : 436      |                                        |
   1048576 -> 2097151    : 8        |                                        |
   2097152 -> 4194303    : 2        |                                        |

avg = 8401 nsecs, total: 247739809022 nsecs, count: 29488508

After setting it to 0, the average latency is reduced to around 8us, and
the maximum latency is below 4ms.

- Conclusion

On this Intel CPU, the tuning doesn't help much: although the average
latency drops, the maximum latency remains at the millisecond level either
way. Latency-sensitive applications work well with the default setting.

Note that all of the above data were collected on an upstream kernel.

Why introduce a sysctl knob?
============================