Re: [PATCH v2 3/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max

"Huang, Ying" <ying.huang@xxxxxxxxx> · Mon, 29 Jul 2024 11:18:48 +0800

Hi, Yafang,

Yafang Shao <laoar.shao@xxxxxxxxx> writes:

> During my recent work to resolve latency spikes caused by zone->lock
> contention[0], I found that CONFIG_PCP_BATCH_SCALE_MAX is difficult to use
> in practice.

As we discussed before [1], I still find the description of zone->lock
contention confusing.  How about changing the description to something
like:

A larger page allocation/freeing batch number may lengthen the run time
of code holding zone->lock.  If zone->lock is heavily contended at the
same time, latency spikes may occur even for casual page
allocation/freeing.  Although reducing the batch number cannot make
zone->lock contention lighter, it can reduce the latency spikes
effectively.

[1] https://lore.kernel.org/linux-mm/87ttgv8hlz.fsf@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/
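
To make the batch-number argument concrete, here is a rough sketch of how
an upper limit on the scale factor bounds the number of pages handled in
one batch.  It is an illustration only.  The base batch of 63 pages and
the "base batch shifted left by the scale factor" upper bound are
assumptions made for this example, and max_batch() is a hypothetical
helper, not a kernel function.

```python
# Rough illustration only, not kernel code.
# ASSUMPTIONS (not taken from this thread): a base batch of 63 pages, and
# a maximum batch of base_batch << scale_max.

BASE_BATCH = 63  # hypothetical base PCP batch size, in pages


def max_batch(scale_max: int, base_batch: int = BASE_BATCH) -> int:
    """Largest batch if the batch may be doubled at most scale_max times."""
    if scale_max < 0:
        raise ValueError("scale_max must be non-negative")
    return base_batch << scale_max


if __name__ == "__main__":
    # Compare the two settings measured in the quoted test results.
    for scale in (0, 5):
        print(f"scale_max={scale}: up to {max_batch(scale)} pages per batch")
```

Under these assumptions, a scale limit of 5 allows a batch 32 times
larger than a limit of 0, so one lock hold can cover far more pages.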

> To demonstrate this, I wrote a Python script:
>
>   import mmap
>
>   size = 6 * 1024**3
>
>   while True:
>       mm = mmap.mmap(-1, size)
>       mm[:] = b'\xff' * size
>       mm.close()
>
> Run this script 10 times in parallel and measure the allocation latency by
> measuring the duration of rmqueue_bulk() with the BCC tools
> funclatency[1]:
>
>   funclatency -T -i 600 rmqueue_bulk
>
> Here are the results for both AMD and Intel CPUs.
>
> AMD EPYC 7W83 64-Core Processor, single NUMA node, KVM virtual server
> =====================================================================
>
> - Default value of 5
>
>      nsecs               : count     distribution
>          0 -> 1          : 0        |                                        |
>          2 -> 3          : 0        |                                        |
>          4 -> 7          : 0        |                                        |
>          8 -> 15         : 0        |                                        |
>         16 -> 31         : 0        |                                        |
>         32 -> 63         : 0        |                                        |
>         64 -> 127        : 0        |                                        |
>        128 -> 255        : 0        |                                        |
>        256 -> 511        : 0        |                                        |
>        512 -> 1023       : 12       |                                        |
>       1024 -> 2047       : 9116     |                                        |
>       2048 -> 4095       : 2004     |                                        |
>       4096 -> 8191       : 2497     |                                        |
>       8192 -> 16383      : 2127     |                                        |
>      16384 -> 32767      : 2483     |                                        |
>      32768 -> 65535      : 10102    |                                        |
>      65536 -> 131071     : 212730   |*******************                     |
>     131072 -> 262143     : 314692   |*****************************           |
>     262144 -> 524287     : 430058   |****************************************|
>     524288 -> 1048575    : 224032   |********************                    |
>    1048576 -> 2097151    : 73567    |******                                  |
>    2097152 -> 4194303    : 17079    |*                                       |
>    4194304 -> 8388607    : 3900     |                                        |
>    8388608 -> 16777215   : 750      |                                        |
>   16777216 -> 33554431   : 88       |                                        |
>   33554432 -> 67108863   : 2        |                                        |
>
> avg = 449775 nsecs, total: 587066511229 nsecs, count: 1305242
>
> The avg alloc latency can be 449us, and the max latency can be higher
> than 30ms.
>
> - Value set to 0
>
>      nsecs               : count     distribution
>          0 -> 1          : 0        |                                        |
>          2 -> 3          : 0        |                                        |
>          4 -> 7          : 0        |                                        |
>          8 -> 15         : 0        |                                        |
>         16 -> 31         : 0        |                                        |
>         32 -> 63         : 0        |                                        |
>         64 -> 127        : 0        |                                        |
>        128 -> 255        : 0        |                                        |
>        256 -> 511        : 0        |                                        |
>        512 -> 1023       : 92       |                                        |
>       1024 -> 2047       : 8594     |                                        |
>       2048 -> 4095       : 2042818  |******                                  |
>       4096 -> 8191       : 8737624  |**************************              |
>       8192 -> 16383      : 13147872 |****************************************|
>      16384 -> 32767      : 8799951  |**************************              |
>      32768 -> 65535      : 2879715  |********                                |
>      65536 -> 131071     : 659600   |**                                      |
>     131072 -> 262143     : 204004   |                                        |
>     262144 -> 524287     : 78246    |                                        |
>     524288 -> 1048575    : 30800    |                                        |
>    1048576 -> 2097151    : 12251    |                                        |
>    2097152 -> 4194303    : 2950     |                                        |
>    4194304 -> 8388607    : 78       |                                        |
>
> avg = 19359 nsecs, total: 708638369918 nsecs, count: 36604636
>
> The avg was reduced significantly to 19us, and the max latency is reduced
> to less than 8ms.
>
> - Conclusion
>
> On this AMD CPU, reducing vm.pcp_batch_scale_max significantly helps reduce
> latency. Latency-sensitive applications will benefit from this tuning.
>
> However, I don't have access to other types of AMD CPUs, so I was unable to
> test it on different AMD models.
>
> Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz, two NUMA nodes
> ============================================================
>
> - Default value of 5
>
>      nsecs               : count     distribution
>          0 -> 1          : 0        |                                        |
>          2 -> 3          : 0        |                                        |
>          4 -> 7          : 0        |                                        |
>          8 -> 15         : 0        |                                        |
>         16 -> 31         : 0        |                                        |
>         32 -> 63         : 0        |                                        |
>         64 -> 127        : 0        |                                        |
>        128 -> 255        : 0        |                                        |
>        256 -> 511        : 0        |                                        |
>        512 -> 1023       : 2419     |                                        |
>       1024 -> 2047       : 34499    |*                                       |
>       2048 -> 4095       : 4272     |                                        |
>       4096 -> 8191       : 9035     |                                        |
>       8192 -> 16383      : 4374     |                                        |
>      16384 -> 32767      : 2963     |                                        |
>      32768 -> 65535      : 6407     |                                        |
>      65536 -> 131071     : 884806   |****************************************|
>     131072 -> 262143     : 145931   |******                                  |
>     262144 -> 524287     : 13406    |                                        |
>     524288 -> 1048575    : 1874     |                                        |
>    1048576 -> 2097151    : 249      |                                        |
>    2097152 -> 4194303    : 28       |                                        |
>
> avg = 96173 nsecs, total: 106778157925 nsecs, count: 1110263
>
> - Conclusion
>
> This Intel CPU works fine with the default setting.
>
> Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz, single NUMA node
> ==============================================================
>
> Using the cpuset cgroup, we can restrict the test script to run on NUMA
> node 0 only.
>
> - Default value of 5
>
>      nsecs               : count     distribution
>          0 -> 1          : 0        |                                        |
>          2 -> 3          : 0        |                                        |
>          4 -> 7          : 0        |                                        |
>          8 -> 15         : 0        |                                        |
>         16 -> 31         : 0        |                                        |
>         32 -> 63         : 0        |                                        |
>         64 -> 127        : 0        |                                        |
>        128 -> 255        : 0        |                                        |
>        256 -> 511        : 46       |                                        |
>        512 -> 1023       : 695      |                                        |
>       1024 -> 2047       : 19950    |*                                       |
>       2048 -> 4095       : 1788     |                                        |
>       4096 -> 8191       : 3392     |                                        |
>       8192 -> 16383      : 2569     |                                        |
>      16384 -> 32767      : 2619     |                                        |
>      32768 -> 65535      : 3809     |                                        |
>      65536 -> 131071     : 616182   |****************************************|
>     131072 -> 262143     : 295587   |*******************                     |
>     262144 -> 524287     : 75357    |****                                    |
>     524288 -> 1048575    : 15471    |*                                       |
>    1048576 -> 2097151    : 2939     |                                        |
>    2097152 -> 4194303    : 243      |                                        |
>    4194304 -> 8388607    : 3        |                                        |
>
> avg = 144410 nsecs, total: 150281196195 nsecs, count: 1040651
>
> The zone->lock contention becomes severe when there is only a single NUMA
> node. The average latency is approximately 144us, with the maximum
> latency exceeding 4ms.
>
> - Value set to 0
>
>      nsecs               : count     distribution
>          0 -> 1          : 0        |                                        |
>          2 -> 3          : 0        |                                        |
>          4 -> 7          : 0        |                                        |
>          8 -> 15         : 0        |                                        |
>         16 -> 31         : 0        |                                        |
>         32 -> 63         : 0        |                                        |
>         64 -> 127        : 0        |                                        |
>        128 -> 255        : 0        |                                        |
>        256 -> 511        : 24       |                                        |
>        512 -> 1023       : 2686     |                                        |
>       1024 -> 2047       : 10246    |                                        |
>       2048 -> 4095       : 4061529  |*********                               |
>       4096 -> 8191       : 16894971 |****************************************|
>       8192 -> 16383      : 6279310  |**************                          |
>      16384 -> 32767      : 1658240  |***                                     |
>      32768 -> 65535      : 445760   |*                                       |
>      65536 -> 131071     : 110817   |                                        |
>     131072 -> 262143     : 20279    |                                        |
>     262144 -> 524287     : 4176     |                                        |
>     524288 -> 1048575    : 436      |                                        |
>    1048576 -> 2097151    : 8        |                                        |
>    2097152 -> 4194303    : 2        |                                        |
>
> avg = 8401 nsecs, total: 247739809022 nsecs, count: 29488508
>
> After setting it to 0, the avg latency is reduced to around 8us, and the
> max latency is less than 4ms.
>
> - Conclusion
>
> On this Intel CPU, this tuning doesn't help much. Latency-sensitive
> applications work well with the default setting.
>
> It is worth noting that all the above data were tested using the upstream
> kernel.
>
> Why introduce a sysctl knob?
> ============================
>
> From the above data, it's clear that different CPU types have varying
> allocation latencies concerning zone->lock contention. Typically, people
> don't release individual kernel packages for each type of x86_64 CPU.
>
> Furthermore, for latency-insensitive applications, we can keep the default
> setting for better throughput. In our production environment, we set this
> value to 0 for applications running on Kubernetes servers while keeping it
> at the default value of 5 for other applications like big data. It's not
> common to release individual kernel packages for each application.

Thanks for the detailed performance data!

Have you observed any downside to setting CONFIG_PCP_BATCH_SCALE_MAX to
0 in your environment?  If not, I suggest using 0 as the default for
CONFIG_PCP_BATCH_SCALE_MAX, because we have clear evidence that a larger
CONFIG_PCP_BATCH_SCALE_MAX hurts latency for some workloads.  After
that, if someone finds that some other workloads need a larger
CONFIG_PCP_BATCH_SCALE_MAX, we can make it tunable dynamically.
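
For reference, the single-node figures quoted above can be compared with
a small script like the one below.  It only re-derives the averages from
the "total" and "count" values in the funclatency summaries.  The names
RESULTS, avg_nsecs() and summarize() are made up for this sketch.

```python
# (total nsecs in rmqueue_bulk(), number of calls), copied from the
# funclatency summaries quoted above.  Keys 5 and 0 are the scale values.
RESULTS = {
    "AMD EPYC 7W83, single node": {
        5: (587066511229, 1305242),
        0: (708638369918, 36604636),
    },
    "Intel Xeon 8260, single node": {
        5: (150281196195, 1040651),
        0: (247739809022, 29488508),
    },
}


def avg_nsecs(total: int, count: int) -> int:
    """Average latency per call, truncated to whole nanoseconds."""
    return total // count


def summarize(results):
    """Return (machine, avg at 5, avg at 0, avg ratio, total-time ratio)."""
    rows = []
    for machine, by_scale in results.items():
        total5, count5 = by_scale[5]
        total0, count0 = by_scale[0]
        avg5 = avg_nsecs(total5, count5)
        avg0 = avg_nsecs(total0, count0)
        rows.append((machine, avg5, avg0, avg5 / avg0, total0 / total5))
    return rows


if __name__ == "__main__":
    for machine, avg5, avg0, avg_ratio, total_ratio in summarize(RESULTS):
        print(f"{machine}: avg {avg5} -> {avg0} nsecs "
              f"({avg_ratio:.1f}x lower), total time x{total_ratio:.2f}")
```

The last column is the ratio of total time spent in rmqueue_bulk() with
0 versus 5.  It may be a useful number to look at when checking for a
throughput downside.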

[snip]

--
Best Regards,
Huang, Ying