Yafang Shao <laoar.shao@xxxxxxxxx> writes:

> On Mon, Jul 29, 2024 at 2:04 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
>>
>> Yafang Shao <laoar.shao@xxxxxxxxx> writes:
>>
>> > On Mon, Jul 29, 2024 at 1:54 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
>> >>
>> >> Yafang Shao <laoar.shao@xxxxxxxxx> writes:
>> >>
>> >> > On Mon, Jul 29, 2024 at 1:16 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
>> >> >>
>> >> >> Yafang Shao <laoar.shao@xxxxxxxxx> writes:
>> >> >>
>> >> >> > On Mon, Jul 29, 2024 at 11:22 AM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
>> >> >> >>
>> >> >> >> Hi, Yafang,
>> >> >> >>
>> >> >> >> Yafang Shao <laoar.shao@xxxxxxxxx> writes:
>> >> >> >>
>> >> >> >> > During my recent work to resolve latency spikes caused by zone->lock
>> >> >> >> > contention[0], I found that CONFIG_PCP_BATCH_SCALE_MAX is difficult to use
>> >> >> >> > in practice.
>> >> >> >>
>> >> >> >> As we discussed before [1], I still find the description of the
>> >> >> >> zone->lock contention confusing.  How about changing the description to
>> >> >> >> something like,
>> >> >> >
>> >> >> > Sure, I will change it.
>> >> >> >
>> >> >> >>
>> >> >> >> A larger page allocation/freeing batch number may cause a longer run time
>> >> >> >> of the code holding zone->lock.  If zone->lock is heavily contended at the
>> >> >> >> same time, latency spikes may occur even for casual page allocation/freeing.
>> >> >> >> Although reducing the batch number cannot make zone->lock less contended,
>> >> >> >> it can reduce the latency spikes effectively.
>> >> >> >>
>> >> >> >> [1] https://lore.kernel.org/linux-mm/87ttgv8hlz.fsf@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/
>> >> >> >>
>> >> >> >> > To demonstrate this, I wrote a Python script:
>> >> >> >> >
>> >> >> >> >     import mmap
>> >> >> >> >
>> >> >> >> >     size = 6 * 1024**3
>> >> >> >> >
>> >> >> >> >     while True:
>> >> >> >> >         mm = mmap.mmap(-1, size)
>> >> >> >> >         mm[:] = b'\xff' * size
>> >> >> >> >         mm.close()
>> >> >> >> >
>> >> >> >> > Run this script 10 times in parallel and measure the allocation latency by
>> >> >> >> > measuring the duration of rmqueue_bulk() with the BCC tool
>> >> >> >> > funclatency[1]:
>> >> >> >> >
>> >> >> >> >     funclatency -T -i 600 rmqueue_bulk
>> >> >> >> >
>> >> >> >> > Here are the results for both AMD and Intel CPUs.
>> >> >> >> >
>> >> >> >> > AMD EPYC 7W83 64-Core Processor, single NUMA node, KVM virtual server
>> >> >> >> > =====================================================================
>> >> >> >> >
>> >> >> >> > - Default value of 5
>> >> >> >> >
>> >> >> >> >      nsecs               : count     distribution
>> >> >> >> >          0 -> 1          : 0        |                                        |
>> >> >> >> >          2 -> 3          : 0        |                                        |
>> >> >> >> >          4 -> 7          : 0        |                                        |
>> >> >> >> >          8 -> 15         : 0        |                                        |
>> >> >> >> >         16 -> 31         : 0        |                                        |
>> >> >> >> >         32 -> 63         : 0        |                                        |
>> >> >> >> >         64 -> 127        : 0        |                                        |
>> >> >> >> >        128 -> 255        : 0        |                                        |
>> >> >> >> >        256 -> 511        : 0        |                                        |
>> >> >> >> >        512 -> 1023       : 12       |                                        |
>> >> >> >> >       1024 -> 2047       : 9116     |                                        |
>> >> >> >> >       2048 -> 4095       : 2004     |                                        |
>> >> >> >> >       4096 -> 8191       : 2497     |                                        |
>> >> >> >> >       8192 -> 16383      : 2127     |                                        |
>> >> >> >> >      16384 -> 32767      : 2483     |                                        |
>> >> >> >> >      32768 -> 65535      : 10102    |                                        |
>> >> >> >> >      65536 -> 131071     : 212730   |*******************                     |
>> >> >> >> >     131072 -> 262143     : 314692   |*****************************           |
>> >> >> >> >     262144 -> 524287     : 430058   |****************************************|
>> >> >> >> >     524288 -> 1048575    : 224032   |********************                    |
>> >> >> >> >    1048576 -> 2097151    : 73567    |******                                  |
>> >> >> >> >    2097152 -> 4194303    : 17079    |*                                       |
>> >> >> >> >    4194304 -> 8388607    : 3900     |                                        |
>> >> >> >> >    8388608 -> 16777215   : 750      |                                        |
>> >> >> >> >   16777216 -> 33554431   : 88       |                                        |
>> >> >> >> >   33554432 -> 67108863   : 2        |                                        |
>> >> >> >> >
>> >> >> >> > avg = 449775 nsecs, total: 587066511229 nsecs, count: 1305242
>> >> >> >> >
>> >> >> >> > The avg alloc latency can be 449us, and the max latency can be higher
>> >> >> >> > than 30ms.
>> >> >> >> >
>> >> >> >> > - Value set to 0
>> >> >> >> >
>> >> >> >> >      nsecs               : count     distribution
>> >> >> >> >          0 -> 1          : 0        |                                        |
>> >> >> >> >          2 -> 3          : 0        |                                        |
>> >> >> >> >          4 -> 7          : 0        |                                        |
>> >> >> >> >          8 -> 15         : 0        |                                        |
>> >> >> >> >         16 -> 31         : 0        |                                        |
>> >> >> >> >         32 -> 63         : 0        |                                        |
>> >> >> >> >         64 -> 127        : 0        |                                        |
>> >> >> >> >        128 -> 255        : 0        |                                        |
>> >> >> >> >        256 -> 511        : 0        |                                        |
>> >> >> >> >        512 -> 1023       : 92       |                                        |
>> >> >> >> >       1024 -> 2047       : 8594     |                                        |
>> >> >> >> >       2048 -> 4095       : 2042818  |******                                  |
>> >> >> >> >       4096 -> 8191       : 8737624  |**************************              |
>> >> >> >> >       8192 -> 16383      : 13147872 |****************************************|
>> >> >> >> >      16384 -> 32767      : 8799951  |**************************              |
>> >> >> >> >      32768 -> 65535      : 2879715  |********                                |
>> >> >> >> >      65536 -> 131071     : 659600   |**                                      |
>> >> >> >> >     131072 -> 262143     : 204004   |                                        |
>> >> >> >> >     262144 -> 524287     : 78246    |                                        |
>> >> >> >> >     524288 -> 1048575    : 30800    |                                        |
>> >> >> >> >    1048576 -> 2097151    : 12251    |                                        |
>> >> >> >> >    2097152 -> 4194303    : 2950     |                                        |
>> >> >> >> >    4194304 -> 8388607    : 78       |                                        |
>> >> >> >> >
>> >> >> >> > avg = 19359 nsecs, total: 708638369918 nsecs, count: 36604636
>> >> >> >> >
>> >> >> >> > The avg was reduced significantly to 19us, and the max latency is reduced
>> >> >> >> > to less than 8ms.
>> >> >> >> >
>> >> >> >> > - Conclusion
>> >> >> >> >
>> >> >> >> > On this AMD CPU, reducing vm.pcp_batch_scale_max significantly helps reduce
>> >> >> >> > latency. Latency-sensitive applications will benefit from this tuning.
>> >> >> >> >
>> >> >> >> > However, I don't have access to other types of AMD CPUs, so I was unable to
>> >> >> >> > test it on different AMD models.
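
[A minimal, hypothetical sketch of the measurement setup quoted above: it
combines the mmap reproducer with the "run this script 10 times in
parallel" step by forking ten worker processes, while funclatency is
attached to rmqueue_bulk() from a separate shell. The worker count, file
name, and helper names are illustrative, not taken from the thread.]

    # reproduce.py: drive the quoted reproducer from a single file.
    # Measure separately with: funclatency -T -i 600 rmqueue_bulk
    import mmap
    import multiprocessing

    SIZE = 6 * 1024**3      # 6 GiB, as in the quoted script
    NUM_WORKERS = 10        # "10 times in parallel", per the cover letter

    def thrash():
        # Repeatedly fault in and release 6 GiB of anonymous memory; the
        # page faults and the frees at close() are what exercise
        # rmqueue_bulk() and the PCP batch paths behind zone->lock.
        while True:
            mm = mmap.mmap(-1, SIZE)
            mm[:] = b'\xff' * SIZE
            mm.close()

    if __name__ == "__main__":
        workers = [multiprocessing.Process(target=thrash)
                   for _ in range(NUM_WORKERS)]
        for w in workers:
            w.start()
        for w in workers:
            w.join()        # runs until interrupted, like the original loop
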
>> >> >> >> >
>> >> >> >> > Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz, two NUMA nodes
>> >> >> >> > ============================================================
>> >> >> >> >
>> >> >> >> > - Default value of 5
>> >> >> >> >
>> >> >> >> >      nsecs               : count     distribution
>> >> >> >> >          0 -> 1          : 0        |                                        |
>> >> >> >> >          2 -> 3          : 0        |                                        |
>> >> >> >> >          4 -> 7          : 0        |                                        |
>> >> >> >> >          8 -> 15         : 0        |                                        |
>> >> >> >> >         16 -> 31         : 0        |                                        |
>> >> >> >> >         32 -> 63         : 0        |                                        |
>> >> >> >> >         64 -> 127        : 0        |                                        |
>> >> >> >> >        128 -> 255        : 0        |                                        |
>> >> >> >> >        256 -> 511        : 0        |                                        |
>> >> >> >> >        512 -> 1023       : 2419     |                                        |
>> >> >> >> >       1024 -> 2047       : 34499    |*                                       |
>> >> >> >> >       2048 -> 4095       : 4272     |                                        |
>> >> >> >> >       4096 -> 8191       : 9035     |                                        |
>> >> >> >> >       8192 -> 16383      : 4374     |                                        |
>> >> >> >> >      16384 -> 32767      : 2963     |                                        |
>> >> >> >> >      32768 -> 65535      : 6407     |                                        |
>> >> >> >> >      65536 -> 131071     : 884806   |****************************************|
>> >> >> >> >     131072 -> 262143     : 145931   |******                                  |
>> >> >> >> >     262144 -> 524287     : 13406    |                                        |
>> >> >> >> >     524288 -> 1048575    : 1874     |                                        |
>> >> >> >> >    1048576 -> 2097151    : 249      |                                        |
>> >> >> >> >    2097152 -> 4194303    : 28       |                                        |
>> >> >> >> >
>> >> >> >> > avg = 96173 nsecs, total: 106778157925 nsecs, count: 1110263
>> >> >> >> >
>> >> >> >> > - Conclusion
>> >> >> >> >
>> >> >> >> > This Intel CPU works fine with the default setting.
>> >> >> >> >
>> >> >> >> > Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz, single NUMA node
>> >> >> >> > ==============================================================
>> >> >> >> >
>> >> >> >> > Using the cpuset cgroup, we can restrict the test script to run on NUMA
>> >> >> >> > node 0 only.
>> >> >> >> >
>> >> >> >> > - Default value of 5
>> >> >> >> >
>> >> >> >> >      nsecs               : count     distribution
>> >> >> >> >          0 -> 1          : 0        |                                        |
>> >> >> >> >          2 -> 3          : 0        |                                        |
>> >> >> >> >          4 -> 7          : 0        |                                        |
>> >> >> >> >          8 -> 15         : 0        |                                        |
>> >> >> >> >         16 -> 31         : 0        |                                        |
>> >> >> >> >         32 -> 63         : 0        |                                        |
>> >> >> >> >         64 -> 127        : 0        |                                        |
>> >> >> >> >        128 -> 255        : 0        |                                        |
>> >> >> >> >        256 -> 511        : 46       |                                        |
>> >> >> >> >        512 -> 1023       : 695      |                                        |
>> >> >> >> >       1024 -> 2047       : 19950    |*                                       |
>> >> >> >> >       2048 -> 4095       : 1788     |                                        |
>> >> >> >> >       4096 -> 8191       : 3392     |                                        |
>> >> >> >> >       8192 -> 16383      : 2569     |                                        |
>> >> >> >> >      16384 -> 32767      : 2619     |                                        |
>> >> >> >> >      32768 -> 65535      : 3809     |                                        |
>> >> >> >> >      65536 -> 131071     : 616182   |****************************************|
>> >> >> >> >     131072 -> 262143     : 295587   |*******************                     |
>> >> >> >> >     262144 -> 524287     : 75357    |****                                    |
>> >> >> >> >     524288 -> 1048575    : 15471    |*                                       |
>> >> >> >> >    1048576 -> 2097151    : 2939     |                                        |
>> >> >> >> >    2097152 -> 4194303    : 243      |                                        |
>> >> >> >> >    4194304 -> 8388607    : 3        |                                        |
>> >> >> >> >
>> >> >> >> > avg = 144410 nsecs, total: 150281196195 nsecs, count: 1040651
>> >> >> >> >
>> >> >> >> > The zone->lock contention becomes severe when there is only a single NUMA
>> >> >> >> > node. The average latency is approximately 144us, with the maximum
>> >> >> >> > latency exceeding 4ms.
>> >> >> >> >
>> >> >> >> > - Value set to 0
>> >> >> >> >
>> >> >> >> >      nsecs               : count     distribution
>> >> >> >> >          0 -> 1          : 0        |                                        |
>> >> >> >> >          2 -> 3          : 0        |                                        |
>> >> >> >> >          4 -> 7          : 0        |                                        |
>> >> >> >> >          8 -> 15         : 0        |                                        |
>> >> >> >> >         16 -> 31         : 0        |                                        |
>> >> >> >> >         32 -> 63         : 0        |                                        |
>> >> >> >> >         64 -> 127        : 0        |                                        |
>> >> >> >> >        128 -> 255        : 0        |                                        |
>> >> >> >> >        256 -> 511        : 24       |                                        |
>> >> >> >> >        512 -> 1023       : 2686     |                                        |
>> >> >> >> >       1024 -> 2047       : 10246    |                                        |
>> >> >> >> >       2048 -> 4095       : 4061529  |*********                               |
>> >> >> >> >       4096 -> 8191       : 16894971 |****************************************|
>> >> >> >> >       8192 -> 16383      : 6279310  |**************                          |
>> >> >> >> >      16384 -> 32767      : 1658240  |***                                     |
>> >> >> >> >      32768 -> 65535      : 445760   |*                                       |
>> >> >> >> >      65536 -> 131071     : 110817   |                                        |
>> >> >> >> >     131072 -> 262143     : 20279    |                                        |
>> >> >> >> >     262144 -> 524287     : 4176     |                                        |
>> >> >> >> >     524288 -> 1048575    : 436      |                                        |
>> >> >> >> >    1048576 -> 2097151    : 8        |                                        |
>> >> >> >> >    2097152 -> 4194303    : 2        |                                        |
>> >> >> >> >
>> >> >> >> > avg = 8401 nsecs, total: 247739809022 nsecs, count: 29488508
>> >> >> >> >
>> >> >> >> > After setting it to 0, the avg latency is reduced to around 8us, and the
>> >> >> >> > max latency is less than 4ms.
>> >> >> >> >
>> >> >> >> > - Conclusion
>> >> >> >> >
>> >> >> >> > On this Intel CPU, this tuning doesn't help much. Latency-sensitive
>> >> >> >> > applications work well with the default setting.
>> >> >> >> >
>> >> >> >> > It is worth noting that all the above data were collected using the
>> >> >> >> > upstream kernel.
>> >> >> >> >
>> >> >> >> > Why introduce a sysctl knob?
>> >> >> >> > ============================
>> >> >> >> >
>> >> >> >> > From the above data, it's clear that different CPU types have varying
>> >> >> >> > allocation latencies with respect to zone->lock contention. Typically,
>> >> >> >> > people don't release individual kernel packages for each type of x86_64
>> >> >> >> > CPU.
>> >> >> >> >
>> >> >> >> > Furthermore, for latency-insensitive applications, we can keep the default
>> >> >> >> > setting for better throughput. In our production environment, we set this
>> >> >> >> > value to 0 for applications running on Kubernetes servers while keeping it
>> >> >> >> > at the default value of 5 for other applications like big data. It's not
>> >> >> >> > common to release individual kernel packages for each application.
>> >> >> >>
>> >> >> >> Thanks for the detailed performance data!
>> >> >> >>
>> >> >> >> Have you observed any downside to setting CONFIG_PCP_BATCH_SCALE_MAX to 0
>> >> >> >> in your environment?  If not, I suggest using 0 as the default for
>> >> >> >> CONFIG_PCP_BATCH_SCALE_MAX, because we have clear evidence that a large
>> >> >> >> CONFIG_PCP_BATCH_SCALE_MAX hurts latency for some workloads.  After that,
>> >> >> >> if someone finds that some other workload needs a larger
>> >> >> >> CONFIG_PCP_BATCH_SCALE_MAX, we can make it tunable dynamically.
>> >> >> >>
>> >> >> >
>> >> >> > The decision doesn't rest with us, the kernel team at our company.
>> >> >> > It's made by the system administrators who manage a large number of
>> >> >> > servers. The latency spikes only occur on the Kubernetes (k8s)
>> >> >> > servers, not in other environments like big data servers. We have
>> >> >> > informed other system administrators, such as those managing the big
>> >> >> > data servers, about the latency spike issues, but they are unwilling
>> >> >> > to make the change.
>> >> >> >
>> >> >> > No one wants to make changes unless there is evidence showing that the
>> >> >> > old settings will negatively impact them.
>> >> >> > However, as you know, latency is not a critical concern for big data;
>> >> >> > throughput is more important. If we keep the current settings, we will
>> >> >> > have to release different kernel packages for different environments,
>> >> >> > which is a significant burden for us.
>> >> >>
>> >> >> Totally understand your requirements.  And, I think that this is better
>> >> >> resolved in your downstream kernel.  If there is clear evidence that a
>> >> >> small batch number hurts throughput for some workloads, we can make the
>> >> >> change in the upstream kernel.
>> >> >>
>> >> >
>> >> > Please don't make this more complicated. We are at an impasse.
>> >> >
>> >> > The key issue here is that the upstream kernel has a default value of
>> >> > 5, not 0. If you can change it to 0, we can persuade our users to
>> >> > follow the upstream changes. They currently set it to 5, not because
>> >> > you, the author, chose this value, but because it is the default in
>> >> > Linus's tree. Since it's in Linus's tree, kernel developers worldwide
>> >> > support it. It's not just your decision as the author; the entire
>> >> > community stands behind this default.
>> >> >
>> >> > If, in the future, we find that the value of 0 is not suitable, you'll
>> >> > tell us, "It is an issue in your downstream kernel, not in the
>> >> > upstream kernel, so we won't accept it." PANIC.
>> >>
>> >> I don't think so.  I suggest you change the default value to 0.  If
>> >> someone reports that their workload needs some other value, then we
>> >> have evidence that different workloads need different values.  At that
>> >> time, we can suggest adding a user-tunable knob.
>> >>
>> >
>> > The problem is that others are unaware we've set it to 0, and I can't
>> > constantly monitor the linux-mm mailing list. Additionally, it's
>> > possible that you can't always keep an eye on it either.
>>
>> IIUC, they will use the default value.  Then, if there is any
>> performance regression, they can report it.
> Now we report it. What is your reply? "Keep it in your downstream
> kernel." Wow, PANIC again.

That is not all of my reply.  I also suggested that you change the default
value.

>
>>
>> > I believe we should hear Andrew's suggestion. Andrew, what is your opinion?
>>

--
Best Regards,
Huang, Ying
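
[For illustration only: a minimal sketch that assumes the
vm.pcp_batch_scale_max sysctl proposed in this series is present on the
running kernel (an unpatched upstream kernel does not have it). It applies
the per-workload policy described in the thread, 0 on latency-sensitive
Kubernetes hosts and the default 5 elsewhere. The /proc path is inferred
from the sysctl name, and the file and helper names are hypothetical.]

    # set_pcp_batch_scale.py: apply a per-host policy for the proposed knob.
    from pathlib import Path

    # Path inferred from the sysctl name vm.pcp_batch_scale_max; it only
    # exists on kernels carrying this patch series.
    KNOB = Path("/proc/sys/vm/pcp_batch_scale_max")

    def apply_policy(latency_sensitive):
        # 0 for latency-sensitive (e.g. Kubernetes) hosts, 5 (the current
        # default) for throughput-oriented hosts such as big data.
        value = "0" if latency_sensitive else "5"
        if KNOB.exists():                 # writing requires root
            KNOB.write_text(value + "\n")
        else:
            print("pcp_batch_scale_max sysctl not available on this kernel")

    if __name__ == "__main__":
        apply_policy(latency_sensitive=True)
        if KNOB.exists():
            print("pcp_batch_scale_max =", KNOB.read_text().strip())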