Re: [PATCH v2 3/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max

Yafang Shao <laoar.shao@xxxxxxxxx> · Mon, 29 Jul 2024 14:13:07 +0800

On Mon, Jul 29, 2024 at 2:04 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
>
> Yafang Shao <laoar.shao@xxxxxxxxx> writes:
>
> > On Mon, Jul 29, 2024 at 1:54 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
> >>
> >> Yafang Shao <laoar.shao@xxxxxxxxx> writes:
> >>
> >> > On Mon, Jul 29, 2024 at 1:16 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
> >> >>
> >> >> Yafang Shao <laoar.shao@xxxxxxxxx> writes:
> >> >>
> >> >> > On Mon, Jul 29, 2024 at 11:22 AM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
> >> >> >>
> >> >> >> Hi, Yafang,
> >> >> >>
> >> >> >> Yafang Shao <laoar.shao@xxxxxxxxx> writes:
> >> >> >>
> >> >> >> > During my recent work to resolve latency spikes caused by zone->lock
> >> >> >> > contention[0], I found that CONFIG_PCP_BATCH_SCALE_MAX is difficult to use
> >> >> >> > in practice.
> >> >> >>
> >> >> >> As we discussed before [1], I still find the description of zone->lock
> >> >> >> contention confusing.  How about changing the description to something
> >> >> >> like,
> >> >> >
> >> >> > Sure, I will change it.
> >> >> >
> >> >> >>
> >> >> >> Larger page allocation/freeing batch number may cause longer run time of
> >> >> >> code holding zone->lock.  If zone->lock is heavily contended at the same
> >> >> >> time, latency spikes may occur even for casual page allocation/freeing.
> >> >> >> Although reducing the batch number cannot make zone->lock less
> >> >> >> contended, it can reduce the latency spikes effectively.
> >> >> >>
> >> >> >> [1] https://lore.kernel.org/linux-mm/87ttgv8hlz.fsf@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/
> >> >> >>
> >> >> >> > To demonstrate this, I wrote a Python script:
> >> >> >> >
> >> >> >> >   import mmap
> >> >> >> >
> >> >> >> >   size = 6 * 1024**3
> >> >> >> >
> >> >> >> >   while True:
> >> >> >> >       mm = mmap.mmap(-1, size)
> >> >> >> >       mm[:] = b'\xff' * size
> >> >> >> >       mm.close()
> >> >> >> >
> >> >> >> > Run this script 10 times in parallel and measure the allocation latency by
> >> >> >> > measuring the duration of rmqueue_bulk() with the BCC tools
> >> >> >> > funclatency[1]:
> >> >> >> >
> >> >> >> >   funclatency -T -i 600 rmqueue_bulk
> >> >> >> >
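For anyone who wants to reproduce the parallel run, a launcher along these
lines should work. This is only a sketch: launch_parallel(), stop_all() and
run_experiment() are hypothetical helper names, and "alloc_test.py" is an
assumed file name for the mmap script quoted above; none of them come from
the patch.

```python
import subprocess
import sys


def launch_parallel(cmd, copies=10):
    """Start `copies` instances of `cmd` and return their Popen handles."""
    return [subprocess.Popen(cmd) for _ in range(copies)]


def stop_all(procs):
    """Terminate every process that is still running, then reap them all."""
    for proc in procs:
        if proc.poll() is None:
            proc.terminate()
    for proc in procs:
        proc.wait()


def run_experiment(script="alloc_test.py"):
    """Run 10 allocators while funclatency traces rmqueue_bulk().

    `script` is an assumed file name. funclatency runs until it is
    interrupted (Ctrl-C); the workers are stopped afterwards.
    """
    workers = launch_parallel([sys.executable, script])
    try:
        subprocess.run(["funclatency", "-T", "-i", "600", "rmqueue_bulk"])
    except KeyboardInterrupt:
        pass
    finally:
        stop_all(workers)
```

Calling run_experiment() as root on a machine with the BCC tools installed
should give histograms in the same format as the ones below.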
> >> >> >> > Here are the results for both AMD and Intel CPUs.
> >> >> >> >
> >> >> >> > AMD EPYC 7W83 64-Core Processor, single NUMA node, KVM virtual server
> >> >> >> > =====================================================================
> >> >> >> >
> >> >> >> > - Default value of 5
> >> >> >> >
> >> >> >> >      nsecs               : count     distribution
> >> >> >> >          0 -> 1          : 0        |                                        |
> >> >> >> >          2 -> 3          : 0        |                                        |
> >> >> >> >          4 -> 7          : 0        |                                        |
> >> >> >> >          8 -> 15         : 0        |                                        |
> >> >> >> >         16 -> 31         : 0        |                                        |
> >> >> >> >         32 -> 63         : 0        |                                        |
> >> >> >> >         64 -> 127        : 0        |                                        |
> >> >> >> >        128 -> 255        : 0        |                                        |
> >> >> >> >        256 -> 511        : 0        |                                        |
> >> >> >> >        512 -> 1023       : 12       |                                        |
> >> >> >> >       1024 -> 2047       : 9116     |                                        |
> >> >> >> >       2048 -> 4095       : 2004     |                                        |
> >> >> >> >       4096 -> 8191       : 2497     |                                        |
> >> >> >> >       8192 -> 16383      : 2127     |                                        |
> >> >> >> >      16384 -> 32767      : 2483     |                                        |
> >> >> >> >      32768 -> 65535      : 10102    |                                        |
> >> >> >> >      65536 -> 131071     : 212730   |*******************                     |
> >> >> >> >     131072 -> 262143     : 314692   |*****************************           |
> >> >> >> >     262144 -> 524287     : 430058   |****************************************|
> >> >> >> >     524288 -> 1048575    : 224032   |********************                    |
> >> >> >> >    1048576 -> 2097151    : 73567    |******                                  |
> >> >> >> >    2097152 -> 4194303    : 17079    |*                                       |
> >> >> >> >    4194304 -> 8388607    : 3900     |                                        |
> >> >> >> >    8388608 -> 16777215   : 750      |                                        |
> >> >> >> >   16777216 -> 33554431   : 88       |                                        |
> >> >> >> >   33554432 -> 67108863   : 2        |                                        |
> >> >> >> >
> >> >> >> > avg = 449775 nsecs, total: 587066511229 nsecs, count: 1305242
> >> >> >> >
> >> >> >> > The avg alloc latency can be 449us, and the max latency can be higher
> >> >> >> > than 30ms.
> >> >> >> >
> >> >> >> > - Value set to 0
> >> >> >> >
> >> >> >> >      nsecs               : count     distribution
> >> >> >> >          0 -> 1          : 0        |                                        |
> >> >> >> >          2 -> 3          : 0        |                                        |
> >> >> >> >          4 -> 7          : 0        |                                        |
> >> >> >> >          8 -> 15         : 0        |                                        |
> >> >> >> >         16 -> 31         : 0        |                                        |
> >> >> >> >         32 -> 63         : 0        |                                        |
> >> >> >> >         64 -> 127        : 0        |                                        |
> >> >> >> >        128 -> 255        : 0        |                                        |
> >> >> >> >        256 -> 511        : 0        |                                        |
> >> >> >> >        512 -> 1023       : 92       |                                        |
> >> >> >> >       1024 -> 2047       : 8594     |                                        |
> >> >> >> >       2048 -> 4095       : 2042818  |******                                  |
> >> >> >> >       4096 -> 8191       : 8737624  |**************************              |
> >> >> >> >       8192 -> 16383      : 13147872 |****************************************|
> >> >> >> >      16384 -> 32767      : 8799951  |**************************              |
> >> >> >> >      32768 -> 65535      : 2879715  |********                                |
> >> >> >> >      65536 -> 131071     : 659600   |**                                      |
> >> >> >> >     131072 -> 262143     : 204004   |                                        |
> >> >> >> >     262144 -> 524287     : 78246    |                                        |
> >> >> >> >     524288 -> 1048575    : 30800    |                                        |
> >> >> >> >    1048576 -> 2097151    : 12251    |                                        |
> >> >> >> >    2097152 -> 4194303    : 2950     |                                        |
> >> >> >> >    4194304 -> 8388607    : 78       |                                        |
> >> >> >> >
> >> >> >> > avg = 19359 nsecs, total: 708638369918 nsecs, count: 36604636
> >> >> >> >
> >> >> >> > The avg was reduced significantly to 19us, and the max latency is reduced
> >> >> >> > to less than 8ms.
> >> >> >> >
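As a sanity check on the two AMD summary lines, the reported averages follow
from total / count. The helper name below is made up for illustration; the
numbers are copied from the funclatency output above, and integer division is
an assumption about how the tool rounds.

```python
def avg_nsecs(total_nsecs, count):
    """Integer average latency per rmqueue_bulk() call, in nanoseconds."""
    return total_nsecs // count


# Totals and counts taken from the two AMD runs above.
amd_default = avg_nsecs(587066511229, 1305242)  # batch scale max = 5
amd_zero = avg_nsecs(708638369918, 36604636)    # batch scale max = 0

print(amd_default, amd_zero)                    # 449775 19359
print(f"ratio: {amd_default / amd_zero:.1f}x")  # ratio: 23.2x
```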
> >> >> >> > - Conclusion
> >> >> >> >
> >> >> >> > On this AMD CPU, reducing vm.pcp_batch_scale_max significantly helps reduce
> >> >> >> > latency. Latency-sensitive applications will benefit from this tuning.
> >> >> >> >
> >> >> >> > However, I don't have access to other types of AMD CPUs, so I was unable to
> >> >> >> > test it on different AMD models.
> >> >> >> >
> >> >> >> > Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz, two NUMA nodes
> >> >> >> > ============================================================
> >> >> >> >
> >> >> >> > - Default value of 5
> >> >> >> >
> >> >> >> >      nsecs               : count     distribution
> >> >> >> >          0 -> 1          : 0        |                                        |
> >> >> >> >          2 -> 3          : 0        |                                        |
> >> >> >> >          4 -> 7          : 0        |                                        |
> >> >> >> >          8 -> 15         : 0        |                                        |
> >> >> >> >         16 -> 31         : 0        |                                        |
> >> >> >> >         32 -> 63         : 0        |                                        |
> >> >> >> >         64 -> 127        : 0        |                                        |
> >> >> >> >        128 -> 255        : 0        |                                        |
> >> >> >> >        256 -> 511        : 0        |                                        |
> >> >> >> >        512 -> 1023       : 2419     |                                        |
> >> >> >> >       1024 -> 2047       : 34499    |*                                       |
> >> >> >> >       2048 -> 4095       : 4272     |                                        |
> >> >> >> >       4096 -> 8191       : 9035     |                                        |
> >> >> >> >       8192 -> 16383      : 4374     |                                        |
> >> >> >> >      16384 -> 32767      : 2963     |                                        |
> >> >> >> >      32768 -> 65535      : 6407     |                                        |
> >> >> >> >      65536 -> 131071     : 884806   |****************************************|
> >> >> >> >     131072 -> 262143     : 145931   |******                                  |
> >> >> >> >     262144 -> 524287     : 13406    |                                        |
> >> >> >> >     524288 -> 1048575    : 1874     |                                        |
> >> >> >> >    1048576 -> 2097151    : 249      |                                        |
> >> >> >> >    2097152 -> 4194303    : 28       |                                        |
> >> >> >> >
> >> >> >> > avg = 96173 nsecs, total: 106778157925 nsecs, count: 1110263
> >> >> >> >
> >> >> >> > - Conclusion
> >> >> >> >
> >> >> >> > This Intel CPU works fine with the default setting.
> >> >> >> >
> >> >> >> > Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz, single NUMA node
> >> >> >> > ==============================================================
> >> >> >> >
> >> >> >> > Using the cpuset cgroup, we can restrict the test script to run on NUMA
> >> >> >> > node 0 only.
> >> >> >> >
> >> >> >> > - Default value of 5
> >> >> >> >
> >> >> >> >      nsecs               : count     distribution
> >> >> >> >          0 -> 1          : 0        |                                        |
> >> >> >> >          2 -> 3          : 0        |                                        |
> >> >> >> >          4 -> 7          : 0        |                                        |
> >> >> >> >          8 -> 15         : 0        |                                        |
> >> >> >> >         16 -> 31         : 0        |                                        |
> >> >> >> >         32 -> 63         : 0        |                                        |
> >> >> >> >         64 -> 127        : 0        |                                        |
> >> >> >> >        128 -> 255        : 0        |                                        |
> >> >> >> >        256 -> 511        : 46       |                                        |
> >> >> >> >        512 -> 1023       : 695      |                                        |
> >> >> >> >       1024 -> 2047       : 19950    |*                                       |
> >> >> >> >       2048 -> 4095       : 1788     |                                        |
> >> >> >> >       4096 -> 8191       : 3392     |                                        |
> >> >> >> >       8192 -> 16383      : 2569     |                                        |
> >> >> >> >      16384 -> 32767      : 2619     |                                        |
> >> >> >> >      32768 -> 65535      : 3809     |                                        |
> >> >> >> >      65536 -> 131071     : 616182   |****************************************|
> >> >> >> >     131072 -> 262143     : 295587   |*******************                     |
> >> >> >> >     262144 -> 524287     : 75357    |****                                    |
> >> >> >> >     524288 -> 1048575    : 15471    |*                                       |
> >> >> >> >    1048576 -> 2097151    : 2939     |                                        |
> >> >> >> >    2097152 -> 4194303    : 243      |                                        |
> >> >> >> >    4194304 -> 8388607    : 3        |                                        |
> >> >> >> >
> >> >> >> > avg = 144410 nsecs, total: 150281196195 nsecs, count: 1040651
> >> >> >> >
> >> >> >> > The zone->lock contention becomes severe when there is only a single NUMA
> >> >> >> > node. The average latency is approximately 144us, with the maximum
> >> >> >> > latency exceeding 4ms.
> >> >> >> >
> >> >> >> > - Value set to 0
> >> >> >> >
> >> >> >> >      nsecs               : count     distribution
> >> >> >> >          0 -> 1          : 0        |                                        |
> >> >> >> >          2 -> 3          : 0        |                                        |
> >> >> >> >          4 -> 7          : 0        |                                        |
> >> >> >> >          8 -> 15         : 0        |                                        |
> >> >> >> >         16 -> 31         : 0        |                                        |
> >> >> >> >         32 -> 63         : 0        |                                        |
> >> >> >> >         64 -> 127        : 0        |                                        |
> >> >> >> >        128 -> 255        : 0        |                                        |
> >> >> >> >        256 -> 511        : 24       |                                        |
> >> >> >> >        512 -> 1023       : 2686     |                                        |
> >> >> >> >       1024 -> 2047       : 10246    |                                        |
> >> >> >> >       2048 -> 4095       : 4061529  |*********                               |
> >> >> >> >       4096 -> 8191       : 16894971 |****************************************|
> >> >> >> >       8192 -> 16383      : 6279310  |**************                          |
> >> >> >> >      16384 -> 32767      : 1658240  |***                                     |
> >> >> >> >      32768 -> 65535      : 445760   |*                                       |
> >> >> >> >      65536 -> 131071     : 110817   |                                        |
> >> >> >> >     131072 -> 262143     : 20279    |                                        |
> >> >> >> >     262144 -> 524287     : 4176     |                                        |
> >> >> >> >     524288 -> 1048575    : 436      |                                        |
> >> >> >> >    1048576 -> 2097151    : 8        |                                        |
> >> >> >> >    2097152 -> 4194303    : 2        |                                        |
> >> >> >> >
> >> >> >> > avg = 8401 nsecs, total: 247739809022 nsecs, count: 29488508
> >> >> >> >
> >> >> >> > After setting it to 0, the avg latency is reduced to around 8us, and the
> >> >> >> > max latency is less than 4ms.
> >> >> >> >
> >> >> >> > - Conclusion
> >> >> >> >
> >> >> >> > On this Intel CPU, this tuning doesn't help much. Latency-sensitive
> >> >> >> > applications work well with the default setting.
> >> >> >> >
> >> >> >> > It is worth noting that all the above data were tested using the upstream
> >> >> >> > kernel.
> >> >> >> >
> >> >> >> > Why introduce a sysctl knob?
> >> >> >> > ============================
> >> >> >> >
> >> >> >> > From the above data, it's clear that different CPU types have varying
> >> >> >> > allocation latencies concerning zone->lock contention. Typically, people
> >> >> >> > don't release individual kernel packages for each type of x86_64 CPU.
> >> >> >> >
> >> >> >> > Furthermore, for latency-insensitive applications, we can keep the default
> >> >> >> > setting for better throughput. In our production environment, we set this
> >> >> >> > value to 0 for applications running on Kubernetes servers while keeping it
> >> >> >> > at the default value of 5 for other applications like big data. It's not
> >> >> >> > common to release individual kernel packages for each application.
> >> >> >>
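With this patch applied, switching the value per server could look roughly
like the sketch below. It assumes the knob shows up at the usual procfs
location for a vm sysctl; the helper names are hypothetical, and writing the
value requires root.

```python
from pathlib import Path


def sysctl_path(name, root="/proc/sys"):
    """Map a dotted sysctl name to its path under procfs."""
    return Path(root, *name.split("."))


def read_sysctl(name, root="/proc/sys"):
    """Return the current value of a sysctl as a stripped string."""
    return sysctl_path(name, root).read_text().strip()


def write_sysctl(name, value, root="/proc/sys"):
    """Write a new value to a sysctl (needs root on a real system)."""
    sysctl_path(name, root).write_text(f"{value}\n")


# Example for a latency-sensitive server (only valid with the patch applied):
#   write_sysctl("vm.pcp_batch_scale_max", 0)
```

The `root` parameter exists only so the helpers can be exercised against a
scratch directory instead of the live /proc/sys.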
> >> >> >> Thanks for detailed performance data!
> >> >> >>
> >> >> >> Have you observed any downside to setting CONFIG_PCP_BATCH_SCALE_MAX
> >> >> >> to 0 in your environment?  If not, I suggest using 0 as the default
> >> >> >> for CONFIG_PCP_BATCH_SCALE_MAX, because we have clear evidence that
> >> >> >> CONFIG_PCP_BATCH_SCALE_MAX hurts latency for some workloads.  After
> >> >> >> that, if someone finds that other workloads need a larger
> >> >> >> CONFIG_PCP_BATCH_SCALE_MAX, we can make it dynamically tunable.
> >> >> >>
> >> >> >
> >> >> > The decision doesn’t rest with us, the kernel team at our company.
> >> >> > It’s made by the system administrators who manage a large number of
> >> >> > servers. The latency spikes only occur on the Kubernetes (k8s)
> >> >> > servers, not in other environments like big data servers. We have
> >> >> > informed other system administrators, such as those managing the big
> >> >> > data servers, about the latency spike issues, but they are unwilling
> >> >> > to make the change.
> >> >> >
> >> >> > No one wants to make changes unless there is evidence showing that the
> >> >> > old settings will negatively impact them. However, as you know,
> >> >> > latency is not a critical concern for big data; throughput is more
> >> >> > important. If we keep the current settings, we will have to release
> >> >> > different kernel packages for different environments, which is a
> >> >> > significant burden for us.
> >> >>
> >> >> I totally understand your requirements.  And, I think that this is
> >> >> better resolved in your downstream kernel.  If there is clear evidence
> >> >> that a small batch number hurts throughput for some workloads, we can
> >> >> make the change in the upstream kernel.
> >> >>
> >> >
> >> > Please don't make this more complicated. We are at an impasse.
> >> >
> >> > The key issue here is that the upstream kernel has a default value of
> >> > 5, not 0. If you can change it to 0, we can persuade our users to
> >> > follow the upstream changes. They currently set it to 5, not because
> >> > you, the author, chose this value, but because it is the default in
> >> > Linus's tree. Since it's in Linus's tree, kernel developers worldwide
> >> > support it. It's not just your decision as the author, but the entire
> >> > community supports this default.
> >> >
> >> > If, in the future, we find that the value of 0 is not suitable, you'll
> >> > tell us, "It is an issue in your downstream kernel, not in the
> >> > upstream kernel, so we won't accept it."  PANIC.
> >>
> >> I don't think so.  I suggest that you change the default value to 0.  If
> >> someone reports that their workloads need some other value, then we have
> >> evidence that different workloads need different values.  At that time,
> >> we can suggest adding a user-tunable knob.
> >>
> >
> > The problem is that others are unaware we've set it to 0, and I can't
> > constantly monitor the linux-mm mailing list. Additionally, it's
> > possible that you can't always keep an eye on it either.
>
> IIUC, they will use the default value.  Then, if there is any
> performance regression, they can report it.

We are reporting it now. And what is your reply? "Keep it in your
downstream kernel." Wow, PANIC again.

>
> > I believe we should hear Andrew's suggestion. Andrew, what is your opinion?
>
> --
> Best Regards,
> Huang, Ying

--
Regards
Yafang