On Mon, Jul 29, 2024 at 2:04 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
>
> Yafang Shao <laoar.shao@xxxxxxxxx> writes:
>
> > On Mon, Jul 29, 2024 at 1:54 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
> >>
> >> Yafang Shao <laoar.shao@xxxxxxxxx> writes:
> >>
> >> > On Mon, Jul 29, 2024 at 1:16 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
> >> >>
> >> >> Yafang Shao <laoar.shao@xxxxxxxxx> writes:
> >> >>
> >> >> > On Mon, Jul 29, 2024 at 11:22 AM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
> >> >> >>
> >> >> >> Hi, Yafang,
> >> >> >>
> >> >> >> Yafang Shao <laoar.shao@xxxxxxxxx> writes:
> >> >> >>
> >> >> >> > During my recent work to resolve latency spikes caused by zone->lock
> >> >> >> > contention[0], I found that CONFIG_PCP_BATCH_SCALE_MAX is difficult to use
> >> >> >> > in practice.
> >> >> >>
> >> >> >> As we discussed before [1], I still feel confused by the description
> >> >> >> of zone->lock contention.  How about changing the description to
> >> >> >> something like,
> >> >> >
> >> >> > Sure, I will change it.
> >> >> >
> >> >> >>
> >> >> >> Larger page allocation/freeing batch number may cause longer run time of
> >> >> >> code holding zone->lock.  If zone->lock is heavily contended at the same
> >> >> >> time, latency spikes may occur even for casual page allocation/freeing.
> >> >> >> Although reducing the batch number cannot make zone->lock contention
> >> >> >> lighter, it can reduce the latency spikes effectively.
> >> >> >>
> >> >> >> [1] https://lore.kernel.org/linux-mm/87ttgv8hlz.fsf@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/
> >> >> >>
> >> >> >> > To demonstrate this, I wrote a Python script:
> >> >> >> >
> >> >> >> >   import mmap
> >> >> >> >
> >> >> >> >   size = 6 * 1024**3
> >> >> >> >
> >> >> >> >   while True:
> >> >> >> >       mm = mmap.mmap(-1, size)
> >> >> >> >       mm[:] = b'\xff' * size
> >> >> >> >       mm.close()
> >> >> >> >
> >> >> >> > Run this script 10 times in parallel and measure the allocation latency by
> >> >> >> > measuring the duration of rmqueue_bulk() with the BCC tools
> >> >> >> > funclatency[1]:
> >> >> >> >
> >> >> >> >   funclatency -T -i 600 rmqueue_bulk
> >> >> >> >
> >> >> >> > Here are the results for both AMD and Intel CPUs.
> >> >> >> >
> >> >> >> > AMD EPYC 7W83 64-Core Processor, single NUMA node, KVM virtual server
> >> >> >> > =====================================================================
> >> >> >> >
> >> >> >> > - Default value of 5
> >> >> >> >
> >> >> >> >      nsecs               : count     distribution
> >> >> >> >          0 -> 1          : 0        |                                        |
> >> >> >> >          2 -> 3          : 0        |                                        |
> >> >> >> >          4 -> 7          : 0        |                                        |
> >> >> >> >          8 -> 15         : 0        |                                        |
> >> >> >> >         16 -> 31         : 0        |                                        |
> >> >> >> >         32 -> 63         : 0        |                                        |
> >> >> >> >         64 -> 127        : 0        |                                        |
> >> >> >> >        128 -> 255        : 0        |                                        |
> >> >> >> >        256 -> 511        : 0        |                                        |
> >> >> >> >        512 -> 1023       : 12       |                                        |
> >> >> >> >       1024 -> 2047       : 9116     |                                        |
> >> >> >> >       2048 -> 4095       : 2004     |                                        |
> >> >> >> >       4096 -> 8191       : 2497     |                                        |
> >> >> >> >       8192 -> 16383      : 2127     |                                        |
> >> >> >> >      16384 -> 32767      : 2483     |                                        |
> >> >> >> >      32768 -> 65535      : 10102    |                                        |
> >> >> >> >      65536 -> 131071     : 212730   |*******************                     |
> >> >> >> >     131072 -> 262143     : 314692   |*****************************           |
> >> >> >> >     262144 -> 524287     : 430058   |****************************************|
> >> >> >> >     524288 -> 1048575    : 224032   |********************                    |
> >> >> >> >    1048576 -> 2097151    : 73567    |******                                  |
> >> >> >> >    2097152 -> 4194303    : 17079    |*                                       |
> >> >> >> >    4194304 -> 8388607    : 3900     |                                        |
> >> >> >> >    8388608 -> 16777215   : 750      |                                        |
> >> >> >> >   16777216 -> 33554431   : 88       |                                        |
> >> >> >> >   33554432 -> 67108863   : 2        |                                        |
> >> >> >> >
> >> >> >> > avg = 449775 nsecs, total: 587066511229 nsecs, count: 1305242
> >> >> >> >
> >> >> >> > The avg alloc latency can be 449us, and the max latency can be higher
> >> >> >> > than 30ms.
> >> >> >> >
> >> >> >> > - Value set to 0
> >> >> >> >
> >> >> >> >      nsecs               : count     distribution
> >> >> >> >          0 -> 1          : 0        |                                        |
> >> >> >> >          2 -> 3          : 0        |                                        |
> >> >> >> >          4 -> 7          : 0        |                                        |
> >> >> >> >          8 -> 15         : 0        |                                        |
> >> >> >> >         16 -> 31         : 0        |                                        |
> >> >> >> >         32 -> 63         : 0        |                                        |
> >> >> >> >         64 -> 127        : 0        |                                        |
> >> >> >> >        128 -> 255        : 0        |                                        |
> >> >> >> >        256 -> 511        : 0        |                                        |
> >> >> >> >        512 -> 1023       : 92       |                                        |
> >> >> >> >       1024 -> 2047       : 8594     |                                        |
> >> >> >> >       2048 -> 4095       : 2042818  |******                                  |
> >> >> >> >       4096 -> 8191       : 8737624  |**************************              |
> >> >> >> >       8192 -> 16383      : 13147872 |****************************************|
> >> >> >> >      16384 -> 32767      : 8799951  |**************************              |
> >> >> >> >      32768 -> 65535      : 2879715  |********                                |
> >> >> >> >      65536 -> 131071     : 659600   |**                                      |
> >> >> >> >     131072 -> 262143     : 204004   |                                        |
> >> >> >> >     262144 -> 524287     : 78246    |                                        |
> >> >> >> >     524288 -> 1048575    : 30800    |                                        |
> >> >> >> >    1048576 -> 2097151    : 12251    |                                        |
> >> >> >> >    2097152 -> 4194303    : 2950     |                                        |
> >> >> >> >    4194304 -> 8388607    : 78       |                                        |
> >> >> >> >
> >> >> >> > avg = 19359 nsecs, total: 708638369918 nsecs, count: 36604636
> >> >> >> >
> >> >> >> > The avg was reduced significantly to 19us, and the max latency is reduced
> >> >> >> > to less than 8ms.
> >> >> >> >
> >> >> >> > - Conclusion
> >> >> >> >
> >> >> >> > On this AMD CPU, reducing vm.pcp_batch_scale_max significantly helps reduce
> >> >> >> > latency. Latency-sensitive applications will benefit from this tuning.
> >> >> >> >
> >> >> >> > However, I don't have access to other types of AMD CPUs, so I was unable to
> >> >> >> > test it on different AMD models.
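As a quick sanity check on the AMD figures quoted above, the averages that funclatency prints can be recomputed from its total and count fields. This is only a sketch: the names below are made up for illustration, and the integer division assumes the tool truncates the average to whole nanoseconds, which is consistent with the numbers in this thread.

```python
# (total nsecs, call count) copied from the two AMD runs quoted above.
AMD_RUNS = {
    5: (587066511229, 1305242),   # default vm.pcp_batch_scale_max
    0: (708638369918, 36604636),  # knob set to 0
}


def avg_nsecs(total_nsecs: int, count: int) -> int:
    """Average rmqueue_bulk() duration, truncated to whole nanoseconds."""
    return total_nsecs // count


avg_default = avg_nsecs(*AMD_RUNS[5])
avg_zero = avg_nsecs(*AMD_RUNS[0])

# Per-call latency at the default versus at 0, and the ratio between them.
print(avg_default, avg_zero, round(avg_default / avg_zero, 1))
```

The call count rises from about 1.3M to 36.6M between the two runs, presumably because each rmqueue_bulk() call moves fewer pages when the batch is not scaled up, so the per-call average is the figure relevant to latency spikes.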
> >> >> >> >
> >> >> >> > Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz, two NUMA nodes
> >> >> >> > ============================================================
> >> >> >> >
> >> >> >> > - Default value of 5
> >> >> >> >
> >> >> >> >      nsecs               : count     distribution
> >> >> >> >          0 -> 1          : 0        |                                        |
> >> >> >> >          2 -> 3          : 0        |                                        |
> >> >> >> >          4 -> 7          : 0        |                                        |
> >> >> >> >          8 -> 15         : 0        |                                        |
> >> >> >> >         16 -> 31         : 0        |                                        |
> >> >> >> >         32 -> 63         : 0        |                                        |
> >> >> >> >         64 -> 127        : 0        |                                        |
> >> >> >> >        128 -> 255        : 0        |                                        |
> >> >> >> >        256 -> 511        : 0        |                                        |
> >> >> >> >        512 -> 1023       : 2419     |                                        |
> >> >> >> >       1024 -> 2047       : 34499    |*                                       |
> >> >> >> >       2048 -> 4095       : 4272     |                                        |
> >> >> >> >       4096 -> 8191       : 9035     |                                        |
> >> >> >> >       8192 -> 16383      : 4374     |                                        |
> >> >> >> >      16384 -> 32767      : 2963     |                                        |
> >> >> >> >      32768 -> 65535      : 6407     |                                        |
> >> >> >> >      65536 -> 131071     : 884806   |****************************************|
> >> >> >> >     131072 -> 262143     : 145931   |******                                  |
> >> >> >> >     262144 -> 524287     : 13406    |                                        |
> >> >> >> >     524288 -> 1048575    : 1874     |                                        |
> >> >> >> >    1048576 -> 2097151    : 249      |                                        |
> >> >> >> >    2097152 -> 4194303    : 28       |                                        |
> >> >> >> >
> >> >> >> > avg = 96173 nsecs, total: 106778157925 nsecs, count: 1110263
> >> >> >> >
> >> >> >> > - Conclusion
> >> >> >> >
> >> >> >> > This Intel CPU works fine with the default setting.
> >> >> >> >
> >> >> >> > Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz, single NUMA node
> >> >> >> > ==============================================================
> >> >> >> >
> >> >> >> > Using the cpuset cgroup, we can restrict the test script to run on NUMA
> >> >> >> > node 0 only.
> >> >> >> >
> >> >> >> > - Default value of 5
> >> >> >> >
> >> >> >> >      nsecs               : count     distribution
> >> >> >> >          0 -> 1          : 0        |                                        |
> >> >> >> >          2 -> 3          : 0        |                                        |
> >> >> >> >          4 -> 7          : 0        |                                        |
> >> >> >> >          8 -> 15         : 0        |                                        |
> >> >> >> >         16 -> 31         : 0        |                                        |
> >> >> >> >         32 -> 63         : 0        |                                        |
> >> >> >> >         64 -> 127        : 0        |                                        |
> >> >> >> >        128 -> 255        : 0        |                                        |
> >> >> >> >        256 -> 511        : 46       |                                        |
> >> >> >> >        512 -> 1023       : 695      |                                        |
> >> >> >> >       1024 -> 2047       : 19950    |*                                       |
> >> >> >> >       2048 -> 4095       : 1788     |                                        |
> >> >> >> >       4096 -> 8191       : 3392     |                                        |
> >> >> >> >       8192 -> 16383      : 2569     |                                        |
> >> >> >> >      16384 -> 32767      : 2619     |                                        |
> >> >> >> >      32768 -> 65535      : 3809     |                                        |
> >> >> >> >      65536 -> 131071     : 616182   |****************************************|
> >> >> >> >     131072 -> 262143     : 295587   |*******************                     |
> >> >> >> >     262144 -> 524287     : 75357    |****                                    |
> >> >> >> >     524288 -> 1048575    : 15471    |*                                       |
> >> >> >> >    1048576 -> 2097151    : 2939     |                                        |
> >> >> >> >    2097152 -> 4194303    : 243      |                                        |
> >> >> >> >    4194304 -> 8388607    : 3        |                                        |
> >> >> >> >
> >> >> >> > avg = 144410 nsecs, total: 150281196195 nsecs, count: 1040651
> >> >> >> >
> >> >> >> > The zone->lock contention becomes severe when there is only a single NUMA
> >> >> >> > node. The average latency is approximately 144us, with the maximum
> >> >> >> > latency exceeding 4ms.
> >> >> >> >
> >> >> >> > - Value set to 0
> >> >> >> >
> >> >> >> >      nsecs               : count     distribution
> >> >> >> >          0 -> 1          : 0        |                                        |
> >> >> >> >          2 -> 3          : 0        |                                        |
> >> >> >> >          4 -> 7          : 0        |                                        |
> >> >> >> >          8 -> 15         : 0        |                                        |
> >> >> >> >         16 -> 31         : 0        |                                        |
> >> >> >> >         32 -> 63         : 0        |                                        |
> >> >> >> >         64 -> 127        : 0        |                                        |
> >> >> >> >        128 -> 255        : 0        |                                        |
> >> >> >> >        256 -> 511        : 24       |                                        |
> >> >> >> >        512 -> 1023       : 2686     |                                        |
> >> >> >> >       1024 -> 2047       : 10246    |                                        |
> >> >> >> >       2048 -> 4095       : 4061529  |*********                               |
> >> >> >> >       4096 -> 8191       : 16894971 |****************************************|
> >> >> >> >       8192 -> 16383      : 6279310  |**************                          |
> >> >> >> >      16384 -> 32767      : 1658240  |***                                     |
> >> >> >> >      32768 -> 65535      : 445760   |*                                       |
> >> >> >> >      65536 -> 131071     : 110817   |                                        |
> >> >> >> >     131072 -> 262143     : 20279    |                                        |
> >> >> >> >     262144 -> 524287     : 4176     |                                        |
> >> >> >> >     524288 -> 1048575    : 436      |                                        |
> >> >> >> >    1048576 -> 2097151    : 8        |                                        |
> >> >> >> >    2097152 -> 4194303    : 2        |                                        |
> >> >> >> >
> >> >> >> > avg = 8401 nsecs, total: 247739809022 nsecs, count: 29488508
> >> >> >> >
> >> >> >> > After setting it to 0, the avg latency is reduced to around 8us, and the
> >> >> >> > max latency is less than 4ms.
> >> >> >> >
> >> >> >> > - Conclusion
> >> >> >> >
> >> >> >> > On this Intel CPU, this tuning doesn't help much. Latency-sensitive
> >> >> >> > applications work well with the default setting.
> >> >> >> >
> >> >> >> > It is worth noting that all the above data were tested using the upstream
> >> >> >> > kernel.
> >> >> >> >
> >> >> >> > Why introduce a sysctl knob?
> >> >> >> > ============================
> >> >> >> >
> >> >> >> > From the above data, it's clear that different CPU types have varying
> >> >> >> > allocation latencies concerning zone->lock contention. Typically, people
> >> >> >> > don't release individual kernel packages for each type of x86_64 CPU.
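For reference, a runtime knob named vm.pcp_batch_scale_max would be reachable under /proc/sys/vm like any other vm sysctl. Below is a minimal sketch of reading it, assuming a kernel that carries the proposed patch so that the file exists; on kernels without it the helper simply returns None. The helper name is hypothetical and not part of any existing tool.

```python
from pathlib import Path
from typing import Optional

# Path derived from the sysctl name vm.pcp_batch_scale_max used in this thread.
# Assumption: it is only present on kernels that include the proposed patch.
KNOB = Path("/proc/sys/vm/pcp_batch_scale_max")


def read_pcp_batch_scale_max(path: Path = KNOB) -> Optional[int]:
    """Return the current value, or None if the knob is not exposed or readable."""
    try:
        return int(path.read_text().strip())
    except (FileNotFoundError, PermissionError):
        return None


if __name__ == "__main__":
    value = read_pcp_batch_scale_max()
    print("vm.pcp_batch_scale_max =", value if value is not None else "not available")
```

Returning None rather than raising lets a single deployment script cover both patched and unpatched kernels, which is the mixed-fleet situation described in this thread.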
> >> >> >> >
> >> >> >> > Furthermore, for latency-insensitive applications, we can keep the default
> >> >> >> > setting for better throughput. In our production environment, we set this
> >> >> >> > value to 0 for applications running on Kubernetes servers while keeping it
> >> >> >> > at the default value of 5 for other applications like big data. It's not
> >> >> >> > common to release individual kernel packages for each application.
> >> >> >>
> >> >> >> Thanks for the detailed performance data!
> >> >> >>
> >> >> >> Is there any downside observed from setting CONFIG_PCP_BATCH_SCALE_MAX to 0
> >> >> >> in your environment?  If not, I suggest using 0 as the default for
> >> >> >> CONFIG_PCP_BATCH_SCALE_MAX, because we have clear evidence that
> >> >> >> CONFIG_PCP_BATCH_SCALE_MAX hurts latency for some workloads.  After
> >> >> >> that, if someone finds that other workloads need a larger
> >> >> >> CONFIG_PCP_BATCH_SCALE_MAX, we can make it tunable dynamically.
> >> >> >>
> >> >> >
> >> >> > The decision doesn’t rest with us, the kernel team at our company.
> >> >> > It’s made by the system administrators who manage a large number of
> >> >> > servers. The latency spikes only occur on the Kubernetes (k8s)
> >> >> > servers, not in other environments like big data servers. We have
> >> >> > informed other system administrators, such as those managing the big
> >> >> > data servers, about the latency spike issues, but they are unwilling
> >> >> > to make the change.
> >> >> >
> >> >> > No one wants to make changes unless there is evidence showing that the
> >> >> > old settings will negatively impact them. However, as you know,
> >> >> > latency is not a critical concern for big data; throughput is more
> >> >> > important. If we keep the current settings, we will have to release
> >> >> > different kernel packages for different environments, which is a
> >> >> > significant burden for us.
> >> >>
> >> >> Totally understand your requirements.
> >> >> And I think that this is better
> >> >> resolved in your downstream kernel.  If there is clear evidence
> >> >> that a small batch number hurts throughput for some workloads, we can
> >> >> make the change in the upstream kernel.
> >> >>
> >> >
> >> > Please don't make this more complicated. We are at an impasse.
> >> >
> >> > The key issue here is that the upstream kernel has a default value of
> >> > 5, not 0. If you can change it to 0, we can persuade our users to
> >> > follow the upstream changes. They currently set it to 5, not because
> >> > you, the author, chose this value, but because it is the default in
> >> > Linus's tree. Since it's in Linus's tree, kernel developers worldwide
> >> > support it. It's not just your decision as the author, but the entire
> >> > community supports this default.
> >> >
> >> > If, in the future, we find that the value of 0 is not suitable, you'll
> >> > tell us, "It is an issue in your downstream kernel, not in the
> >> > upstream kernel, so we won't accept it." PANIC.
> >>
> >> I don't think so.  I suggest you change the default value to 0.  If
> >> someone reports that his workloads need some other value, then we have
> >> evidence that different workloads need different values.  At that time,
> >> we can suggest adding a user-tunable knob.
> >>
> >
> > The problem is that others are unaware we've set it to 0, and I can't
> > constantly monitor the linux-mm mailing list. Additionally, it's
> > possible that you can't always keep an eye on it either.
>
> IIUC, they will use the default value.  Then, if there is any
> performance regression, they can report it.

Now we report it. What is your reply? "Keep it in your downstream
kernel." Wow, PANIC again.

>
> > I believe we should hear Andrew's suggestion. Andrew, what is your opinion?
>
> --
> Best Regards,
> Huang, Ying

--
Regards
Yafang