On Fri, Jul 12, 2024 at 3:06 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote: > > Yafang Shao <laoar.shao@xxxxxxxxx> writes: > > > On Fri, Jul 12, 2024 at 2:18 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote: > >> > >> Yafang Shao <laoar.shao@xxxxxxxxx> writes: > >> > >> > On Fri, Jul 12, 2024 at 1:26 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote: > >> >> > >> >> Yafang Shao <laoar.shao@xxxxxxxxx> writes: > >> >> > >> >> > On Fri, Jul 12, 2024 at 11:07 AM Huang, Ying <ying.huang@xxxxxxxxx> wrote: > >> >> >> > >> >> >> Yafang Shao <laoar.shao@xxxxxxxxx> writes: > >> >> >> > >> >> >> > On Fri, Jul 12, 2024 at 9:21 AM Huang, Ying <ying.huang@xxxxxxxxx> wrote: > >> >> >> >> > >> >> >> >> Yafang Shao <laoar.shao@xxxxxxxxx> writes: > >> >> >> >> > >> >> >> >> > On Thu, Jul 11, 2024 at 6:51 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote: > >> >> >> >> >> > >> >> >> >> >> Yafang Shao <laoar.shao@xxxxxxxxx> writes: > >> >> >> >> >> > >> >> >> >> >> > On Thu, Jul 11, 2024 at 4:20 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote: > >> >> >> >> >> >> > >> >> >> >> >> >> Yafang Shao <laoar.shao@xxxxxxxxx> writes: > >> >> >> >> >> >> > >> >> >> >> >> >> > On Thu, Jul 11, 2024 at 2:44 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote: > >> >> >> >> >> >> >> > >> >> >> >> >> >> >> Yafang Shao <laoar.shao@xxxxxxxxx> writes: > >> >> >> >> >> >> >> > >> >> >> >> >> >> >> > On Wed, Jul 10, 2024 at 10:51 AM Huang, Ying <ying.huang@xxxxxxxxx> wrote: > >> >> >> >> >> >> >> >> > >> >> >> >> >> >> >> >> Yafang Shao <laoar.shao@xxxxxxxxx> writes: > >> >> >> >> >> >> >> >> > >> >> >> >> >> >> >> >> > The configuration parameter PCP_BATCH_SCALE_MAX poses challenges for > >> >> >> >> >> >> >> >> > quickly experimenting with specific workloads in a production environment, > >> >> >> >> >> >> >> >> > particularly when monitoring latency spikes caused by contention on the > >> >> >> >> >> >> >> >> > zone->lock. To address this, a new sysctl parameter vm.pcp_batch_scale_max > >> >> >> >> >> >> >> >> > is introduced as a more practical alternative. > >> >> >> >> >> >> >> >> > >> >> >> >> >> >> >> >> In general, I'm neutral to the change. I can understand that kernel > >> >> >> >> >> >> >> >> configuration isn't as flexible as sysctl knob. But, sysctl knob is ABI > >> >> >> >> >> >> >> >> too. > >> >> >> >> >> >> >> >> > >> >> >> >> >> >> >> >> > To ultimately mitigate the zone->lock contention issue, several suggestions > >> >> >> >> >> >> >> >> > have been proposed. One approach involves dividing large zones into multi > >> >> >> >> >> >> >> >> > smaller zones, as suggested by Matthew[0], while another entails splitting > >> >> >> >> >> >> >> >> > the zone->lock using a mechanism similar to memory arenas and shifting away > >> >> >> >> >> >> >> >> > from relying solely on zone_id to identify the range of free lists a > >> >> >> >> >> >> >> >> > particular page belongs to[1]. However, implementing these solutions is > >> >> >> >> >> >> >> >> > likely to necessitate a more extended development effort. > >> >> >> >> >> >> >> >> > >> >> >> >> >> >> >> >> Per my understanding, the change will hurt instead of improve zone->lock > >> >> >> >> >> >> >> >> contention. Instead, it will reduce page allocation/freeing latency. > >> >> >> >> >> >> >> > > >> >> >> >> >> >> >> > I'm quite perplexed by your recent comment. You introduced a > >> >> >> >> >> >> >> > configuration that has proven to be difficult to use, and you have > >> >> >> >> >> >> >> > been resistant to suggestions for modifying it to a more user-friendly > >> >> >> >> >> >> >> > and practical tuning approach. May I inquire about the rationale > >> >> >> >> >> >> >> > behind introducing this configuration in the beginning? > >> >> >> >> >> >> >> > >> >> >> >> >> >> >> Sorry, I don't understand your words. Do you need me to explain what is > >> >> >> >> >> >> >> "neutral"? > >> >> >> >> >> >> > > >> >> >> >> >> >> > No, thanks. > >> >> >> >> >> >> > After consulting with ChatGPT, I received a clear and comprehensive > >> >> >> >> >> >> > explanation of what "neutral" means, providing me with a better > >> >> >> >> >> >> > understanding of the concept. > >> >> >> >> >> >> > > >> >> >> >> >> >> > So, can you explain why you introduced it as a config in the beginning ? > >> >> >> >> >> >> > >> >> >> >> >> >> I think that I have explained it in the commit log of commit > >> >> >> >> >> >> 52166607ecc9 ("mm: restrict the pcp batch scale factor to avoid too long > >> >> >> >> >> >> latency"). Which introduces the config. > >> >> >> >> >> > > >> >> >> >> >> > What specifically are your expectations for how users should utilize > >> >> >> >> >> > this config in real production workload? > >> >> >> >> >> > > >> >> >> >> >> >> > >> >> >> >> >> >> Sysctl knob is ABI, which needs to be maintained forever. Can you > >> >> >> >> >> >> explain why you need it? Why cannot you use a fixed value after initial > >> >> >> >> >> >> experiments. > >> >> >> >> >> > > >> >> >> >> >> > Given the extensive scale of our production environment, with hundreds > >> >> >> >> >> > of thousands of servers, it begs the question: how do you propose we > >> >> >> >> >> > efficiently manage the various workloads that remain unaffected by the > >> >> >> >> >> > sysctl change implemented on just a few thousand servers? Is it > >> >> >> >> >> > feasible to expect us to recompile and release a new kernel for every > >> >> >> >> >> > instance where the default value falls short? Surely, there must be > >> >> >> >> >> > more practical and efficient approaches we can explore together to > >> >> >> >> >> > ensure optimal performance across all workloads. > >> >> >> >> >> > > >> >> >> >> >> > When making improvements or modifications, kindly ensure that they are > >> >> >> >> >> > not solely confined to a test or lab environment. It's vital to also > >> >> >> >> >> > consider the needs and requirements of our actual users, along with > >> >> >> >> >> > the diverse workloads they encounter in their daily operations. > >> >> >> >> >> > >> >> >> >> >> Have you found that your different systems requires different > >> >> >> >> >> CONFIG_PCP_BATCH_SCALE_MAX value already? > >> >> >> >> > > >> >> >> >> > For specific workloads that introduce latency, we set the value to 0. > >> >> >> >> > For other workloads, we keep it unchanged until we determine that the > >> >> >> >> > default value is also suboptimal. What is the issue with this > >> >> >> >> > approach? > >> >> >> >> > >> >> >> >> Firstly, this is a system wide configuration, not workload specific. > >> >> >> >> So, other workloads run on the same system will be impacted too. Will > >> >> >> >> you run one workload only on one system? > >> >> >> > > >> >> >> > It seems we're living on different planets. You're happily working in > >> >> >> > your lab environment, while I'm struggling with real-world production > >> >> >> > issues. > >> >> >> > > >> >> >> > For servers: > >> >> >> > > >> >> >> > Server 1 to 10,000: vm.pcp_batch_scale_max = 0 > >> >> >> > Server 10,001 to 1,000,000: vm.pcp_batch_scale_max = 5 > >> >> >> > Server 1,000,001 and beyond: Happy with all values > >> >> >> > > >> >> >> > Is this hard to understand? > >> >> >> > > >> >> >> > In other words: > >> >> >> > > >> >> >> > For applications: > >> >> >> > > >> >> >> > Application 1 to 10,000: vm.pcp_batch_scale_max = 0 > >> >> >> > Application 10,001 to 1,000,000: vm.pcp_batch_scale_max = 5 > >> >> >> > Application 1,000,001 and beyond: Happy with all values > >> >> >> > >> >> >> Good to know this. Thanks! > >> >> >> > >> >> >> >> > >> >> >> >> Secondly, we need some evidences to introduce a new system ABI. For > >> >> >> >> example, we need to use different configuration on different systems > >> >> >> >> otherwise some workloads will be hurt. Can you provide some evidences > >> >> >> >> to support your change? IMHO, it's not good enough to say I don't know > >> >> >> >> why I just don't want to change existing systems. If so, it may be > >> >> >> >> better to wait until you have more evidences. > >> >> >> > > >> >> >> > It seems the community encourages developers to experiment with their > >> >> >> > improvements in lab environments using meticulously designed test > >> >> >> > cases A, B, C, and as many others as they can imagine, ultimately > >> >> >> > obtaining perfect data. However, it discourages developers from > >> >> >> > directly addressing real-world workloads. Sigh. > >> >> >> > >> >> >> You cannot know whether your workloads benefit or hurt for the different > >> >> >> batch number and how in your production environment? If you cannot, how > >> >> >> do you decide which workload deploys on which system (with different > >> >> >> batch number configuration). If you can, can you provide such > >> >> >> information to support your patch? > >> >> > > >> >> > We leverage a meticulous selection of network metrics, particularly > >> >> > focusing on TcpExt indicators, to keep a close eye on application > >> >> > latency. This includes metrics such as TcpExt.TCPTimeouts, > >> >> > TcpExt.RetransSegs, TcpExt.DelayedACKLost, TcpExt.TCPSlowStartRetrans, > >> >> > TcpExt.TCPFastRetrans, TcpExt.TCPOFOQueue, and more. > >> >> > > >> >> > In instances where a problematic container terminates, we've noticed a > >> >> > sharp spike in TcpExt.TCPTimeouts, reaching over 40 occurrences per > >> >> > second, which serves as a clear indication that other applications are > >> >> > experiencing latency issues. By fine-tuning the vm.pcp_batch_scale_max > >> >> > parameter to 0, we've been able to drastically reduce the maximum > >> >> > frequency of these timeouts to less than one per second. > >> >> > >> >> Thanks a lot for sharing this. I learned much from it! > >> >> > >> >> > At present, we're selectively applying this adjustment to clusters > >> >> > that exclusively host the identified problematic applications, and > >> >> > we're closely monitoring their performance to ensure stability. To > >> >> > date, we've observed no network latency issues as a result of this > >> >> > change. However, we remain cautious about extending this optimization > >> >> > to other clusters, as the decision ultimately depends on a variety of > >> >> > factors. > >> >> > > >> >> > It's important to note that we're not eager to implement this change > >> >> > across our entire fleet, as we recognize the potential for unforeseen > >> >> > consequences. Instead, we're taking a cautious approach by initially > >> >> > applying it to a limited number of servers. This allows us to assess > >> >> > its impact and make informed decisions about whether or not to expand > >> >> > its use in the future. > >> >> > >> >> So, you haven't observed any performance hurt yet. Right? > >> > > >> > Right. > >> > > >> >> If you > >> >> haven't, I suggest you to keep the patch in your downstream kernel for a > >> >> while. In the future, if you find the performance of some workloads > >> >> hurts because of the new batch number, you can repost the patch with the > >> >> supporting data. If in the end, the performance of more and more > >> >> workloads is good with the new batch number. You may consider to make 0 > >> >> the default value :-) > >> > > >> > That is not how the real world works. > >> > > >> > In the real world: > >> > > >> > - No one knows what may happen in the future. > >> > Therefore, if possible, we should make systems flexible, unless > >> > there is a strong justification for using a hard-coded value. > >> > > >> > - Minimize changes whenever possible. > >> > These systems have been working fine in the past, even if with lower > >> > performance. Why make changes just for the sake of improving > >> > performance? Does the key metric of your performance data truly matter > >> > for their workload? > >> > >> These are good policy in your organization and business. But, it's not > >> necessary the policy that Linux kernel upstream should take. > > > > You mean the Upstream Linux kernel only designed for the lab ? > > > >> > >> Community needs to consider long-term maintenance overhead, so it adds > >> new ABI (such as sysfs knob) to kernel with the necessary justification. > >> In general, it prefer to use a good default value or an automatic > >> algorithm that works for everyone. Community tries avoiding (or fixing) > >> regressions as much as possible, but this will not stop kernel from > >> changing, even if it's big. > > > > Please explain to me why the kernel config is not ABI, but the sysctl is ABI. > > Linux kernel will not break ABI until the last users stop using it. However, you haven't given a clear reference why the systl is an ABI. > This usually means tens years if not forever. Kernel config options > aren't considered ABI, they are used by developers and distributions. > They come and go from version to version. > > >> > >> IIUC, because of the different requirements, there are upstream and > >> downstream kernels. > > > > The downstream developer backport features from the upsteam kernel, > > and if they find issues in the upstream kernel, they should contribute > > it back. That is how the Linux Community works, right ? > > Yes. If they are issues for upstream kernel too. > > -- > Best Regards, > Huang, Ying -- Regards Yafang