Re: [PATCH 3/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max

"Huang, Ying" <ying.huang@xxxxxxxxx> · Mon, 15 Jul 2024 09:09:41 +0800

Yafang Shao <laoar.shao@xxxxxxxxx> writes:

> On Fri, Jul 12, 2024 at 5:24 PM Yafang Shao <laoar.shao@xxxxxxxxx> wrote:
>>
>> On Fri, Jul 12, 2024 at 5:13 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
>> >
>> > Yafang Shao <laoar.shao@xxxxxxxxx> writes:
>> >
>> > > On Fri, Jul 12, 2024 at 4:26 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
>> > >>
>> > >> Yafang Shao <laoar.shao@xxxxxxxxx> writes:
>> > >>
>> > >> > On Fri, Jul 12, 2024 at 3:06 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
>> > >> >>
>> > >> >> Yafang Shao <laoar.shao@xxxxxxxxx> writes:
>> > >> >>
>> > >> >> > On Fri, Jul 12, 2024 at 2:18 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
>> > >> >> >>
>> > >> >> >> Yafang Shao <laoar.shao@xxxxxxxxx> writes:
>> > >> >> >>
>> > >> >> >> > On Fri, Jul 12, 2024 at 1:26 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
>> > >> >> >> >>
>> > >> >> >> >> Yafang Shao <laoar.shao@xxxxxxxxx> writes:
>> > >> >> >> >>
>> > >> >> >> >> > On Fri, Jul 12, 2024 at 11:07 AM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
>> > >> >> >> >> >>
>> > >> >> >> >> >> Yafang Shao <laoar.shao@xxxxxxxxx> writes:
>> > >> >> >> >> >>
>> > >> >> >> >> >> > On Fri, Jul 12, 2024 at 9:21 AM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
>> > >> >> >> >> >> >>
>> > >> >> >> >> >> >> Yafang Shao <laoar.shao@xxxxxxxxx> writes:
>> > >> >> >> >> >> >>
>> > >> >> >> >> >> >> > On Thu, Jul 11, 2024 at 6:51 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
>> > >> >> >> >> >> >> >>
>> > >> >> >> >> >> >> >> Yafang Shao <laoar.shao@xxxxxxxxx> writes:
>> > >> >> >> >> >> >> >>
>> > >> >> >> >> >> >> >> > On Thu, Jul 11, 2024 at 4:20 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
>> > >> >> >> >> >> >> >> >>
>> > >> >> >> >> >> >> >> >> Yafang Shao <laoar.shao@xxxxxxxxx> writes:
>> > >> >> >> >> >> >> >> >>
>> > >> >> >> >> >> >> >> >> > On Thu, Jul 11, 2024 at 2:44 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
>> > >> >> >> >> >> >> >> >> >>
>> > >> >> >> >> >> >> >> >> >> Yafang Shao <laoar.shao@xxxxxxxxx> writes:
>> > >> >> >> >> >> >> >> >> >>
>> > >> >> >> >> >> >> >> >> >> > On Wed, Jul 10, 2024 at 10:51 AM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
>> > >> >> >> >> >> >> >> >> >> >>
>> > >> >> >> >> >> >> >> >> >> >> Yafang Shao <laoar.shao@xxxxxxxxx> writes:
>> > >> >> >> >> >> >> >> >> >> >>
>> > >> >> >> >> >> >> >> >> >> >> > The configuration parameter PCP_BATCH_SCALE_MAX poses challenges for
>> > >> >> >> >> >> >> >> >> >> >> > quickly experimenting with specific workloads in a production environment,
>> > >> >> >> >> >> >> >> >> >> >> > particularly when monitoring latency spikes caused by contention on the
>> > >> >> >> >> >> >> >> >> >> >> > zone->lock. To address this, a new sysctl parameter vm.pcp_batch_scale_max
>> > >> >> >> >> >> >> >> >> >> >> > is introduced as a more practical alternative.
>> > >> >> >> >> >> >> >> >> >> >>
>> > >> >> >> >> >> >> >> >> >> >> In general, I'm neutral to the change.  I can understand that kernel
>> > >> >> >> >> >> >> >> >> >> >> configuration isn't as flexible as sysctl knob.  But, sysctl knob is ABI
>> > >> >> >> >> >> >> >> >> >> >> too.
>> > >> >> >> >> >> >> >> >> >> >>
>> > >> >> >> >> >> >> >> >> >> >> > To ultimately mitigate the zone->lock contention issue, several suggestions
>> > >> >> >> >> >> >> >> >> >> >> > have been proposed. One approach involves dividing large zones into multi
>> > >> >> >> >> >> >> >> >> >> >> > smaller zones, as suggested by Matthew[0], while another entails splitting
>> > >> >> >> >> >> >> >> >> >> >> > the zone->lock using a mechanism similar to memory arenas and shifting away
>> > >> >> >> >> >> >> >> >> >> >> > from relying solely on zone_id to identify the range of free lists a
>> > >> >> >> >> >> >> >> >> >> >> > particular page belongs to[1]. However, implementing these solutions is
>> > >> >> >> >> >> >> >> >> >> >> > likely to necessitate a more extended development effort.
>> > >> >> >> >> >> >> >> >> >> >>
>> > >> >> >> >> >> >> >> >> >> >> Per my understanding, the change will hurt instead of improve zone->lock
>> > >> >> >> >> >> >> >> >> >> >> contention.  Instead, it will reduce page allocation/freeing latency.
>> > >> >> >> >> >> >> >> >> >> >
>> > >> >> >> >> >> >> >> >> >> > I'm quite perplexed by your recent comment. You introduced a
>> > >> >> >> >> >> >> >> >> >> > configuration that has proven to be difficult to use, and you have
>> > >> >> >> >> >> >> >> >> >> > been resistant to suggestions for modifying it to a more user-friendly
>> > >> >> >> >> >> >> >> >> >> > and practical tuning approach. May I inquire about the rationale
>> > >> >> >> >> >> >> >> >> >> > behind introducing this configuration in the beginning?
>> > >> >> >> >> >> >> >> >> >>
>> > >> >> >> >> >> >> >> >> >> Sorry, I don't understand your words.  Do you need me to explain what is
>> > >> >> >> >> >> >> >> >> >> "neutral"?
>> > >> >> >> >> >> >> >> >> >
>> > >> >> >> >> >> >> >> >> > No, thanks.
>> > >> >> >> >> >> >> >> >> > After consulting with ChatGPT, I received a clear and comprehensive
>> > >> >> >> >> >> >> >> >> > explanation of what "neutral" means, providing me with a better
>> > >> >> >> >> >> >> >> >> > understanding of the concept.
>> > >> >> >> >> >> >> >> >> >
>> > >> >> >> >> >> >> >> >> > So, can you explain why you introduced it as a config in the beginning ?
>> > >> >> >> >> >> >> >> >>
>> > >> >> >> >> >> >> >> >> I think that I have explained it in the commit log of commit
>> > >> >> >> >> >> >> >> >> 52166607ecc9 ("mm: restrict the pcp batch scale factor to avoid too long
>> > >> >> >> >> >> >> >> >> latency").  Which introduces the config.
>> > >> >> >> >> >> >> >> >
>> > >> >> >> >> >> >> >> > What specifically are your expectations for how users should utilize
>> > >> >> >> >> >> >> >> > this config in real production workload?
>> > >> >> >> >> >> >> >> >
>> > >> >> >> >> >> >> >> >>
>> > >> >> >> >> >> >> >> >> Sysctl knob is ABI, which needs to be maintained forever.  Can you
>> > >> >> >> >> >> >> >> >> explain why you need it?  Why cannot you use a fixed value after initial
>> > >> >> >> >> >> >> >> >> experiments.
>> > >> >> >> >> >> >> >> >
>> > >> >> >> >> >> >> >> > Given the extensive scale of our production environment, with hundreds
>> > >> >> >> >> >> >> >> > of thousands of servers, it begs the question: how do you propose we
>> > >> >> >> >> >> >> >> > efficiently manage the various workloads that remain unaffected by the
>> > >> >> >> >> >> >> >> > sysctl change implemented on just a few thousand servers? Is it
>> > >> >> >> >> >> >> >> > feasible to expect us to recompile and release a new kernel for every
>> > >> >> >> >> >> >> >> > instance where the default value falls short? Surely, there must be
>> > >> >> >> >> >> >> >> > more practical and efficient approaches we can explore together to
>> > >> >> >> >> >> >> >> > ensure optimal performance across all workloads.
>> > >> >> >> >> >> >> >> >
>> > >> >> >> >> >> >> >> > When making improvements or modifications, kindly ensure that they are
>> > >> >> >> >> >> >> >> > not solely confined to a test or lab environment. It's vital to also
>> > >> >> >> >> >> >> >> > consider the needs and requirements of our actual users, along with
>> > >> >> >> >> >> >> >> > the diverse workloads they encounter in their daily operations.
>> > >> >> >> >> >> >> >>
>> > >> >> >> >> >> >> >> Have you found that your different systems requires different
>> > >> >> >> >> >> >> >> CONFIG_PCP_BATCH_SCALE_MAX value already?
>> > >> >> >> >> >> >> >
>> > >> >> >> >> >> >> > For specific workloads that introduce latency, we set the value to 0.
>> > >> >> >> >> >> >> > For other workloads, we keep it unchanged until we determine that the
>> > >> >> >> >> >> >> > default value is also suboptimal. What is the issue with this
>> > >> >> >> >> >> >> > approach?
>> > >> >> >> >> >> >>
>> > >> >> >> >> >> >> Firstly, this is a system wide configuration, not workload specific.
>> > >> >> >> >> >> >> So, other workloads run on the same system will be impacted too.  Will
>> > >> >> >> >> >> >> you run one workload only on one system?
>> > >> >> >> >> >> >
>> > >> >> >> >> >> > It seems we're living on different planets. You're happily working in
>> > >> >> >> >> >> > your lab environment, while I'm struggling with real-world production
>> > >> >> >> >> >> > issues.
>> > >> >> >> >> >> >
>> > >> >> >> >> >> > For servers:
>> > >> >> >> >> >> >
>> > >> >> >> >> >> > Server 1 to 10,000: vm.pcp_batch_scale_max = 0
>> > >> >> >> >> >> > Server 10,001 to 1,000,000: vm.pcp_batch_scale_max = 5
>> > >> >> >> >> >> > Server 1,000,001 and beyond: Happy with all values
>> > >> >> >> >> >> >
>> > >> >> >> >> >> > Is this hard to understand?
>> > >> >> >> >> >> >
>> > >> >> >> >> >> > In other words:
>> > >> >> >> >> >> >
>> > >> >> >> >> >> > For applications:
>> > >> >> >> >> >> >
>> > >> >> >> >> >> > Application 1 to 10,000: vm.pcp_batch_scale_max = 0
>> > >> >> >> >> >> > Application 10,001 to 1,000,000: vm.pcp_batch_scale_max = 5
>> > >> >> >> >> >> > Application 1,000,001 and beyond: Happy with all values
>> > >> >> >> >> >>
>> > >> >> >> >> >> Good to know this.  Thanks!
>> > >> >> >> >> >>
>> > >> >> >> >> >> >>
>> > >> >> >> >> >> >> Secondly, we need some evidences to introduce a new system ABI.  For
>> > >> >> >> >> >> >> example, we need to use different configuration on different systems
>> > >> >> >> >> >> >> otherwise some workloads will be hurt.  Can you provide some evidences
>> > >> >> >> >> >> >> to support your change?  IMHO, it's not good enough to say I don't know
>> > >> >> >> >> >> >> why I just don't want to change existing systems.  If so, it may be
>> > >> >> >> >> >> >> better to wait until you have more evidences.
>> > >> >> >> >> >> >
>> > >> >> >> >> >> > It seems the community encourages developers to experiment with their
>> > >> >> >> >> >> > improvements in lab environments using meticulously designed test
>> > >> >> >> >> >> > cases A, B, C, and as many others as they can imagine, ultimately
>> > >> >> >> >> >> > obtaining perfect data. However, it discourages developers from
>> > >> >> >> >> >> > directly addressing real-world workloads. Sigh.
>> > >> >> >> >> >>
>> > >> >> >> >> >> You cannot know whether your workloads benefit or hurt for the different
>> > >> >> >> >> >> batch number and how in your production environment?  If you cannot, how
>> > >> >> >> >> >> do you decide which workload deploys on which system (with different
>> > >> >> >> >> >> batch number configuration).  If you can, can you provide such
>> > >> >> >> >> >> information to support your patch?
>> > >> >> >> >> >
>> > >> >> >> >> > We leverage a meticulous selection of network metrics, particularly
>> > >> >> >> >> > focusing on TcpExt indicators, to keep a close eye on application
>> > >> >> >> >> > latency. This includes metrics such as TcpExt.TCPTimeouts,
>> > >> >> >> >> > TcpExt.RetransSegs, TcpExt.DelayedACKLost, TcpExt.TCPSlowStartRetrans,
>> > >> >> >> >> > TcpExt.TCPFastRetrans, TcpExt.TCPOFOQueue, and more.
>> > >> >> >> >> >
>> > >> >> >> >> > In instances where a problematic container terminates, we've noticed a
>> > >> >> >> >> > sharp spike in TcpExt.TCPTimeouts, reaching over 40 occurrences per
>> > >> >> >> >> > second, which serves as a clear indication that other applications are
>> > >> >> >> >> > experiencing latency issues. By fine-tuning the vm.pcp_batch_scale_max
>> > >> >> >> >> > parameter to 0, we've been able to drastically reduce the maximum
>> > >> >> >> >> > frequency of these timeouts to less than one per second.
>> > >> >> >> >>
>> > >> >> >> >> Thanks a lot for sharing this.  I learned much from it!
>> > >> >> >> >>
>> > >> >> >> >> > At present, we're selectively applying this adjustment to clusters
>> > >> >> >> >> > that exclusively host the identified problematic applications, and
>> > >> >> >> >> > we're closely monitoring their performance to ensure stability. To
>> > >> >> >> >> > date, we've observed no network latency issues as a result of this
>> > >> >> >> >> > change. However, we remain cautious about extending this optimization
>> > >> >> >> >> > to other clusters, as the decision ultimately depends on a variety of
>> > >> >> >> >> > factors.
>> > >> >> >> >> >
>> > >> >> >> >> > It's important to note that we're not eager to implement this change
>> > >> >> >> >> > across our entire fleet, as we recognize the potential for unforeseen
>> > >> >> >> >> > consequences. Instead, we're taking a cautious approach by initially
>> > >> >> >> >> > applying it to a limited number of servers. This allows us to assess
>> > >> >> >> >> > its impact and make informed decisions about whether or not to expand
>> > >> >> >> >> > its use in the future.
>> > >> >> >> >>
>> > >> >> >> >> So, you haven't observed any performance hurt yet.  Right?
>> > >> >> >> >
>> > >> >> >> > Right.
>> > >> >> >> >
>> > >> >> >> >> If you
>> > >> >> >> >> haven't, I suggest you to keep the patch in your downstream kernel for a
>> > >> >> >> >> while.  In the future, if you find the performance of some workloads
>> > >> >> >> >> hurts because of the new batch number, you can repost the patch with the
>> > >> >> >> >> supporting data.  If in the end, the performance of more and more
>> > >> >> >> >> workloads is good with the new batch number.  You may consider to make 0
>> > >> >> >> >> the default value :-)
>> > >> >> >> >
>> > >> >> >> > That is not how the real world works.
>> > >> >> >> >
>> > >> >> >> > In the real world:
>> > >> >> >> >
>> > >> >> >> > - No one knows what may happen in the future.
>> > >> >> >> >   Therefore, if possible, we should make systems flexible, unless
>> > >> >> >> > there is a strong justification for using a hard-coded value.
>> > >> >> >> >
>> > >> >> >> > - Minimize changes whenever possible.
>> > >> >> >> >   These systems have been working fine in the past, even if with lower
>> > >> >> >> > performance. Why make changes just for the sake of improving
>> > >> >> >> > performance? Does the key metric of your performance data truly matter
>> > >> >> >> > for their workload?
>> > >> >> >>
>> > >> >> >> These are good policy in your organization and business.  But, it's not
>> > >> >> >> necessary the policy that Linux kernel upstream should take.
>> > >> >> >
>> > >> >> > You mean the Upstream Linux kernel only designed for the lab ?
>> > >> >> >
>> > >> >> >>
>> > >> >> >> Community needs to consider long-term maintenance overhead, so it adds
>> > >> >> >> new ABI (such as sysfs knob) to kernel with the necessary justification.
>> > >> >> >> In general, it prefer to use a good default value or an automatic
>> > >> >> >> algorithm that works for everyone.  Community tries avoiding (or fixing)
>> > >> >> >> regressions as much as possible, but this will not stop kernel from
>> > >> >> >> changing, even if it's big.
>> > >> >> >
>> > >> >> > Please explain to me why the kernel config is not ABI, but the sysctl is ABI.
>> > >> >>
>> > >> >> Linux kernel will not break ABI until the last users stop using it.
>> > >> >
>> > >> > However, you haven't given a clear reference why the systl is an ABI.
>> > >>
>> > >> TBH, I don't find a formal document said it explicitly after some
>> > >> searching.
>> > >>
>> > >> Hi, Andrew, Matthew,
>> > >>
>> > >> Can you help me on this?  Whether sysctl is considered Linux kernel ABI?
>> > >> Or something similar?
>> > >
>> > > In my experience, we consistently utilize an if-statement to configure
>> > > sysctl settings in our production environments.
>> > >
>> > >     if [ -f ${sysctl_file} ]; then
>> > >         echo ${new_value} > ${sysctl_file}
>> > >     fi
>> > >
>> > > Additionally, you can incorporate this into rc.local to ensure the
>> > > configuration is applied upon system reboot.
>> > >
>> > > Even if you add it to the sysctl.conf without the if-statement, it
>> > > won't break anything.
>> > >
>> > > The pcp-related sysctl parameter, vm.percpu_pagelist_high_fraction,
>> > > underwent a naming change along with a functional update from its
>> > > predecessor, vm.percpu_pagelist_fraction, in commit 74f44822097c
>> > > ("mm/page_alloc: introduce vm.percpu_pagelist_high_fraction"). Despite
>> > > this significant change, there have been no reported issues or
>> > > complaints, suggesting that the renaming and functional update have
>> > > not negatively impacted the system's functionality.
>> >
>> > Thanks for your information.  From the commit, sysctl isn't considered
>> > as the kernel ABI.
>> >
>> > Even if so, IMHO, we shouldn't introduce a user tunable knob without a
>> > real world requirements except more flexibility.
>>
>> Indeed, I do not reside in the physical realm but within a virtualized
>> universe. (Of course, that is your perspective.)
>
> One final note: you explained very well what "neutral" means. Thank
> you for your comments.

Originally, my opinion to the change is neutral.  But, after more
thoughts, I changed my opinion to "we need more evidence to prove the
knob is necessary".

--
Best Regards,
Huang, Ying