Re: [PATCH v3 3/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max

Yafang Shao <laoar.shao@xxxxxxxxx> · Mon, 5 Aug 2024 11:17:26 +0800

On Mon, Aug 5, 2024 at 11:05 AM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
>
> Yafang Shao <laoar.shao@xxxxxxxxx> writes:
>
> > On Mon, Aug 5, 2024 at 9:41 AM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
> >>
> >> Yafang Shao <laoar.shao@xxxxxxxxx> writes:
> >>
> >> [snip]
> >>
> >> >
> >> > Why introduce a sysctl knob?
> >> > ============================
> >> >
> >> > From the above data, it's clear that different CPU types have varying
> >> > allocation latencies concerning zone->lock contention. Typically, people
> >> > don't release individual kernel packages for each type of x86_64 CPU.
> >> >
> >> > Furthermore, for latency-insensitive applications, we can keep the default
> >> > setting for better throughput.
> >>
> >> Do you have any data to prove that the default setting is better for
> >> throughput?  If so, that would be strong support for your patch.
> >
> > No, I don't. The primary reason we can't change the default value from
> > 5 to 0 across our fleet of servers is that you initially set it to 5.
> > The sysadmins believe you had a strong reason for setting it to 5 by
> > default; otherwise, it would be considered careless for the upstream
> > kernel. I also believe you must have had a solid justification for
> > setting the default value to 5; otherwise, why would you have
> > submitted your patches?
>
> In commit 52166607ecc9 ("mm: restrict the pcp batch scale factor to
> avoid too long latency"), I tried my best to run tests on the machines
> available with a micro-benchmark (will-it-scale/page_fault1) which
> exercises the kernel page allocator heavily.  From the data in the
> commit, a larger CONFIG_PCP_BATCH_SCALE_MAX helps throughput a little,
> but not much.  The 99% alloc/free latency can be kept within about
> 100us with CONFIG_PCP_BATCH_SCALE_MAX == 5.  So, we chose 5 as the
> default value.
>
> But, we can always improve the default value with more data, on more
> types of machines and with more types of benchmarks, etc.
>
> Your data suggest a smaller default value, because they show that a
> larger default value causes latency spikes (as large as tens of ms)
> for some practical workloads, which weren't tested previously.  In
> contrast, we don't have strong data to show the throughput advantages
> of a larger CONFIG_PCP_BATCH_SCALE_MAX value.
>
> So, I suggest using a smaller default value for
> CONFIG_PCP_BATCH_SCALE_MAX.  But we may need more tests to check the
> data for 1, 2, 3, and 4, in addition to 0 and 5, to determine the best
> choice.

Which smaller default value would be better? How can we ensure that
other workloads, which we haven't tested, will work well with this new
default value? If you have a better default value in mind, would you
consider sending a patch for it? I would be happy to test it with my
test case.

--
Regards
Yafang