Hi, Yafang,

Yafang Shao <laoar.shao@xxxxxxxxx> writes:

> Currently, we're encountering latency spikes in our container environment
> when a specific container with multiple Python-based tasks exits.

Can you share some data?  On what kind of machine, and how long is the
latency?

> These
> tasks may hold the zone->lock for an extended period, significantly
> impacting latency for other containers attempting to allocate memory.

So, it is the allocation latency that is affected, not the application
exit latency?

Could you measure the run time of free_pcppages_bulk()?  This can be
done via the ftrace function_graph tracer (a possible command sequence
is sketched at the end of this mail).  We want to check whether this is
a common issue.

In commit 52166607ecc9 ("mm: restrict the pcp batch scale factor to
avoid too long latency"), we measured the allocation/free latency for
different CONFIG_PCP_BATCH_SCALE_MAX values.  The target of that commit
is to keep the latency <= 100us.

> As a workaround, we've found that minimizing the pagelist size, such as
> setting it to 4 times the batch size, can help mitigate these spikes.
> However, managing vm.percpu_pagelist_high_fraction across a large fleet of
> servers poses challenges due to variations in CPU counts, NUMA nodes, and
> physical memory capacities.
>
> To enhance practicality, we propose allowing the setting of -1 for
> vm.percpu_pagelist_high_fraction to designate a minimum pagelist size.

If it is really necessary, can we just use a large enough value for
vm.percpu_pagelist_high_fraction?  For example, (1 << 30)?  (An example
sysctl invocation is also appended at the end of this mail.)

> Furthermore, considering the challenges associated with utilizing
> vm.percpu_pagelist_high_fraction, it would be beneficial to introduce a
> more intuitive parameter, vm.percpu_pagelist_high_size, that would permit
> direct specification of the pagelist size as a multiple of the batch size.
> This methodology would mirror the functionality of vm.dirty_ratio and
> vm.dirty_bytes, providing users with greater flexibility and control.
>
> We have discussed the possibility of introducing multiple small zones to
> mitigate the contention on the zone->lock[0], but this approach is likely
> to require a longer-term implementation effort.
>
> Link: https://lore.kernel.org/linux-mm/ZnTrZ9mcAIRodnjx@xxxxxxxxxxxxxxxxxxxx/ [0]
> Signed-off-by: Yafang Shao <laoar.shao@xxxxxxxxx>
> Cc: Matthew Wilcox <willy@xxxxxxxxxxxxx>
> Cc: David Rientjes <rientjes@xxxxxxxxxx>
> Cc: "Huang, Ying" <ying.huang@xxxxxxxxx>
> Cc: Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx>

[snip]

--
Best Regards,
Huang, Ying
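
P.S. A minimal measurement sketch along the lines suggested above,
assuming tracefs is mounted at /sys/kernel/tracing and that
free_pcppages_bulk() has not been inlined away on the running kernel
(the grep below checks that):

  # cd /sys/kernel/tracing
  # grep -w free_pcppages_bulk available_filter_functions  # confirm the symbol is traceable
  # echo free_pcppages_bulk > set_graph_function           # graph only this function
  # echo function_graph > current_tracer
  # echo 100 > tracing_thresh                              # optional: only log calls longer than 100us
  # echo 1 > tracing_on
    ... reproduce the container exit / allocation workload ...
  # echo 0 > tracing_on
  # cat trace

In the function_graph output the duration of each call is reported in
microseconds; calls longer than 100us are flagged with '!' and calls
longer than 1ms with '#', which makes the outliers easy to spot.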
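
And the large-fraction workaround mentioned above would be a one-line
sysctl per host, for example:

  # sysctl -w vm.percpu_pagelist_high_fraction=1073741824  # i.e. (1 << 30)

With a fraction that large the computed per-CPU high limit should be
driven down to roughly its floor, which appears to approximate the
"minimum pagelist size" behavior the patch proposes, without adding a
new ABI.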