Re: [PATCH] mm: memcg: Use larger chunks for proactive reclaim

Johannes Weiner <hannes@xxxxxxxxxxx> · Thu, 1 Feb 2024 10:34:28 -0500

On Thu, Feb 01, 2024 at 02:57:22PM +0100, Michal Koutný wrote:
> Hello.
> 
> On Wed, Jan 31, 2024 at 04:24:41PM +0000, "T.J. Mercier" <tjmercier@xxxxxxxxxx> wrote:
> >  		reclaimed = try_to_free_mem_cgroup_pages(memcg,
> > -					min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX),
> > +					max((nr_to_reclaim - nr_reclaimed) / 4,
> > +					    (nr_to_reclaim - nr_reclaimed) % 4),
> 
> The 1/4 factor looks like magic.

It's just cutting the work into quarters to balance throughput with
goal accuracy. It's no more or less magic than DEF_PRIORITY being 12,
or SWAP_CLUSTER_MAX being 32.

> Commit 0388536ac291 says:
> | In theory, the amount of reclaimed would be in [request, 2 * request).

Looking at the code, I'm not quite sure this can be read that
literally. Efly might be able to elaborate, but we do a full loop of
all nodes and cgroups in the tree before checking nr_to_reclaim, and
rely on the priority level for granularity. So the request size and the
complexity of the cgroup tree both play a role. I don't know where the
exact factor of two would come from.

IMO it's more accurate to phrase it like this:

Reclaim tries to balance nr_to_reclaim fidelity with fairness across
nodes and cgroups over which the pages are spread. As such, the bigger
the request, the bigger the absolute overreclaim error. Historic
in-kernel users of reclaim have used fixed, small request batches to
approach an appropriate reclaim rate over time. When we reclaim a user
request of arbitrary size, use decaying batches to manage error while
maintaining reasonable throughput.
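
A minimal sketch of the decaying-batch loop described above, as a
userspace model rather than kernel code. The batch formula is the one
from the quoted hunk. fake_reclaim() is a hypothetical stand-in for
try_to_free_mem_cgroup_pages(): it assumes each request is raised to
at least SWAP_CLUSTER_MAX and then met exactly, which real reclaim
does not guarantee.

```c
#define SWAP_CLUSTER_MAX 32UL	/* value cited in this thread */

/* Batch size from the quoted hunk: a quarter of what is left, or the
 * remainder mod 4 once that is the larger of the two. */
static unsigned long batch_size(unsigned long remaining)
{
	unsigned long quarter = remaining / 4;
	unsigned long rest = remaining % 4;

	return quarter > rest ? quarter : rest;
}

/* Hypothetical reclaim model (assumption): the request is raised to at
 * least SWAP_CLUSTER_MAX and exactly that many pages are reclaimed. */
static unsigned long fake_reclaim(unsigned long nr_pages)
{
	return nr_pages < SWAP_CLUSTER_MAX ? SWAP_CLUSTER_MAX : nr_pages;
}

/* Run the loop until the goal is met. Returns the number of reclaim
 * calls and stores the total number of pages reclaimed in *total. */
static unsigned int simulate_proactive_reclaim(unsigned long nr_to_reclaim,
					       unsigned long *total)
{
	unsigned long nr_reclaimed = 0;
	unsigned int calls = 0;

	while (nr_reclaimed < nr_to_reclaim) {
		nr_reclaimed += fake_reclaim(batch_size(nr_to_reclaim - nr_reclaimed));
		calls++;
	}
	*total = nr_reclaimed;
	return calls;
}
```

Under this model, simulate_proactive_reclaim(16, &total) finishes in
one call with total == 32, and larger requests shrink geometrically
until the batches fall below SWAP_CLUSTER_MAX.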

> Doesn't this suggest 1/2 as a better option? (I didn't pursue the
> theory.)

That was TJ's first suggestion as well, but for the reasons above I
suggested quartering as a safer option.

> Also IMO importantly, when nr_to_reclaim - nr_reclaimed is less than 8,
> the formula gives arbitrary (unrelated to delta's magnitude) values.

try_to_free_mem_cgroup_pages() rounds the request up to
SWAP_CLUSTER_MAX. So the relative error margin is much higher at the
smaller end of requests anyway. But practically speaking, users care
much less if you reclaim 32 pages when 16 were requested than if you
reclaim 2G when 1G was requested.
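
To make the small-delta case concrete, here is a short sketch that
tabulates what the quoted formula yields for deltas of 1 to 8 pages and
what the request becomes after rounding. The helper names are
hypothetical, and effective_request() only models the rounding
described above as a simple lower bound of SWAP_CLUSTER_MAX.

```c
#include <stdio.h>

#define SWAP_CLUSTER_MAX 32UL	/* value cited in this thread */

/* max(delta / 4, delta % 4), as in the quoted hunk. */
static unsigned long quarter_batch(unsigned long delta)
{
	unsigned long quarter = delta / 4;
	unsigned long rest = delta % 4;

	return quarter > rest ? quarter : rest;
}

/* Assumed model of the rounding: never ask for fewer than
 * SWAP_CLUSTER_MAX pages. */
static unsigned long effective_request(unsigned long nr_pages)
{
	return nr_pages < SWAP_CLUSTER_MAX ? SWAP_CLUSTER_MAX : nr_pages;
}

/* Print one line per delta so the raw and rounded values can be compared. */
static void print_small_deltas(void)
{
	unsigned long delta;

	for (delta = 1; delta <= 8; delta++)
		printf("delta=%lu batch=%lu effective=%lu\n", delta,
		       quarter_batch(delta),
		       effective_request(quarter_batch(delta)));
}
```

The raw batch values do jump around (7 gives 3, 4 gives 1), but in this
model every one of them becomes a 32-page request.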