Re: [PATCH] Revert "mm:vmscan: fix inaccurate reclaim during proactive reclaim"

"T.J. Mercier" <tjmercier@xxxxxxxxxx> · Wed, 24 Jan 2024 09:14:40 -0800

On Tue, Jan 23, 2024 at 8:19 AM Michal Hocko <mhocko@xxxxxxxx> wrote:
>
> On Tue 23-01-24 05:58:05, T.J. Mercier wrote:
> > On Tue, Jan 23, 2024 at 1:33 AM Michal Hocko <mhocko@xxxxxxxx> wrote:
> > >
> > > On Sun 21-01-24 21:44:12, T.J. Mercier wrote:
> > > > This reverts commit 0388536ac29104a478c79b3869541524caec28eb.
> > > >
> > > > Proactive reclaim on the root cgroup is 10x slower after this patch when
> > > > MGLRU is enabled, and completion times for proactive reclaim on much
> > > > smaller non-root cgroups take ~30% longer (with or without MGLRU).
> > >
> > > What is the reclaim target in these pro-active reclaim requests?
> >
> > Two targets:
> > 1) /sys/fs/cgroup/memory.reclaim
> > 2) /sys/fs/cgroup/uid_0/memory.reclaim (a bunch of Android system services)
>
> OK, I was not really clear. I was curious about nr_to_reclaim.
>
> > Note that lru_gen_shrink_node is used for 1, but shrink_node_memcgs is
> > used for 2.
> >
> > The 10x comes from the rate of reclaim (~70k pages/sec vs ~6.6k
> > pages/sec) for 1. After this revert the root reclaim took only about
> > 10 seconds. Before the revert it's still running after about 3 minutes
> > using a core at 100% the whole time, and I'm too impatient to wait
> > longer to record times for comparison.
> >
> > The 30% comes from the average of a few runs for 2:
> > Before revert:
> > $ adb wait-for-device && sleep 120 && adb root && adb shell -t 'time echo "" > /sys/fs/cgroup/uid_0/memory.reclaim'
>
> Ohh, so you want to reclaim all of it (resp. as much as possible).
>
Right, the main use case here is that we decide an application should be
backgrounded and its cgroup frozen. Before freezing, reclaim as much
as possible so that the frozen processes' RAM use is as low as
possible while they're dormant.

> [...]
>
> > > > After the patch the reclaim rate is
> > > > consistently ~6.6k pages/sec due to the reduced nr_pages value causing
> > > > scan aborts as soon as SWAP_CLUSTER_MAX pages are reclaimed. The
> > > > proactive reclaim doesn't complete after several minutes because
> > > > try_to_free_mem_cgroup_pages is still capable of reclaiming pages in
> > > > tiny SWAP_CLUSTER_MAX page chunks and nr_retries is never decremented.
> > >
> > > I do not understand this part. How does a smaller reclaim target manage
> > > to have reclaimed > 0 while larger one doesn't?
> >
> > They both are able to make progress. The main difference is that a
> > single iteration of try_to_free_mem_cgroup_pages with MGLRU ends soon
> > after it reclaims nr_to_reclaim, and before it touches all memcgs. So
> > a single iteration really will reclaim only about SWAP_CLUSTER_MAX-ish
> > pages with MGLRU. Without MGLRU the memcg walk is not aborted
> > immediately after nr_to_reclaim is reached, so a single call to
> > try_to_free_mem_cgroup_pages can actually reclaim thousands of pages
> > even when sc->nr_to_reclaim is 32. (I.e., MGLRU overreclaims less.)
> > https://lore.kernel.org/lkml/20221201223923.873696-1-yuzhao@xxxxxxxxxx/
>
> OK, I do see how try_to_free_mem_cgroup_pages might over reclaim but I
> do not really follow how increasing the batch actually fixes the issue
> that there is always progress being made and therefore memory_reclaim
> takes ages to terminate?

Oh, because the page reclaim rate with a small batch is just much
lower than with a very large batch. Each time a batch completes we have
to restart reclaim from scratch, and it takes a while before we get back
to the point where we're actually freeing/swapping pages again. That
setup cost is amortized over many more pages with a large batch size,
but appears to be pretty significant for small batch sizes.
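
To make the amortization argument concrete, here is a toy model, not a
measurement. It assumes every batch pays one fixed setup cost plus a
constant per-page cost; the function name and both cost values are
hypothetical, in arbitrary time units, and chosen only to show the
shape of the curve. They are not derived from the numbers above.

```python
def reclaim_rate(batch_pages, setup_cost, per_page_cost):
    """Pages reclaimed per unit time when each batch pays a fixed setup cost.

    Toy model only: time per batch = setup_cost + per_page_cost * batch_pages.
    """
    if batch_pages <= 0:
        raise ValueError("batch_pages must be positive")
    return batch_pages / (setup_cost + per_page_cost * batch_pages)


# Hypothetical costs (arbitrary units), picked for illustration only.
SETUP_COST = 1.0
PER_PAGE_COST = 0.01

if __name__ == "__main__":
    # 32 stands in for SWAP_CLUSTER_MAX; the larger values are arbitrary.
    for batch in (32, 1024, 32768):
        rate = reclaim_rate(batch, SETUP_COST, PER_PAGE_COST)
        print(f"batch={batch:6d} pages  rate={rate:.1f} pages/unit-time")
```

In this model the rate rises with batch size and levels off near
1 / per_page_cost, so a small batch spends most of its time on setup
while a large one is dominated by the per-page work.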