On Mon, Feb 5, 2024 at 1:16 PM Michal Hocko <mhocko@xxxxxxxx> wrote:
>
> On Mon 05-02-24 12:47:47, T.J. Mercier wrote:
> > On Mon, Feb 5, 2024 at 12:36 PM Michal Hocko <mhocko@xxxxxxxx> wrote:
> [...]
> > > Think of something like
> > >     timeout $TIMEOUT echo $TARGET > $MEMCG_PATH/memory.reclaim
> > > where timeout acts as a stop gap if the reclaim cannot finish in
> > > TIMEOUT.
> >
> > Yeah I get the desired behavior, but using sc->nr_reclaimed to achieve
> > it is what's bothering me.
>
> I am not really happy about this subtlety. If we have a better way then
> let's do it. Better in its own patch, though.
>
> > It's already wired up that way though, so if you want to make this
> > change now then I can try to test for the difference using really
> > large reclaim targets.
>
> Yes, please. If you want it in a separate patch then no objection from
> me of course. If you do not like the nr_to_reclaim bailout then maybe
> we can go with a simple break out flag in scan_control.
>
> Thanks!

It's a bit difficult to test under the too_many_isolated check, so I
moved the fatal_signal_pending check outside of it and tried with that.

Performing full reclaim on the /uid_0 cgroup with a 250ms delay before
SIGKILL, I got an average of 16ms better latency with sc->nr_to_reclaim
than with SWAP_CLUSTER_MAX across 20 runs (ignoring one 1s outlier in
the SWAP_CLUSTER_MAX runs). The return values from memory_reclaim are
different, since with sc->nr_to_reclaim we "succeed" and never reach
the signal_pending check that returns -EINTR, but I don't think it
matters since the return code of the killed process is 137 (SIGKILL)
in both cases.

With SWAP_CLUSTER_MAX there was an outlier at nearly 1s, and in general
the latency numbers were noisier: 2% RSD vs 13% RSD. I'm guessing that's
a function of nr_to_scan occasionally being much less than
SWAP_CLUSTER_MAX, causing nr[lru] to drain slowly. But it could also
simply have been scheduled out more often at the cond_resched in
shrink_lruvec, which would help explain the 1s outlier. I don't have
enough debug info on the outlier to say much more.

With sc->nr_to_reclaim, the largest sc->nr_reclaimed value I saw was
about 2^53 for a sc->nr_to_reclaim of 2^51, but for large memcg
hierarchies I think it's possible to get more than that. There were
only 15 cgroups under /uid_0. This is the only thing that gives me
pause, since we could touch more than 2k cgroups in shrink_node_memcgs,
each one adding 4 * 2^51, potentially overflowing sc->nr_reclaimed.
Looks testable, but I didn't get to it.
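
To spell out the arithmetic behind that last concern: sc->nr_reclaimed
is an unsigned long, so on 64-bit it wraps at 2^64. If each memcg
visited in shrink_node_memcgs can contribute up to 4 * 2^51 = 2^53
(presumably one bailout return per evictable LRU list), then
2^64 / 2^53 = 2^11 = 2048 memcgs are enough to wrap the counter, which
is why the more-than-2k-cgroups case is the part that worries me.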
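
For reference, the test modification was along these lines. This is
just a sketch from memory rather than the exact diff I ran, but the two
relevant pieces are that fatal_signal_pending is checked on every call
instead of only under the too_many_isolated throttle, and that the
return value on bailout is the thing being compared (sc->nr_to_reclaim
vs SWAP_CLUSTER_MAX):

static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
		struct lruvec *lruvec, struct scan_control *sc,
		enum lru_list lru)
{
	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
	bool stalled = false;

	/*
	 * Test hack: check for a fatal signal unconditionally rather
	 * than only while throttled under too_many_isolated(), so the
	 * bailout path is exercised even when isolation limits aren't
	 * hit. Returning sc->nr_to_reclaim here (rather than
	 * SWAP_CLUSTER_MAX) is what lets the nr_reclaimed >=
	 * nr_to_reclaim checks in the callers terminate the whole
	 * reclaim walk quickly.
	 */
	if (fatal_signal_pending(current))
		return sc->nr_to_reclaim;

	while (unlikely(too_many_isolated(pgdat, is_file_lru(lru), sc))) {
		if (stalled)
			return 0;

		/* wait a bit for the reclaimer. */
		stalled = true;
		reclaim_throttle(pgdat, VMSCAN_THROTTLE_ISOLATED);
	}

	/* ... rest of shrink_inactive_list() unchanged ... */
}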
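
And in case it helps, here is roughly what I imagine the scan_control
break-out flag you mention could look like. The field name
(abort_reclaim) and where it gets set are just illustrative guesses on
my part, not anything that exists in the tree today:

struct scan_control {
	/* ... existing fields ... */
	unsigned long nr_to_reclaim;
	unsigned long nr_reclaimed;

	/*
	 * Set (e.g. where fatal_signal_pending() fires in
	 * shrink_inactive_list()) when the reclaim walk should stop as
	 * soon as possible.
	 */
	bool abort_reclaim;
};

static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
{
	struct mem_cgroup *target_memcg = sc->target_mem_cgroup;
	struct mem_cgroup *memcg;

	memcg = mem_cgroup_iter(target_memcg, NULL, NULL);
	do {
		/*
		 * Bail out via an explicit flag instead of inflating
		 * sc->nr_reclaimed past sc->nr_to_reclaim.
		 */
		if (sc->abort_reclaim) {
			mem_cgroup_iter_break(target_memcg, memcg);
			break;
		}

		/* ... shrink_lruvec(), shrink_slab(), etc. ... */
	} while ((memcg = mem_cgroup_iter(target_memcg, memcg, NULL)));
}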