On Tue, Jun 25, 2024 at 9:21 AM Shakeel Butt <shakeel.butt@xxxxxxxxx> wrote:
>
> On Tue, Jun 25, 2024 at 09:00:03AM GMT, Yosry Ahmed wrote:
> [...]
> >
> > My point is not about accuracy, although I think it's a reasonable
> > argument on its own (a lot of things could change in a short amount of
> > time, which is why I prefer magnitude-based ratelimiting).
> >
> > My point is about logical ordering. If a userspace program reads the
> > stats *after* an event occurs, it expects to get a snapshot of the
> > system state after that event. Two examples are:
> >
> > - A proactive reclaimer reading the stats after a reclaim attempt to
> >   check if it needs to reclaim more memory or fall back.
> > - A userspace OOM killer reading the stats after a usage spike to
> >   decide which workload to kill.
> >
> > I listed such examples with more detail in [1], when I removed
> > stats_flush_ongoing from the memcg code.
> >
> > [1] https://lore.kernel.org/lkml/20231129032154.3710765-6-yosryahmed@xxxxxxxxxx/
>
> You are kind of arbitrarily adding restrictions and rules here. Why not
> follow the rules of a well established and battle tested stats infra
> used by everyone i.e. vmstats? There is no sync flush and there are
> frequent async flushes. I think that is what Jesper wants as well.

That's how the memcg stats worked previously, since before rstat and
until the introduction of stats_flush_ongoing AFAICT. We saw an actual
behavioral change when we were moving from a pre-rstat kernel to a
kernel with stats_flush_ongoing. This was the rationale when I removed
stats_flush_ongoing in [1]. It's not a new argument; I am just
reiterating what we discussed back then.

We saw an actual change in the proactive reclaimer: sometimes the stats
read after a reclaim attempt did not reflect the actual state of the
system. The reclaimer would back off when it shouldn't, because it
thought it didn't reclaim memory when it actually did.
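To make that failure mode concrete, here is a minimal, hypothetical
Python sketch of the back-off decision. The function name, the stat
fields, and the threshold are all illustrative assumptions, not any
real reclaimer's implementation:

```python
# Hypothetical sketch of a proactive reclaimer's back-off decision.
# stats_before/stats_after model snapshots of a memcg's stats (e.g.
# anon/file from memory.stat) taken around a reclaim attempt.

def should_back_off(stats_before, stats_after, attempted_bytes):
    """Back off if the reclaim attempt freed (almost) nothing."""
    reclaimed = (stats_before["anon"] + stats_before["file"]
                 - stats_after["anon"] - stats_after["file"])
    # Illustrative threshold: back off if less than half of the
    # attempted amount was actually reclaimed.
    return reclaimed < attempted_bytes // 2

# Fresh read after reclaim: 64 MiB was actually reclaimed -> keep going.
before = {"anon": 512 << 20, "file": 256 << 20}
fresh = {"anon": 480 << 20, "file": 224 << 20}
print(should_back_off(before, fresh, 64 << 20))   # -> False

# Stale read (flush skipped): stats still show the pre-reclaim state,
# so it looks like nothing was reclaimed and we wrongly back off.
stale = dict(before)
print(should_back_off(before, stale, 64 << 20))   # -> True
```

The second call is exactly the case described above: reclaim succeeded,
but a stale post-reclaim read makes the reclaimer think it failed.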
Upon further investigation, we realized that this could also affect the
userspace OOM killer, because it uses the memcg stats to figure out
which memcg would free the most memory if it were killed (by looking at
the anon stats, among others). If a memory usage spike occurs and we
read stats from before the spike, we may kill the wrong memcg.

So as you said, we can experiment with in-kernel flushers, but let's
keep userspace flushing consistent.

Taking a step back, I just want to clarify that my arguments for the
flushing changes, whether in this patch or in your ratelimiting patch,
come from a purely technical perspective. I am making suggestions that
I believe may be better. I am not trying to stop any progress in this
area or stand in the way. The only thing I really don't want is to
affect userspace flushers as I described above.

[1] https://lore.kernel.org/lkml/20231129032154.3710765-6-yosryahmed@xxxxxxxxxx/
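For completeness, the OOM killer case follows the same pattern. A
hypothetical sketch (the selection policy, memcg names, and sizes are
made up for illustration, not any real OOM killer's logic):

```python
# Hypothetical sketch of a userspace OOM killer's victim selection:
# kill the memcg whose death would free the most anon memory.

def pick_victim(memcg_stats):
    return max(memcg_stats, key=lambda name: memcg_stats[name]["anon"])

# Stats as they looked before a usage spike in workload-b.
stale = {
    "workload-a": {"anon": 2 << 30},  # 2 GiB
    "workload-b": {"anon": 1 << 30},  # 1 GiB
}
# Fresh stats after the spike: workload-b now holds the most anon memory.
fresh = {
    "workload-a": {"anon": 2 << 30},
    "workload-b": {"anon": 4 << 30},  # spiked to 4 GiB
}

print(pick_victim(stale))  # -> workload-a, the wrong memcg
print(pick_victim(fresh))  # -> workload-b, the one that actually spiked
```

If the post-spike read returns pre-spike stats, the killer chooses
workload-a and the spiking memcg survives.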