Re: [PATCH V2] cgroup/rstat: Avoid thundering herd problem by kswapd across NUMA nodes

Yosry Ahmed <yosryahmed@xxxxxxxxxx> · Tue, 25 Jun 2024 14:24:35 -0700

On Tue, Jun 25, 2024 at 2:20 PM Shakeel Butt <shakeel.butt@xxxxxxxxx> wrote:
>
> On Tue, Jun 25, 2024 at 01:45:00PM GMT, Yosry Ahmed wrote:
> > On Tue, Jun 25, 2024 at 9:21 AM Shakeel Butt <shakeel.butt@xxxxxxxxx> wrote:
> > >
> > > On Tue, Jun 25, 2024 at 09:00:03AM GMT, Yosry Ahmed wrote:
> > > [...]
> > > >
> > > > My point is not about accuracy, although I think it's a reasonable
> > > > argument on its own (a lot of things could change in a short amount of
> > > > time, which is why I prefer magnitude-based ratelimiting).
> > > >
> > > > My point is about logical ordering. If a userspace program reads the
> > > > stats *after* an event occurs, it expects to get a snapshot of the
> > > > system state after that event. Two examples are:
> > > >
> > > > - A proactive reclaimer reading the stats after a reclaim attempt to
> > > > > check if it needs to reclaim more memory or fall back.
> > > > - A userspace OOM killer reading the stats after a usage spike to
> > > > decide which workload to kill.
> > > >
> > > > I listed such examples with more detail in [1], when I removed
> > > > stats_flush_ongoing from the memcg code.
> > > >
> > > > > [1] https://lore.kernel.org/lkml/20231129032154.3710765-6-yosryahmed@xxxxxxxxxx/
> > >
> > > You are kind of arbitrarily adding restrictions and rules here. Why not
> > > follow the rules of a well established and battle tested stats infra
> > > used by everyone i.e. vmstats? There is no sync flush and there are
> > > frequent async flushes. I think that is what Jesper wants as well.
> >
> > That's how the memcg stats worked previously since before rstat and
> > until the introduction of stats_flush_ongoing AFAICT. We saw an actual
> > behavioral change when we were moving from a pre-rstat kernel to a
> > kernel with stats_flush_ongoing. This was the rationale when I removed
> > stats_flush_ongoing in [1]. It's not a new argument, I am just
> > reiterating what we discussed back then.
>
> In my reply above, I am not arguing to go back to the older
> stats_flush_ongoing situation. Rather I am discussing what should be the
> best eventual solution. From the vmstats infra, we can learn that with
> frequent async flushes and no sync flush, users are fine with the
> 'non-determinism'. Of course cgroup stats are different from vmstats,
> i.e. they are hierarchical, but I think we can try out this approach and
> see if this works or not.

If we do not do sync flushing, then the same problem that happened
with stats_flush_ongoing could occur again, right? Userspace could
read the stats after an event, and get a snapshot of the system before
that event.
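
To make the ordering concern concrete, here is a toy userspace model (a
sketch only, not kernel code; ToyStats and its methods are hypothetical
names made up for illustration). It assumes updates accumulate in
per-CPU pending deltas and only become visible to readers after a flush:

```python
class ToyStats:
    """Toy model: per-CPU pending deltas plus a flushed, reader-visible total."""

    def __init__(self, ncpus):
        self.pending = [0] * ncpus  # updates not yet aggregated
        self.flushed = 0            # value readers see without a flush

    def update(self, cpu, delta):
        # Cheap per-CPU update; nothing is aggregated here.
        self.pending[cpu] += delta

    def flush(self):
        # Fold all pending per-CPU deltas into the visible total.
        self.flushed += sum(self.pending)
        self.pending = [0] * len(self.pending)

    def read(self, sync_flush):
        # With sync_flush the reader flushes first; otherwise it relies
        # on whatever the last (async) flush left behind.
        if sync_flush:
            self.flush()
        return self.flushed


stats = ToyStats(ncpus=2)
stats.update(0, 100)   # usage charged before the event
stats.flush()          # stands in for a periodic async flush
stats.update(1, -40)   # the "event": a reclaim attempt frees 40 units

stale = stats.read(sync_flush=False)  # pre-event snapshot
fresh = stats.read(sync_flush=True)   # post-event snapshot
```

In this model the read without a sync flush still reports the pre-event
value, which is the situation a proactive reclaimer or userspace OOM
killer would run into.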

Perhaps this is fine for vmstats if it has always been like that (I
have no idea), or if no users make assumptions about this. But for
cgroup stats, we have use cases that rely on this behavior.

>
> BTW it seems like this topic should be discussed
> face-to-face over vc or LPC. What do you folks think?

I am not going to be at LPC, but I am happy to discuss this over VC.