On Thu, Sep 14, 2023 at 04:30:56PM -0700, Yosry Ahmed wrote:
[...]
> >
> > I think first you need to show if this (2 sec stale stats) is really a
> > problem.
>
> That's the thing, my main concern is that if this causes a problem, we
> probably won't be able to tell it was because of stale stats. It's
> very hard to make that connection.
>

Please articulate what the problem would look like. You did that in the
use-cases description below, so let's discuss there.

> Pre-rstat, reading stats would always yield fresh stats (as much as
> possible). Now the stats can be up to 2s stale, and we don't really
> know how this will affect our existing workloads.
>

Pre-rstat, the stat read would traverse the memcg tree. With rstat, the
tradeoff was made between an expensive read and staleness. Yeah, there
might still be a memcg update tree traversal, which I would like to
remove completely. However, you are saying to [...]

> >
> > I don't see why userspace OOM killing and proactive reclaim need
> > sub-second accuracy. Please explain.
>
> For proactive reclaim it is not about sub-second accuracy. It is about
> doing the reclaim and then reading the stats immediately to see the
> effect. Naturally one would expect that a stat read after reclaim
> would show the system state after reclaim.
>
> For userspace OOM killing I am not really sure. It depends on how
> dynamic the workload is. If a task recently had a spike in memory
> usage causing a threshold to be hit, userspace can kill a different
> task if the stats are stale.
>

Please add the above reasoning to your commit message (I am not
convinced, but let's leave it at that).

> I think the whole point is *not* about the amount of staleness. It is
> more that you expect a stats read after an event to reflect the
> system state after the event.

The whole point is to understand the tradeoff between accuracy and the
cost of accuracy. I don't think you want to pay the cost of strong
consistency/ordering between stats reading and an event. My worry is
that you are enforcing a tradeoff which *might* only be applicable to
your use-cases. Anyway, this is not something that cannot be changed
later. (The reclaim-then-read pattern you describe is sketched at the
end of this mail.)

>
> > Same for system overhead but I can
> > see the complication of two different sources for stats. Can you provide
> > the formula of system overhead? I am wondering why you need to read
> > stats from memory.stat files. Why wouldn't the memory.current of top
> > level cgroups and /proc/meminfo be enough? Something like:
> >
> > Overhead = MemTotal - MemFree - SumOfTopCgroups(memory.current)
>
> We use the amount of compressed memory in zswap from memory.stat,
> which is not accounted as memory usage in cgroup v1.
>

There are zswap stats in /proc/meminfo. Will those work for you? (See
the second sketch at the end of this mail for what I have in mind.)

[...]
> >
> > Fix the in-kernel flushers separately.
>
> The in-kernel flushers are basically facing the same problem. For
> instance, reclaim would expect a stats read after a reclaim iteration
> to reflect the system state after the reclaim iteration.
>

I have not seen any complaints about memory reclaim recently. Maybe
reclaim does not really need such accuracy :P

> > Also the problem Cloudflare is facing does not need to be tied with this.
>
> When we try to wait for flushing to complete, we run into the same
> latency problem as the root flush.

Not sure what waiting for flushing has to do with Cloudflare's report.
They are ok with no sync flushing at stat read at all.
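
To be concrete about the reclaim-then-read expectation discussed above,
here is a rough Python sketch of the userspace pattern. The cgroup path
and the 64M reclaim target are hypothetical, and memory.reclaim needs
cgroup v2 on a recent kernel; this is an illustration, not anyone's
actual tooling:

  # Trigger proactive reclaim, then immediately read memory.stat to
  # observe the effect. With stats flushed only periodically (~2s),
  # `after` can still equal `before` even though reclaim succeeded,
  # because the per-CPU deltas have not been propagated to the
  # cgroup's totals yet. That is the staleness being discussed.

  CG = "/sys/fs/cgroup/workload"  # hypothetical cgroup

  def read_stat(name):
      with open(CG + "/memory.stat") as f:
          for line in f:
              key, val = line.split()
              if key == name:
                  return int(val)
      return 0

  before = read_stat("anon")

  with open(CG + "/memory.reclaim", "w") as f:
      f.write(str(64 * 1024 * 1024))  # ask the kernel to reclaim ~64M

  after = read_stat("anon")
  print("anon before: %d, after: %d" % (before, after))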
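
And here is roughly what I mean by computing overhead from /proc/meminfo
plus top-level memory.current. Again just an illustrative Python sketch;
the cgroup v2 mount point and the decision to subtract the compressed
pool from overhead are my assumptions about the setup:

  # Sketch of:
  #   Overhead = MemTotal - MemFree - SumOfTopCgroups(memory.current)
  # with the compressed zswap pool excluded via the "Zswap" field
  # that newer kernels export in /proc/meminfo.
  import os

  def meminfo_bytes(field):
      # /proc/meminfo lines look like "MemTotal:   32768 kB".
      with open("/proc/meminfo") as f:
          for line in f:
              if line.startswith(field + ":"):
                  return int(line.split()[1]) * 1024
      return 0  # field absent on older kernels

  def top_level_usage(root="/sys/fs/cgroup"):
      # Sum memory.current over every top-level cgroup.
      total = 0
      for entry in os.scandir(root):
          cur = os.path.join(entry.path, "memory.current")
          if entry.is_dir() and os.path.exists(cur):
              with open(cur) as f:
                  total += int(f.read())
      return total

  overhead = (meminfo_bytes("MemTotal") - meminfo_bytes("MemFree")
              - top_level_usage() - meminfo_bytes("Zswap"))
  print("system overhead: %.1f MiB" % (overhead / (1 << 20)))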