Re: [PATCH 3/3] mm: memcg: optimize stats flushing for latency and accuracy

Shakeel Butt <shakeelb@xxxxxxxxxx> · Thu, 14 Sep 2023 22:58:44 +0000

On Thu, Sep 14, 2023 at 10:56:52AM -0700, Yosry Ahmed wrote:
[...]
> >
> > 1. How much delayed/stale stats have you observed on real world workload?
> 
> I am not really sure. We don't have a wide deployment of kernels with
> rstat yet. These are problems observed in testing and/or concerns
> expressed by our userspace team.
> 

Why is sleep(2) not good enough for the tests?
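
To make the suggestion concrete, here is a rough sketch of a test helper
that waits out the 2 second flush period before reading memory.stat. The
helper names, the 0.5 second margin and the injectable read_stat/sleep
callables are my assumptions for illustration, not something from your
test suite.

```python
import time

# Periodic flush interval discussed in this thread.
FLUSH_PERIOD_SEC = 2.0


def parse_memory_stat(text):
    """Parse memory.stat content ("key value" per line) into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        if value:
            stats[key] = int(value)
    return stats


def read_stat_after_flush(read_stat, sleep=time.sleep, margin=0.5):
    """Sleep past one flush period, then read and parse the stats.

    read_stat is a hypothetical callable returning the text of a cgroup's
    memory.stat file; margin is an assumed safety margin in seconds.
    """
    sleep(FLUSH_PERIOD_SEC + margin)
    return parse_memory_stat(read_stat())
```

In a real test, read_stat would be something like
lambda: open("/sys/fs/cgroup/<group>/memory.stat").read(), with the path
adjusted to wherever the test cgroup lives.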

> I am trying to solve this now because any problems that result from
> this staleness will be very hard to debug and link back to stale
> stats.
> 

I think you first need to show that this (stats up to 2 seconds stale)
is really a problem.

> >
> > 2. What is acceptable staleness in the stats for your use-case?
> 
> Again, unfortunately I am not sure, but right now it can be O(seconds)
> which is not acceptable as we have workloads querying the stats every
> 1s (and sometimes more frequently).
> 

It is 2 seconds in most cases, and if it is higher, the system is already
in bad shape. O(seconds) makes it sound more dramatic than it is. So, why
is 2 seconds of staleness not acceptable? Is 1 second acceptable? Or
500 msec? Let's look at the use-cases below.

> >
> > 3. What is your use-case?
> 
> A few use cases we have that may be affected by this:
> - System overhead: calculations using memory.usage and some stats from
> memory.stat. If one of them is fresh and the other one isn't we have
> an inconsistent view of the system.
> - Userspace OOM killing: We use some stats in memory.stat to gauge the
> amount of memory that will be freed by killing a task as sometimes
> memory.usage includes shared resources that wouldn't be freed anyway.
> - Proactive reclaim: we read memory.stat in a proactive reclaim
> feedback loop, stale stats may cause us to mistakenly think reclaim is
> ineffective and prematurely stop.
> 

I don't see why userspace OOM killing and proactive reclaim need
subsecond accuracy. Please explain. Same for system overhead, though I
can see the complication of having two different sources for stats. Can
you provide the formula for system overhead? I am wondering why you need
to read stats from the memory.stat files at all. Why would the
memory.current of the top-level cgroups plus /proc/meminfo not be enough?
Something like:

Overhead = MemTotal - MemFree - SumOfTopCgroups(memory.current)
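
As a rough userspace sketch of that formula, assuming cgroup v2 mounted
at /sys/fs/cgroup (the function names and the mount point are my
assumptions; adjust for your setup):

```python
from pathlib import Path


def parse_meminfo_kb(text):
    """Parse /proc/meminfo content into {field name: numeric value}.

    Memory fields such as MemTotal and MemFree are reported in kB.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts:
            fields[name.strip()] = int(parts[0])
    return fields


def read_top_level_current(cgroup_root="/sys/fs/cgroup"):
    """Return memory.current (bytes) for each top-level cgroup."""
    root = Path(cgroup_root)
    return [int(p.read_text()) for p in root.glob("*/memory.current")]


def overhead_bytes(meminfo_text, top_level_current_bytes):
    """Overhead = MemTotal - MemFree - sum of top-level memory.current."""
    info = parse_meminfo_kb(meminfo_text)
    used_bytes = (info["MemTotal"] - info["MemFree"]) * 1024
    return used_bytes - sum(top_level_current_bytes)
```

It would be called as
overhead_bytes(Path("/proc/meminfo").read_text(), read_top_level_current()).
Note that nothing here reads memory.stat.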

> >
> > I know I am going back on some of the previous agreements but this
> > whole locking back and forth has called into question the original
> > motivation.
> 
> That's okay. Taking a step back, having flushing being indeterministic

I would say at most 2 seconds stale instead of indeterministic.

> in this way is a time bomb in my opinion. Note that this also affects
> in-kernel flushers like reclaim or dirty isolation

Fix the in-kernel flushers separately. Also, the problem Cloudflare is
facing does not need to be tied to this.