On Thu, Sep 14, 2023 at 04:30:56PM -0700, Yosry Ahmed wrote:
[...]
> >
> > I think first you need to show if this (2 sec stale stats) is really a
> > problem.
>
> That's the thing, my main concern is that if this causes a problem, we
> probably won't be able to tell it was because of stale stats. It's
> very hard to make that connection.
>

Please articulate what the problem would look like. You did that in the
use-cases description below, so let's discuss there.

> Pre-rstat, reading stats would always yield fresh stats (as much as
> possible). Now the stats can be up to 2s stale, and we don't really
> know how this will affect our existing workloads.
>

Pre-rstat, the stat read would traverse the memcg tree. With rstat, the
tradeoff was made between an expensive read and staleness. Yeah, there
might still be a memcg update tree traversal, which I would like to
remove completely. However, you are saying to [...]

> >
> > I don't see why userspace OOM killing and proactive reclaim need
> > sub-second accuracy. Please explain.
>
> For proactive reclaim it is not about sub-second accuracy. It is about
> doing the reclaim and then reading the stats immediately to see the
> effect. Naturally one would expect that a stat read after reclaim
> would show the system state after reclaim.
>
> For userspace OOM killing I am not really sure. It depends on how
> dynamic the workload is. If a task recently had a spike in memory
> usage causing a threshold to be hit, userspace can kill a different
> task if the stats are stale.
>

Please add the above reasoning to your commit message (I am not
convinced, but let's leave it at that).

> I think the whole point is *not* about the amount of staleness. It is
> more that you expect a stats read after an event to reflect the
> system state after the event.

The whole point is to understand the tradeoff between accuracy and the
cost of accuracy. I don't think you want to pay the cost of strong
consistency/ordering between stats reading and an event. My worry is
that you are enforcing a tradeoff which *might* only be applicable to
your use-cases. Anyway, this is not something that cannot be changed
later. (The reclaim-then-read pattern you describe is sketched at the
end of this mail.)

>
> > Same for system overhead but I can
> > see the complication of two different sources for stats. Can you provide
> > the formula of system overhead? I am wondering why you need to read
> > stats from memory.stat files. Why wouldn't the memory.current of top
> > level cgroups and /proc/meminfo be enough? Something like:
> >
> > Overhead = MemTotal - MemFree - SumOfTopCgroups(memory.current)
>
> We use the amount of compressed memory in zswap from memory.stat,
> which is not accounted as memory usage in cgroup v1.
>

There are zswap stats in /proc/meminfo. Will those work for you? (See
the second sketch at the end of this mail for what I have in mind.)

[...]
> >
> > Fix the in-kernel flushers separately.
>
> The in-kernel flushers are basically facing the same problem. For
> instance, reclaim would expect a stats read after a reclaim iteration
> to reflect the system state after the reclaim iteration.
>

I have not seen any complaints about memory reclaim recently. Maybe
reclaim does not really need such accuracy :P

> > Also the problem Cloudflare is facing does not need to be tied with this.
>
> When we try to wait for flushing to complete, we run into the same
> latency problem as the root flush.

Not sure what waiting for flushing has to do with Cloudflare's report.
They are ok with no sync flushing at stat read at all.
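
To be concrete about the reclaim-then-read expectation discussed above,
here is a rough Python sketch of the userspace pattern. The cgroup path
and the 64M reclaim target are hypothetical, and memory.reclaim needs
cgroup v2 on a recent kernel; this is an illustration, not anyone's
actual tooling:

  # Trigger proactive reclaim, then immediately read memory.stat to
  # observe the effect. With stats flushed only periodically (~2s),
  # `after` can still equal `before` even though reclaim succeeded,
  # because the per-CPU deltas have not been propagated to the
  # cgroup's totals yet. That is the staleness being discussed.

  CG = "/sys/fs/cgroup/workload"  # hypothetical cgroup

  def read_stat(name):
      with open(CG + "/memory.stat") as f:
          for line in f:
              key, val = line.split()
              if key == name:
                  return int(val)
      return 0

  before = read_stat("anon")

  with open(CG + "/memory.reclaim", "w") as f:
      f.write(str(64 * 1024 * 1024))  # ask the kernel to reclaim ~64M

  after = read_stat("anon")
  print("anon before: %d, after: %d" % (before, after))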
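
And here is roughly what I mean by computing overhead from /proc/meminfo
plus top-level memory.current. Again just an illustrative Python sketch;
the cgroup v2 mount point and the decision to subtract the compressed
pool from overhead are my assumptions about the setup:

  # Sketch of:
  #   Overhead = MemTotal - MemFree - SumOfTopCgroups(memory.current)
  # with the compressed zswap pool excluded via the "Zswap" field
  # that newer kernels export in /proc/meminfo.
  import os

  def meminfo_bytes(field):
      # /proc/meminfo lines look like "MemTotal:   32768 kB".
      with open("/proc/meminfo") as f:
          for line in f:
              if line.startswith(field + ":"):
                  return int(line.split()[1]) * 1024
      return 0  # field absent on older kernels

  def top_level_usage(root="/sys/fs/cgroup"):
      # Sum memory.current over every top-level cgroup.
      total = 0
      for entry in os.scandir(root):
          cur = os.path.join(entry.path, "memory.current")
          if entry.is_dir() and os.path.exists(cur):
              with open(cur) as f:
                  total += int(f.read())
      return total

  overhead = (meminfo_bytes("MemTotal") - meminfo_bytes("MemFree")
              - top_level_usage() - meminfo_bytes("Zswap"))
  print("system overhead: %.1f MiB" % (overhead / (1 << 20)))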