On Thu, Sep 14, 2023 at 6:01 PM Shakeel Butt <shakeelb@xxxxxxxxxx> wrote:
>
> On Thu, Sep 14, 2023 at 04:30:56PM -0700, Yosry Ahmed wrote:
> [...]
> > >
> > > I think first you need to show if this (2 sec stale stats) is really a
> > > problem.
> >
> > That's the thing, my main concern is that if this causes a problem, we
> > probably won't be able to tell it was because of stale stats. It's
> > very hard to make that connection.
> >
>
> Please articulate what the problem would look like, which you did in the
> use-cases description below; let's discuss there.
>
> > Pre-rstat, reading stats would always yield fresh stats (as much as
> > possible). Now the stats can be up to 2s stale, and we don't really
> > know how this will affect our existing workloads.
> >
>
> Pre-rstat the stat read would traverse the memcg tree. With rstat
> the tradeoff was made between expensive reads and staleness. Yeah there
> might still be memcg update tree traversal which I would like to remove
> completely. However you are saying to

I think this sentence is truncated.

> > [...]
> > >
> > > I don't see why userspace OOM killing and proactive reclaim need
> > > subsecond accuracy. Please explain.
> >
> > For proactive reclaim it is not about sub-second accuracy. It is about
> > doing the reclaim then reading the stats immediately to see the
> > effect. Naturally one would expect that a stat read after reclaim
> > would show the system state after reclaim.
> >
> > For userspace OOM killing I am not really sure. It depends on how
> > dynamic the workload is. If a task recently had a spike in memory
> > usage causing a threshold to be hit, userspace can kill a different
> > task if the stats are stale.
> >
>
> Please add the above reasoning to your commit message (though I am not
> convinced, but let's leave it at that).

Will do in the next version, thanks.

> > I think the whole point is *not* about the amount of staleness. It is
> > more about that you expect a stats read after an event to reflect the
> > system state after the event.
>
> The whole point is to understand the tradeoff between accuracy and the cost
> of accuracy. I don't think you want to pay the cost of strong
> consistency/ordering between stats reading and an event. My worry is
> that you are enforcing a tradeoff which *might* be just applicable to
> your use-cases. Anyway, this is not something that cannot be changed
> later.

Given the numbers I got with the patch, it doesn't seem like we are paying
a significant cost for the accuracy. Anyway, as you say, it's not something
that cannot be changed. In fact, I have another proposal that I am currently
testing; please see my next response to Johannes.

> > >
> > > Same for system overhead, but I can see the complication of two
> > > different sources for stats. Can you provide the formula of system
> > > overhead? I am wondering why you need to read stats from memory.stat
> > > files. Why wouldn't the memory.current of top level cgroups and
> > > /proc/meminfo be enough? Something like:
> > >
> > > Overhead = MemTotal - MemFree - SumOfTopCgroups(memory.current)
> >
> > We use the amount of compressed memory in zswap from memory.stat,
> > which is not accounted as memory usage in cgroup v1.
> >
>
> There are zswap stats in /proc/meminfo. Will those work for you?

Yeah this should work for this specific use case, thanks.
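For reference, here is a rough sketch of how that overhead formula could be
computed in userspace. This is illustration only, not something agreed in this
thread: the /sys/fs/cgroup path and cgroup v2-style memory.current files are
assumptions (v1 would use memory.usage_in_bytes), as is using the "Zswap"
field that newer kernels expose in /proc/meminfo.

#!/usr/bin/env python3
# Sketch of: Overhead = MemTotal - MemFree - SumOfTopCgroups(memory.current),
# with the zswap pool size taken from /proc/meminfo instead of memory.stat.
import os

def meminfo_kb(field):
    """Return a /proc/meminfo field value in kB (e.g. 'MemTotal')."""
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith(field + ":"):
                return int(line.split()[1])
    raise KeyError(field)

def top_cgroup_usage_bytes(root="/sys/fs/cgroup"):
    """Sum memory.current over the top-level cgroups (cgroup v2 layout assumed)."""
    total = 0
    for entry in os.scandir(root):
        current = os.path.join(entry.path, "memory.current")
        if entry.is_dir() and os.path.exists(current):
            with open(current) as f:
                total += int(f.read())
    return total

mem_total = meminfo_kb("MemTotal") * 1024
mem_free = meminfo_kb("MemFree") * 1024
zswap = meminfo_kb("Zswap") * 1024  # compressed pool size; assumes a kernel that reports it

overhead = mem_total - mem_free - top_cgroup_usage_bytes()
print(f"overhead: {overhead} bytes (of which zswap pool: {zswap} bytes)")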
> > [...]
> > > Fix the in-kernel flushers separately.
> >
> > The in-kernel flushers are basically facing the same problem. For
> > instance, reclaim would expect a stats read after a reclaim iteration
> > to reflect the system state after the reclaim iteration.
> >
>
> I have not seen any complaints about memory reclaim recently. Maybe
> reclaim does not really need such accuracy :P

Perhaps, it's full of heuristics anyway :)

>
> > > Also the problem Cloudflare is facing does not need to be tied with this.
> >
> > When we try to wait for flushing to complete we run into the same
> > latency problem of the root flush.
>
> Not sure what waiting for flushing has to do with Cloudflare's report. They
> are ok with no sync flushing at all on stat reads.

Oh, I am not saying the wait benefits their use case. I am saying that when
the wait is implemented, we face the same problem (an expensive flush of the
entire hierarchy), so we need to mitigate it anyway -- hence the relevance to
Cloudflare's use case. Anyway, I have an alternative that I will propose
shortly in response to Johannes's reply.
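To make the "stat read after an event" expectation concrete, this is roughly
the proactive reclaim pattern from userspace's point of view. A sketch only:
the cgroup path is made up, and it assumes cgroup v2 with the memory.reclaim
interface; it is not code from this series.

#!/usr/bin/env python3
# Trigger reclaim on a cgroup, then immediately read memory.stat and expect
# it to reflect the post-reclaim state. With periodically flushed stats the
# second read may still show pre-reclaim numbers.
CGROUP = "/sys/fs/cgroup/workload"  # hypothetical cgroup

def read_stat(name):
    """Read one counter from memory.stat (cgroup v2, values in bytes)."""
    with open(f"{CGROUP}/memory.stat") as f:
        for line in f:
            key, value = line.split()
            if key == name:
                return int(value)
    raise KeyError(name)

before = read_stat("anon")
try:
    # Ask the kernel to reclaim up to 64M from this cgroup.
    with open(f"{CGROUP}/memory.reclaim", "w") as f:
        f.write(str(64 * 1024 * 1024))
except OSError:
    pass  # the write can fail if the full amount could not be reclaimed
after = read_stat("anon")

print(f"anon before: {before}, after: {after}, delta: {before - after}")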