On Thu, Sep 14, 2023 at 10:56:52AM -0700, Yosry Ahmed wrote:
[...]
> >
> > 1. How much delayed/stale stats have you observed on real world workload?
>
> I am not really sure. We don't have a wide deployment of kernels with
> rstat yet. These are problems observed in testing and/or concerns
> expressed by our userspace team.
>

Why is sleep(2) not good enough for the tests?

> I am trying to solve this now because any problems that result from
> this staleness will be very hard to debug and link back to stale
> stats.
>

I think you first need to show whether this (2 sec stale stats) is
really a problem.

> >
> > 2. What is acceptable staleness in the stats for your use-case?
>
> Again, unfortunately I am not sure, but right now it can be O(seconds)
> which is not acceptable as we have workloads querying the stats every
> 1s (and sometimes more frequently).
>

It is 2 seconds in most cases, and if it is higher, the system is
already in bad shape. O(seconds) sounds more dramatic than that. So, why
is 2 seconds of staleness not acceptable? Is 1 second acceptable? Or
500 msec? Let's look at the use-cases below.

> >
> > 3. What is your use-case?
>
> A few use cases we have that may be affected by this:
> - System overhead: calculations using memory.usage and some stats from
> memory.stat. If one of them is fresh and the other one isn't we have
> an inconsistent view of the system.
> - Userspace OOM killing: We use some stats in memory.stat to gauge the
> amount of memory that will be freed by killing a task as sometimes
> memory.usage includes shared resources that wouldn't be freed anyway.
> - Proactive reclaim: we read memory.stat in a proactive reclaim
> feedback loop, stale stats may cause us to mistakenly think reclaim is
> ineffective and prematurely stop.
>

I don't see why userspace OOM killing and proactive reclaim need
subsecond accuracy. Please explain. The same goes for system overhead,
though I can see the complication of having two different sources for
the stats.
Can you provide the formula for system overhead? I am wondering why you
need to read stats from the memory.stat files. Why wouldn't the
memory.current of the top level cgroups and /proc/meminfo be enough?
Something like:

Overhead = MemTotal - MemFree - SumOfTopCgroups(memory.current)

> >
> > I know I am going back on some of the previous agreements but this
> > whole locking back and forth has made in question the original
> > motivation.
>
> That's okay. Taking a step back, having flushing being indeterministic

I would say at most 2 seconds stale instead of indeterministic.

> in this way is a time bomb in my opinion. Note that this also affects
> in-kernel flushers like reclaim or dirty isolation

Fix the in-kernel flushers separately. Also, the problem Cloudflare is
facing does not need to be tied to this.
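
For reference, the overhead formula suggested above (MemTotal - MemFree - SumOfTopCgroups(memory.current)) could be computed from userspace roughly as follows. This is only a minimal sketch: the function names are hypothetical, it assumes a cgroup v2 hierarchy (typically mounted at /sys/fs/cgroup) where each top level cgroup exposes memory.current in bytes, and it is not necessarily the calculation the userspace team uses today.

```python
import os


def parse_meminfo_kb(text):
    """Parse /proc/meminfo-style text into a {field: value} dict.

    Most fields are reported in kB; the unit suffix is ignored here.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])
    return fields


def sum_top_cgroups_bytes(cgroup_root):
    """Sum memory.current (bytes) over the direct children of cgroup_root."""
    total = 0
    for entry in os.scandir(cgroup_root):
        if not entry.is_dir():
            continue
        path = os.path.join(entry.path, "memory.current")
        try:
            with open(path) as f:
                total += int(f.read().strip())
        except (OSError, ValueError):
            # Skip cgroups without a readable memory.current file
            # (e.g. memory controller not enabled for that child).
            continue
    return total


def system_overhead_bytes(meminfo_text, cgroup_root):
    """Overhead = MemTotal - MemFree - SumOfTopCgroups(memory.current)."""
    info = parse_meminfo_kb(meminfo_text)
    used_bytes = (info["MemTotal"] - info["MemFree"]) * 1024
    return used_bytes - sum_top_cgroups_bytes(cgroup_root)
```

On a live system this would be called with the contents of /proc/meminfo and the cgroup root, e.g. system_overhead_bytes(open("/proc/meminfo").read(), "/sys/fs/cgroup"). The two sources are still read at different instants, so the result is approximate.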