On Thu, Sep 14, 2023 at 10:36 AM Shakeel Butt <shakeelb@xxxxxxxxxx> wrote:
>
> On Wed, Sep 13, 2023 at 12:38 AM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
> >
> > Stats flushing for memcg currently follows these rules:
> > - Always flush the entire memcg hierarchy (i.e. flush the root).
> > - Only one flusher is allowed at a time. If someone else tries to flush
> >   concurrently, they skip and return immediately.
> > - A periodic flusher flushes all the stats every 2 seconds.
> >
> > The reason this approach is followed is that all flushes are
> > serialized by a global rstat spinlock. On the memcg side, flushing is
> > invoked from userspace reads as well as in-kernel flushers (e.g.
> > reclaim, refault, etc). This approach aims to avoid serializing all
> > flushers on the global lock, which can cause a significant performance
> > hit under high concurrency.
> >
> > This approach has the following problems:
> > - Occasionally a userspace read of the stats of a non-root cgroup will
> >   be too expensive as it has to flush the entire hierarchy [1].
>
> This is a real world workload exhibiting the issue, which is good.
>
> > - Sometimes the stats accuracy is compromised if there is an ongoing
> >   flush, and we skip and return before the subtree of interest is
> >   actually flushed. This is more visible when reading stats from
> >   userspace, but can also affect in-kernel flushers.
>
> Please provide similar data/justification for the above. In addition:
>
> 1. How much delayed/stale stats have you observed on real world workload?

I am not really sure. We don't have a wide deployment of kernels with
rstat yet. These are problems observed in testing and/or concerns
expressed by our userspace team.

I am trying to solve this now because any problems that result from this
staleness will be very hard to debug and link back to stale stats.

>
> 2. What is acceptable staleness in the stats for your use-case?

Again, unfortunately I am not sure, but right now it can be O(seconds),
which is not acceptable as we have workloads querying the stats every 1s
(and sometimes more frequently).

>
> 3. What is your use-case?

A few use cases we have that may be affected by this:

- System overhead: calculations using memory.usage and some stats from
  memory.stat. If one of them is fresh and the other one isn't, we have an
  inconsistent view of the system.

- Userspace OOM killing: we use some stats in memory.stat to gauge the
  amount of memory that will be freed by killing a task, as sometimes
  memory.usage includes shared resources that wouldn't be freed anyway.

- Proactive reclaim: we read memory.stat in a proactive reclaim feedback
  loop; stale stats may cause us to mistakenly think reclaim is
  ineffective and prematurely stop.

>
> 4. Does your use-case care about staleness of all the stats in
> memory.stat or some specific stats?

We have multiple use cases that can be affected by this, so I don't think
there are some specific stats. I am also not aware of all possibly
affected use cases.

>
> 5. If some specific stats in memory.stat, does it make sense to
> decouple them from rstat and just pay the price up front to maintain
> them accurately?
>
> Most importantly please please please be concise in your responses.

I try; sometimes I am not sure how much detail is needed. Sorry about that :)

>
> I know I am going back on some of the previous agreements but this
> whole locking back and forth has made me question the original
> motivation.

That's okay.
Taking a step back, having flushing be nondeterministic in this way is a
time bomb in my opinion. Note that this also affects in-kernel flushers
like reclaim or dirty isolation, which I suspect will be more affected by
staleness. No one has complained yet AFAICT, but the worst part is that
if/when a problem does happen, we won't be able to easily tell what went
wrong.
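To make the skip behavior concrete, below is a minimal userspace sketch of
the "one flusher at a time, everyone else skips" pattern described at the
top of the thread. The names (flush_ongoing, flush_stats, flush_whole_tree)
are made up for illustration; they are not the actual kernel symbols, and
this is not the real locking, just the general shape of it:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* Set while some flusher is walking the whole tree. */
static atomic_bool flush_ongoing;

/* Stand-in for flushing the entire hierarchy from the root. */
static void flush_whole_tree(void)
{
	printf("flushing the entire tree from the root\n");
}

/*
 * Only one flusher runs at a time. A concurrent caller returns
 * immediately, possibly before the subtree it cares about has actually
 * been flushed -- which is the staleness problem discussed above.
 */
static void flush_stats(void)
{
	bool expected = false;

	if (!atomic_compare_exchange_strong(&flush_ongoing, &expected, true))
		return;	/* someone else is already flushing; skip */

	flush_whole_tree();

	atomic_store(&flush_ongoing, false);
}

int main(void)
{
	flush_stats();	/* first caller does the flush */
	flush_stats();	/* a racing caller would just skip */
	return 0;
}

The same shape applies to the periodic (every 2s) flusher and to in-kernel
callers like reclaim: whoever loses the race just proceeds with whatever
stats happen to be there at the time.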