On Tue, Jun 25, 2024 at 9:21 AM Shakeel Butt <shakeel.butt@xxxxxxxxx> wrote:
>
> On Tue, Jun 25, 2024 at 09:00:03AM GMT, Yosry Ahmed wrote:
> [...]
> >
> > My point is not about accuracy, although I think it's a reasonable
> > argument on its own (a lot of things could change in a short amount of
> > time, which is why I prefer magnitude-based ratelimiting).
> >
> > My point is about logical ordering. If a userspace program reads the
> > stats *after* an event occurs, it expects to get a snapshot of the
> > system state after that event. Two examples are:
> >
> > - A proactive reclaimer reading the stats after a reclaim attempt to
> >   check if it needs to reclaim more memory or fall back.
> > - A userspace OOM killer reading the stats after a usage spike to
> >   decide which workload to kill.
> >
> > I listed such examples with more detail in [1], when I removed
> > stats_flush_ongoing from the memcg code.
> >
> > [1] https://lore.kernel.org/lkml/20231129032154.3710765-6-yosryahmed@xxxxxxxxxx/
>
> You are kind of arbitrarily adding restrictions and rules here. Why not
> follow the rules of a well established and battle tested stats infra
> used by everyone i.e. vmstats? There is no sync flush and there are
> frequent async flushes. I think that is what Jesper wants as well.

That's how the memcg stats worked previously, since before rstat and
until the introduction of stats_flush_ongoing AFAICT. We saw an actual
behavioral change when we were moving from a pre-rstat kernel to a
kernel with stats_flush_ongoing. This was the rationale when I removed
stats_flush_ongoing in [1]. It's not a new argument; I am just
reiterating what we discussed back then.

We saw an actual change in the proactive reclaimer: sometimes the stats
read after a reclaim attempt did not reflect the actual state of the
system. The reclaimer would back off when it shouldn't, because it
thought it didn't reclaim memory when it actually did.
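To make that failure mode concrete, here is a minimal, hypothetical
Python sketch of the back-off decision. The function name, the stat
fields, and the threshold are all illustrative assumptions, not any
real reclaimer's implementation:

```python
# Hypothetical sketch of a proactive reclaimer's back-off decision.
# stats_before/stats_after model snapshots of a memcg's stats (e.g.
# anon/file from memory.stat) taken around a reclaim attempt.

def should_back_off(stats_before, stats_after, attempted_bytes):
    """Back off if the reclaim attempt freed (almost) nothing."""
    reclaimed = (stats_before["anon"] + stats_before["file"]
                 - stats_after["anon"] - stats_after["file"])
    # Illustrative threshold: back off if less than half of the
    # attempted amount was actually reclaimed.
    return reclaimed < attempted_bytes // 2

# Fresh read after reclaim: 64 MiB was actually reclaimed -> keep going.
before = {"anon": 512 << 20, "file": 256 << 20}
fresh = {"anon": 480 << 20, "file": 224 << 20}
print(should_back_off(before, fresh, 64 << 20))   # -> False

# Stale read (flush skipped): stats still show the pre-reclaim state,
# so it looks like nothing was reclaimed and we wrongly back off.
stale = dict(before)
print(should_back_off(before, stale, 64 << 20))   # -> True
```

The second call is exactly the case described above: reclaim succeeded,
but a stale post-reclaim read makes the reclaimer think it failed.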
Upon further investigation, we realized that this could also affect the
userspace OOM killer, because it uses the memcg stats to figure out
which memcg would free the most memory if it were killed (by looking at
the anon stats, among others). If a memory usage spike occurs and we
read stats from before the spike, we may kill the wrong memcg.

So as you said, we can experiment with in-kernel flushers, but let's
keep userspace flushing consistent.

Taking a step back, I just want to clarify that my arguments for the
flushing changes, whether in this patch or in your ratelimiting patch,
come from a purely technical perspective. I am making suggestions that
I believe may be better. I am not trying to stop any progress in this
area or stand in the way. The only thing I really don't want is to
affect userspace flushers as I described above.

[1] https://lore.kernel.org/lkml/20231129032154.3710765-6-yosryahmed@xxxxxxxxxx/
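For completeness, the OOM killer case follows the same pattern. A
hypothetical sketch (the selection policy, memcg names, and sizes are
made up for illustration, not any real OOM killer's logic):

```python
# Hypothetical sketch of a userspace OOM killer's victim selection:
# kill the memcg whose death would free the most anon memory.

def pick_victim(memcg_stats):
    return max(memcg_stats, key=lambda name: memcg_stats[name]["anon"])

# Stats as they looked before a usage spike in workload-b.
stale = {
    "workload-a": {"anon": 2 << 30},  # 2 GiB
    "workload-b": {"anon": 1 << 30},  # 1 GiB
}
# Fresh stats after the spike: workload-b now holds the most anon memory.
fresh = {
    "workload-a": {"anon": 2 << 30},
    "workload-b": {"anon": 4 << 30},  # spiked to 4 GiB
}

print(pick_victim(stale))  # -> workload-a, the wrong memcg
print(pick_victim(fresh))  # -> workload-b, the one that actually spiked
```

If the post-spike read returns pre-spike stats, the killer chooses
workload-a and the spiking memcg survives.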