[..]
> >
> > Basically, I prefer that we don't skip flushing at all and keep
> > userspace and in-kernel users the same. We can use completions to make
> > other overlapping flushers sleep instead of spin on the lock.
> >
>
> I think there are good reasons for skipping flushes for userspace when
> reading these stats. More below.
>
> I'm looking at kernel code to spot cases where the flush MUST be
> completed before returning. There are clearly cases where we don't need
> 100% accurate stats, evident by mem_cgroup_flush_stats_ratelimited() and
> mem_cgroup_flush_stats() that use memcg_vmstats_needs_flush().
>
> The cgroup_rstat_exit() call seems to depend on cgroup_rstat_flush()
> being strict/accurate, because it needs to free the percpu resources.

Yeah, I think this one cannot be skipped.

> The obj_cgroup_may_zswap() call has a comment that says it needs to get
> accurate stats for charging.

This one needs to be somewhat accurate to respect memcg limits. I am not
sure how much inaccuracy we can tolerate.

> These were the two cases I found, do you know of others?

Nothing that screamed at me, but as I mentioned, the non-deterministic
nature of this makes me uncomfortable and feels to me like a potential
way to get subtle bugs.

> > A proof of concept is basically something like:
> >
> > void cgroup_rstat_flush(cgroup)
> > {
> >	if (cgroup_is_descendant(cgroup, READ_ONCE(cgroup_under_flush))) {
> >		wait_for_completion_interruptible(&cgroup_under_flush->completion);
> >		return;
> >	}

This feels like what we would achieve by changing this to a mutex. The
main difference is that whoever is holding the lock still cannot sleep,
while waiters can (and more importantly, they don't disable interrupts).
This is essentially a middle ground between a mutex and a spinlock. I
think this dodges the priority inversion problem Shakeel described,
because a low-priority job holding the lock cannot sleep. Is there an
existing locking primitive that can achieve this?
> >
> >	__cgroup_rstat_lock(cgrp, -1);
> >	reinit_completion(&cgroup->completion);
> >	/* Any overlapping flush requests after this write will not spin
> >	 * on the lock */
> >	WRITE_ONCE(cgroup_under_flush, cgroup);
> >
> >	cgroup_rstat_flush_locked(cgrp);
> >	complete_all(&cgroup->completion);
> >	__cgroup_rstat_unlock(cgrp, -1);
> > }
> >
> > There may be missing barriers or chances to reduce the window between
> > __cgroup_rstat_lock and WRITE_ONCE(), but that's what I have in mind.
> > I think it's not too complicated, but we need to check if it fixes the
> > problem.
> >
> > If this is not preferable, then yeah, let's at least keep the
> > userspace behavior intact. This makes sure we don't affect userspace
> > negatively, and we can change it later as we please.
>
> I don't think userspace reading these stats needs to be 100% accurate.
> We are only reading io.stat, memory.stat and cpu.stat every 53
> seconds. Reading cpu.stat prints stats divided by NSEC_PER_USEC (1000).
>
> If userspace is reading these very often, then it will be killing the
> system, as the flush disables IRQs.
>
> On my prod system the flush of the root cgroup can take 35 ms, which is
> not good, but this inaccuracy should not matter for userspace.
>
> Please educate me on why we need accurate userspace stats?

My point is not about accuracy, although I think it's a reasonable
argument on its own (a lot of things could change in a short amount of
time, which is why I prefer magnitude-based ratelimiting). My point is
about logical ordering. If a userspace program reads the stats *after*
an event occurs, it expects to get a snapshot of the system state after
that event. Two examples are:

- A proactive reclaimer reading the stats after a reclaim attempt to
  check if it needs to reclaim more memory or fall back.

- A userspace OOM killer reading the stats after a usage spike to decide
  which workload to kill.
I listed such examples with more detail in [1], when I removed
stats_flush_ongoing from the memcg code.

[1] https://lore.kernel.org/lkml/20231129032154.3710765-6-yosryahmed@xxxxxxxxxx/