[..] > > What I think we should be doing is either supporting multiple > > ongoing_flushers (by replacing the global variable with a per-cgroup > > flag, perhaps), or biting the bullet and start using a mutex to drop > > lock yielding. If there aren't any concerns beyond priority inversion > > (raised by Shakeel), maybe adding a mutex_lock_timeout() variant as I > > suggested could fix this issue (and also put a bound on the flushing > > latency in general)? > > > > The mutex_lock_timeout is an interesting idea, but not a primitive we > have available today, right? We don't, but Waiman said it shouldn't be difficult to add: https://lore.kernel.org/lkml/623e62c5-3045-4dca-9f2c-ed15b8d3bad8@xxxxxxxxxx/ > > p.s. I have 5 prod machines running with mutex change, and I've not > gotten any SRE complaints. That's nice! > > > > (I suppose the latter is preferrable) > > > >> > >> IMHO we should go back to only doing ongoing_flusher for root-cgroup. > >> There is really low chance of sub-trees flushes being concurrent enough > >> to benefit from this, and it cause issues and needs (ugly) race handling. > >> > >> Further more sub-tree flushing doesn't take as long time as level 0 > >> flushing, which further lower the chance of concurrent flushes. > > > > I agree that the race handling is not pretty, and we can try to > > improve the implementation or handle the race in other ways. However, > > I think that skipping flushes for subtrees is valuable. I am not sure > > about the cgroup arrangement in your use case, but we do have cgroups > > with a lot of tasks in them (and/or child cgroups). If there is memory > > pressure (or hit the cgroup limit), they all may go into direct > > reclaim concurrently, so skipping flushes could really be beneficial. > > > > Of course, if the difference in complexity is not justified, we can go > > back to only supporting root cgroups for ongoing_flusher for now. But > > as I mentioned in the v4 email thread, some of the complexity may > > still remain anyway as we have multiple root cgroups in v1. > > > > Having an incremental step with "only supporting root cgroups for > ongoing_flusher for now" is a good step forward IMHO. > As you could see in grafana plot, this would be a significant production > improvement on its own, as it avoids wasting CPU resources spinning on > the lock. I am not opposed to this at all. All I am saying is, if we need to handle most complexity anyway due to multiple root cgroups in v1, then might as well support subtrees too. > > Being able to have multiple root cgroups, due to in v1, does pose an > implementation problem. Only having a single root, would allow to have > a race-free cmpxchg scheme. > Would it be reasonable to only support v2? The rstat code has so far been v1/v2 agnostic AFAICT, and Google is still using cgroup v1, so I naturally prefer we keep supporting both v1 and v2 going forward. > If so, how can I match on this? cgroup_on_dfl() is basically testing if a cgroup is on v2, but I really want to keep v1 included if possible :/ > > >> > >> Let's get some quick data on flush times from production, to support my > >> claim: > > > > Thanks for the data. I agree that in general root flushes will be a > > lot more expensive than subtree flushes, but keep in mind that the > > data may look really different depends on the cgroup setup. As I > > mentioned, I think we should still support subtree flushes unless the > > difference in complexity is not justified. > > > > It would be really valuable if you could provide production data on the > lock-hold times, just like I did with below bpftrace script... > Is that possible, please? Unfortunately, we don't have the infrastructure to do this :/ But again, I am not objecting to only supporting root cgroups as ongoing flushers for now if there is a justifiable complexity difference. So this shouldn't be a blocker.