On Tue, Aug 15, 2023 at 7:20 PM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote: > [...] > > The problem in (1) is that first of all it's a behavioral change, we > start having explicit staleness in the stats, and userspace needs to > adapt by explicitly requesting a flush. A node controller can be > enlightened to do so, but on a system with a lot of cgroups, if you > flush once explicitly and iterate through all cgroups, the flush will > be stale by the time you reach the last cgroup. Keep in mind there are > also users that read their own stats, figuring out which users need to > flush explicitly vs. read cached stats is a problem. I thought we covered the invalidity of the staleness argument. Similar staleness can happen today, so not strictly a behavioral change. We can change the time window and condition of the periodic flush to reduce the chance of staleness. Option 2 can also face staleness as well. > > Taking a step back, the total work that needs to be done does not > change with (2). A node controller iterating cgroups and reading their > stats will do the same amount of flushing, it will just be distributed > across multiple read syscalls, so shorter intervals in kernel space. You seem to be worried about the very fine grained staleness of the stats. So, for scenarios where stats of multi-level cgroups need to be read and the workload is continuously updating the stats, the total work can be much more. For example if we are reading stats of root and top level memcgs then potentially option 2 can flush the stats for the whole tree twice. > > There are also in-kernel flushers (e.g. reclaim and dirty throttling) > that will benefit from (2) by reading more accurate stats without > having to flush the entire tree. The behavior is currently > indeterministic, you may get fresh or stale stats, you may flush one > cgroup or 100 cgroups. > > I think with (2) we make less compromises in terms of accuracy and > determinism, and it's a less disruptive change to userspace. These options are not white and black and there can be something in between but let me be very clear on what I don't want and would NACK. I don't want a global sleepable lock which can be taken by potentially any application running on the system. We have seen similar global locks causing isolation and priority inversion issues in production. So, not another lock which needs to be taken under extreme condition (reading stats under OOM) by a high priority task (node controller) and might be held by a low priority task.