On Thu, Jul 18, 2024 at 08:11:32PM GMT, Yosry Ahmed wrote: > On Thu, Jul 18, 2024 at 5:41 PM Shakeel Butt <shakeel.butt@xxxxxxxxx> wrote: > > > > Hi Jesper, > > > > On Wed, Jul 17, 2024 at 06:36:28PM GMT, Jesper Dangaard Brouer wrote: > > > > > [...] > > > > > > > > > Looking at the production numbers for the time the lock is held for level 0: > > > > > > @locked_time_level[0]: > > > [4M, 8M) 623 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | > > > [8M, 16M) 860 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| > > > [16M, 32M) 295 |@@@@@@@@@@@@@@@@@ | > > > [32M, 64M) 275 |@@@@@@@@@@@@@@@@ | > > > > > > > Is it possible to get the above histogram for other levels as well? I > > know this is 12 numa node machine, how many total CPUs are there? > > > > > The time is in nanosec, so M corresponds to ms (milliseconds). > > > > > > With 36 flushes per second (as shown earlier) this is a flush every > > > 27.7ms. It is not unreasonable (from above data) that the flush time > > > also spend 27ms, which means that we spend a full CPU second flushing. > > > That is spending too much time flushing. > > > > One idea to further reduce this time is more fine grained flush > > skipping. At the moment we either skip the whole flush or not. How > > about we make this decision per-cpu? We already have per-cpu updates > > data and if it is less than MEMCG_CHARGE_BATCH, skip flush on that cpu. > > Good idea. > > I think we would need a per-subsystem callback to decide whether we > want to flush the cgroup or not. This needs to happen in the core > rstat flushing code (not the memcg flushing code), as we need to make > sure we do not remove the cgroup from the per-cpu updated tree if we > don't flush it. Unless we have per-subsystem update tree, I don't think per-subsystem callback would work or we would be flushing all if any subsystem wants it. Anyways we can discuss when we have data that it really helps. > > More generally, I think we should be able to have a "force" flush API > that skips all optimizations and ensures that a flush occurs. I think > this will be needed in the cgroup_rstat_exit() path, where stats of a > cgroup being freed must be propagated to its parent, no matter how > insignificant they may be, to avoid inconsistencies. Agree.