Hello,

On Mon, Jul 25, 2022 at 12:52:17PM -0400, Johannes Weiner wrote:
> On Thu, Jul 21, 2022 at 12:04:38PM +0800, Chengming Zhou wrote:
> > PSI accounts stalls for each cgroup separately and aggregates it
> > at each level of the hierarchy. This may case non-negligible overhead
> > for some workloads when under deep level of the hierarchy.
> >
> > commit 3958e2d0c34e ("cgroup: make per-cgroup pressure stall tracking configurable")
> > make PSI to skip per-cgroup stall accounting, only account system-wide
> > to avoid this each level overhead.
> >
> > For our use case, we also want leaf cgroup PSI accounted for userspace
> > adjustment on that cgroup, apart from only system-wide management.
>
> I hear the overhead argument. But skipping accounting in intermediate
> levels is a bit odd and unprecedented in the cgroup interface. Once we
> do this, it's conceivable people would like to do the same thing for
> other stats and accounting, like for instance memory.stat.
>
> Tejun, what are your thoughts on this?

Given that PSI requires on-the-spot recursive accumulation unlike other
stats, it can add quite a bit of overhead, so I'm sympathetic to the
argument because PSI can't be made cheaper by kernel being better (or at
least we don't know how to yet).

That said, "leaf-only" feels really hacky to me. My memory is hazy but
there's nothing preventing any cgroup from being skipped over when
updating PSI states, right? The state count propagation is recursive but
it's each task's state being propagated upwards not the child cgroup's,
so we can skip over any cgroup arbitrarily. ie. we can at least turn off
PSI reporting on any given cgroup without worrying about affecting
others. Am I correct?

Assuming the above isn't wrong, if we can figure out how we can
re-enable it, which is more difficult as the counters need to be
resynchronized with the current state, that'd be ideal. Then, we can
just allow each cgroup to enable / disable PSI reporting dynamically as
they see fit.
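To make the above concrete, here is a toy userspace model of the idea,
not kernel code. All names in it (Cgroup, Task, set_stalled,
disable_psi, enable_psi) are made up for illustration, and it tracks a
single "stalled" count per cgroup instead of the real per-cpu state
counts. It assumes that a task's state change is applied to each
ancestor directly, so a disabled cgroup can simply be skipped, and that
re-enabling has to recount from the current task states.

```python
# Toy model only. Cgroup, Task, set_stalled, disable_psi and enable_psi
# are hypothetical names and do not correspond to kernel symbols.

class Cgroup:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.psi_enabled = True
        self.nr_stalled = 0  # stalled tasks currently in this subtree


class Task:
    def __init__(self, cgroup):
        self.cgroup = cgroup
        self.stalled = False


def ancestors(cg):
    """Yield cg itself, then each ancestor up to the root."""
    while cg is not None:
        yield cg
        cg = cg.parent


def set_stalled(task, stalled):
    """Apply one task's state change to every enabled ancestor."""
    if task.stalled == stalled:
        return
    task.stalled = stalled
    delta = 1 if stalled else -1
    for cg in ancestors(task.cgroup):
        # The task's own delta is applied at each level, so skipping one
        # level does not change what the other levels see.
        if cg.psi_enabled:
            cg.nr_stalled += delta


def disable_psi(cg):
    cg.psi_enabled = False
    cg.nr_stalled = 0


def enable_psi(cg, tasks):
    """Resynchronize the counter with current task state, then enable."""
    cg.nr_stalled = sum(
        1 for t in tasks if t.stalled and cg in ancestors(t.cgroup)
    )
    cg.psi_enabled = True


root = Cgroup("root")
mid = Cgroup("mid", root)
leaf = Cgroup("leaf", mid)
tasks = [Task(leaf), Task(leaf)]

disable_psi(mid)
set_stalled(tasks[0], True)   # root and leaf are updated, mid is skipped
enable_psi(mid, tasks)        # mid is recounted from current task state
set_stalled(tasks[1], True)
set_stalled(tasks[0], False)
```

Without the recount in enable_psi(), mid would have missed the first
task's transition into the stalled state and would end up one short once
that task leaves it. That is the resynchronization problem in miniature;
the real thing would also have to deal with concurrent updates, which
this sketch ignores.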

Thanks.

--
tejun