On Wed, Jan 15, 2025 at 11:39 AM JP Kobryn <inwardvessel@xxxxxxxxx> wrote:
>
> Hi Yosry,
>
> On 1/14/25 5:39 PM, Yosry Ahmed wrote:
> > On Tue, Jan 14, 2025 at 5:33 PM JP Kobryn <inwardvessel@xxxxxxxxx> wrote:
> >>
> >> Hi Michal,
> >>
> >> On 1/13/25 10:25 AM, Shakeel Butt wrote:
> >>> On Wed, Jan 08, 2025 at 07:16:47PM +0100, Michal Koutný wrote:
> >>>> Hello JP.
> >>>>
> >>>> On Mon, Dec 23, 2024 at 05:13:53PM -0800, JP Kobryn <inwardvessel@xxxxxxxxx> wrote:
> >>>>> I've been experimenting with these changes to allow for separate
> >>>>> updating/flushing of cgroup stats per-subsystem.
> >>>>
> >>>> Nice.
> >>>>
> >>>>> I reached a point where this started to feel stable in my local
> >>>>> testing, so I wanted to share and get feedback on this approach.
> >>>>
> >>>> The split is not straightforwardly an improvement --
> >>>
> >>> The major improvement in my opinion is the performance isolation for
> >>> stats readers, i.e. cpu stats readers do not need to flush memory stats.
> >>>
> >>>> there's at least higher memory footprint
> >>>
> >>> Yes, this is indeed the case, and JP, can you please give a ballpark on
> >>> the memory overhead?
> >>
> >> Yes, the trade-off is using more memory to allow for separate trees.
> >> With these patches, the changes in allocated memory for the
> >> cgroup_rstat_cpu instances and their associated locks are:
> >>
> >>   static:  reduced by 58%
> >>   dynamic: increased by 344%
> >>
> >> The threefold increase on the dynamic side is attributed to now having 3
> >> rstat trees per cgroup (1 for base stats, 1 for memory, 1 for io),
> >> instead of originally just 1. The number will change if more subsystems
> >> start or stop using rstat in the future. Feel free to let me know if you
> >> would like to see the detailed breakdown of these values.
> >
> > What is the absolute per-CPU memory usage?
>
> This is what I calculate as the combined per-cpu usage.
>
> before: one cgroup_rstat_cpu instance for every cgroup
>
>   sizeof(cgroup_rstat_cpu) * nr_cgroups
>
> after: three cgroup_rstat_cpu instances for every cgroup, plus an
> updater lock for every subsystem plus one for base stats
>
>   sizeof(cgroup_rstat_cpu) * 3 * nr_cgroups +
>   sizeof(spinlock_t) * (CGROUP_SUBSYS_COUNT + 1)
>
> Note that "every cgroup" includes the root cgroup. Also, 3 represents
> the number of current rstat clients: base stats, memory, and io
> (assuming all are enabled).

On a config I have at hand, sizeof(cgroup_rstat_cpu) is 160 bytes.
Ignoring the spinlocks for a second because they don't scale with
cgroups, that'd be an extra 320 * nr_cgroups * nr_cpus bytes. On a
moderately large machine with 256 CPUs and 100 cgroups, for example,
that's ~8MB.

> As I'm writing this, I realize I might need to include the bpf cgroups
> as a fourth client and cover this in my testing.
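
For anyone who wants to sanity-check the arithmetic above, here is a
minimal back-of-envelope sketch in userspace C. It only restates the
numbers from this thread: the 160-byte struct size is the
config-dependent value I measured (it will differ on other configs),
the three trees are base stats, memory, and io, and the per-subsystem
spinlocks are deliberately ignored since they don't scale with cgroups
or CPUs.

/*
 * Back-of-envelope check of the extra per-CPU rstat footprint.
 * Assumptions (from this thread, config-dependent):
 *   - sizeof(struct cgroup_rstat_cpu) == 160 bytes
 *   - 3 rstat trees per cgroup: base stats, memory, io
 *   - per-subsystem updater spinlocks ignored (constant overhead,
 *     does not scale with nr_cgroups or nr_cpus)
 */
#include <stdio.h>

#define RSTAT_CPU_SIZE	160UL	/* sizeof(struct cgroup_rstat_cpu) */
#define NR_RSTAT_TREES	3UL	/* base stats, memory, io */

int main(void)
{
	unsigned long nr_cgroups = 100, nr_cpus = 256;

	/* before: one cgroup_rstat_cpu per cgroup per CPU */
	unsigned long before = RSTAT_CPU_SIZE * nr_cgroups * nr_cpus;

	/* after: one instance per rstat tree per cgroup per CPU */
	unsigned long after = NR_RSTAT_TREES * before;

	printf("extra: %lu bytes (~%lu MB)\n",
	       after - before, (after - before) / 1000000);
	return 0;
}

That prints "extra: 8192000 bytes (~8 MB)" for the 256-CPU, 100-cgroup
example, matching the estimate above.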