On Wed, Jan 15, 2025 at 11:39 AM JP Kobryn <inwardvessel@xxxxxxxxx> wrote:
>
> Hi Yosry,
>
> On 1/14/25 5:39 PM, Yosry Ahmed wrote:
> > On Tue, Jan 14, 2025 at 5:33 PM JP Kobryn <inwardvessel@xxxxxxxxx> wrote:
> >>
> >> Hi Michal,
> >>
> >> On 1/13/25 10:25 AM, Shakeel Butt wrote:
> >>> On Wed, Jan 08, 2025 at 07:16:47PM +0100, Michal Koutný wrote:
> >>>> Hello JP.
> >>>>
> >>>> On Mon, Dec 23, 2024 at 05:13:53PM -0800, JP Kobryn <inwardvessel@xxxxxxxxx> wrote:
> >>>>> I've been experimenting with these changes to allow for separate
> >>>>> updating/flushing of cgroup stats per-subsystem.
> >>>>
> >>>> Nice.
> >>>>
> >>>>> I reached a point where this started to feel stable in my local
> >>>>> testing, so I wanted to share and get feedback on this approach.
> >>>>
> >>>> The split is not straightforwardly an improvement --
> >>>
> >>> The major improvement in my opinion is the performance isolation for
> >>> stats readers, i.e. cpu stats readers do not need to flush memory stats.
> >>>
> >>>> there's at least higher memory footprint
> >>>
> >>> Yes, this is indeed the case, and JP, can you please give a ballpark on
> >>> the memory overhead?
> >>
> >> Yes, the trade-off is using more memory to allow for separate trees.
> >> With these patches, the changes in allocated memory for the
> >> cgroup_rstat_cpu instances and their associated locks are:
> >>
> >>   static:  reduced by 58%
> >>   dynamic: increased by 344%
> >>
> >> The threefold increase on the dynamic side is attributed to now having 3
> >> rstat trees per cgroup (1 for base stats, 1 for memory, 1 for io),
> >> instead of originally just 1. The number will change if more subsystems
> >> start or stop using rstat in the future. Feel free to let me know if you
> >> would like to see the detailed breakdown of these values.
> >
> > What is the absolute per-CPU memory usage?
>
> This is what I calculate as the combined per-cpu usage.
>
> before: one cgroup_rstat_cpu instance for every cgroup
>
>   sizeof(cgroup_rstat_cpu) * nr_cgroups
>
> after: three cgroup_rstat_cpu instances for every cgroup, plus an
> updater lock for every subsystem plus one for base stats
>
>   sizeof(cgroup_rstat_cpu) * 3 * nr_cgroups +
>   sizeof(spinlock_t) * (CGROUP_SUBSYS_COUNT + 1)
>
> Note that "every cgroup" includes the root cgroup. Also, 3 represents
> the number of current rstat clients: base stats, memory, and io
> (assuming all are enabled).

On a config I have at hand, sizeof(cgroup_rstat_cpu) is 160 bytes.
Ignoring the spinlocks for a second because they don't scale with
cgroups, that'd be an extra 320 * nr_cgroups * nr_cpus bytes. On a
moderately large machine with 256 CPUs and 100 cgroups, for example,
that's ~8MB.

> As I'm writing this, I realize I might need to include the bpf cgroups
> as a fourth client and cover this in my testing.
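
For anyone who wants to sanity-check the arithmetic above, here is a
minimal back-of-envelope sketch in userspace C. It only restates the
numbers from this thread: the 160-byte struct size is the
config-dependent value I measured (it will differ on other configs),
the three trees are base stats, memory, and io, and the per-subsystem
spinlocks are deliberately ignored since they don't scale with cgroups
or CPUs.

/*
 * Back-of-envelope check of the extra per-CPU rstat footprint.
 * Assumptions (from this thread, config-dependent):
 *   - sizeof(struct cgroup_rstat_cpu) == 160 bytes
 *   - 3 rstat trees per cgroup: base stats, memory, io
 *   - per-subsystem updater spinlocks ignored (constant overhead,
 *     does not scale with nr_cgroups or nr_cpus)
 */
#include <stdio.h>

#define RSTAT_CPU_SIZE	160UL	/* sizeof(struct cgroup_rstat_cpu) */
#define NR_RSTAT_TREES	3UL	/* base stats, memory, io */

int main(void)
{
	unsigned long nr_cgroups = 100, nr_cpus = 256;

	/* before: one cgroup_rstat_cpu per cgroup per CPU */
	unsigned long before = RSTAT_CPU_SIZE * nr_cgroups * nr_cpus;

	/* after: one instance per rstat tree per cgroup per CPU */
	unsigned long after = NR_RSTAT_TREES * before;

	printf("extra: %lu bytes (~%lu MB)\n",
	       after - before, (after - before) / 1000000);
	return 0;
}

That prints "extra: 8192000 bytes (~8 MB)" for the 256-CPU, 100-cgroup
example, matching the estimate above.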