Re: [PATCH 0/9 RFC] cgroup: separate rstat trees

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 





On 1/15/25 1:36 PM, Yosry Ahmed wrote:
On Wed, Jan 15, 2025 at 11:39 AM JP Kobryn <inwardvessel@xxxxxxxxx> wrote:

Hi Yosry,

On 1/14/25 5:39 PM, Yosry Ahmed wrote:
On Tue, Jan 14, 2025 at 5:33 PM JP Kobryn <inwardvessel@xxxxxxxxx> wrote:

Hi Michal,

On 1/13/25 10:25 AM, Shakeel Butt wrote:
On Wed, Jan 08, 2025 at 07:16:47PM +0100, Michal Koutný wrote:
Hello JP.

On Mon, Dec 23, 2024 at 05:13:53PM -0800, JP Kobryn <inwardvessel@xxxxxxxxx> wrote:
I've been experimenting with these changes to allow for separate
updating/flushing of cgroup stats per-subsystem.

Nice.

I reached a point where this started to feel stable in my local testing, so I
wanted to share and get feedback on this approach.

The split is not straight-forwardly an improvement --

The major improvement in my opinion is the performance isolation for
stats readers i.e. cpu stats readers do not need to flush memory stats.

there's at least
higher memory footprint

Yes this is indeed the case and JP, can you please give a ballmark on
the memory overhead?

Yes, the trade-off is using more memory to allow for separate trees.
With these patches the changes in allocated memory for the
cgroup_rstat_cpu instances and their associated locks are:
static
          reduced by 58%
dynamic
          increased by 344%

The threefold increase on the dynamic side is attributed to now having 3
rstat trees per cgroup (1 for base stats, 1 for memory, 1 for io),
instead of originally just 1. The number will change if more subsystems
start or stop using rstat in the future. Feel free to let me know if you
would like to see the detailed breakdown of these values.

What is the absolute per-CPU memory usage?

This is what I calculate as the combined per-cpu usage.
before:
         one cgroup_rstat_cpu instance for every cgroup
         sizeof(cgroup_rstat_cpu) * nr_cgroups
after:
         three cgroup_rstat_cpu instances for every cgroup + updater lock for
every subsystem plus one for base stats
         sizeof(cgroup_rstat_cpu) * 3 * nr_cgroups +
                 sizeof(spinlock_t) * (CGROUP_SUBSYS_COUNT + 1)

Note that "every cgroup" includes the root cgroup. Also, 3 represents
the number of current rstat clients: base stats, memory, and io
(assuming all enabled).

On a config I have at hand sizeof(cgroup_rstat_cpu) is 160 bytes.
Ignoring the spinlock for a second because it doesn't scale with
cgroups, that'd be an extra 320 * nr_cgroups * nr_cpus bytes. On a
moderately large machine with 256 CPUs and 100 cgroups for example
that's ~8MB.


Revisiting the cgroup_rstat_cpu struct, there might be an opportunity to save some memory here. This struct has several cgroup_base_stat fields that are not useful to the actual subsystems (memory, io). So I'm considering a scenario where the existing cgroup_rstat_cpu is used for the base stats while a new lighter struct is used for others that maintains compatibility with the rstat infrastructure.


As I'm writing this, I realize I might need to include the bpf cgroups
as a fourth client and include this in my testing.





[Index of Archives]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Security]     [Bugtraq]     [Linux OMAP]     [Linux MIPS]     [eCos]     [Asterisk Internet PBX]     [Linux API]     [Monitors]

  Powered by Linux