[..]
> > >
> > > Using next-20231009 and a similar 44 core machine with hyperthreading
> > > disabled, I ran 22 instances of netperf in parallel and got the
> > > following numbers from averaging 20 runs:
> > >
> > > Base: 33076.5 mbps
> > > Patched: 31410.1 mbps
> > >
> > > That's about a 5% diff. I guess the number of iterations helps reduce
> > > the noise? I am not sure.
> > >
> > > Please also keep in mind that in this case all netperf instances are
> > > in the same cgroup and at a 4-level depth. I imagine in a practical
> > > setup processes would be a little more spread out, which means fewer
> > > common ancestors, so less contention on the atomic operations.
> > >
> > > (Resending the reply as I messed up the last one, it was not in plain text)
> >
> > I was curious, so I ran the same testing in a cgroup 2 levels deep
> > (i.e. /sys/fs/cgroup/a/b), which is a much more common setup in my
> > experience. Here are the numbers:
> >
> > Base: 40198.0 mbps
> > Patched: 38629.7 mbps
> >
> > The regression is reduced to ~3.9%.
> >
> > What's more interesting is that going from a level 2 cgroup to a level
> > 4 cgroup is already a big hit with or without this patch:
> >
> > Base: 40198.0 -> 33076.5 mbps (~17.7% regression)
> > Patched: 38629.7 -> 31410.1 mbps (~18.7% regression)
> >
> > So going from level 2 to level 4 is already a significant regression
> > for other reasons (e.g. hierarchical charging). This patch only makes
> > it marginally worse. This puts the numbers more into perspective, imo,
> > than comparing values at level 4. What do you think?
>
> I think it's reasonable.
>
> Especially comparing to how many cachelines we used to touch on the
> write side when all flushing happened there. This looks like a good
> trade-off to me.

Thanks. I am still trying to figure out whether this patch is what you
suggested in our previous discussion [1], so that I can add a
Suggested-by if appropriate :)

[1] https://lore.kernel.org/lkml/20230913153758.GB45543@xxxxxxxxxxx/
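
To make the depth effect concrete, below is a minimal userspace sketch
of hierarchical charging. It is only an illustration under the
assumption that each charge is propagated to every ancestor with an
atomic add; it is not the kernel's actual memcg/rstat code, and the
struct cg / charge_hierarchy() names are hypothetical.

#include <stdatomic.h>

/* Hypothetical stand-in for a cgroup with one hierarchical counter. */
struct cg {
	struct cg *parent;	/* NULL at the root */
	atomic_long nr_charged;	/* counter shared with all descendants */
};

/*
 * Charge 'amount' to 'cg' and to every ancestor up to the root.
 * A cgroup 4 levels deep performs roughly twice as many atomic adds
 * per charge as one 2 levels deep, and tasks in sibling cgroups still
 * contend on the cachelines of their common ancestors.
 */
static void charge_hierarchy(struct cg *cg, long amount)
{
	for (; cg; cg = cg->parent)
		atomic_fetch_add(&cg->nr_charged, amount);
}

This is only meant to show why the level 2 -> level 4 delta exists
independently of the patch; the real write path touches more state
than a single counter.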