On Wed, Oct 11, 2023 at 8:13 PM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
>
> On Wed, Oct 11, 2023 at 5:46 AM Shakeel Butt <shakeelb@xxxxxxxxxx> wrote:
> >
> > On Tue, Oct 10, 2023 at 6:48 PM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
> > >
> > > On Tue, Oct 10, 2023 at 5:36 PM Shakeel Butt <shakeelb@xxxxxxxxxx> wrote:
> > > >
> > > > On Tue, Oct 10, 2023 at 03:21:47PM -0700, Yosry Ahmed wrote:
> > > > [...]
> > > > >
> > > > > I tried this on a machine with 72 cpus (also ixion), running both
> > > > > netserver and netperf in /sys/fs/cgroup/a/b/c/d as follows:
> > > > > # echo "+memory" > /sys/fs/cgroup/cgroup.subtree_control
> > > > > # mkdir /sys/fs/cgroup/a
> > > > > # echo "+memory" > /sys/fs/cgroup/a/cgroup.subtree_control
> > > > > # mkdir /sys/fs/cgroup/a/b
> > > > > # echo "+memory" > /sys/fs/cgroup/a/b/cgroup.subtree_control
> > > > > # mkdir /sys/fs/cgroup/a/b/c
> > > > > # echo "+memory" > /sys/fs/cgroup/a/b/c/cgroup.subtree_control
> > > > > # mkdir /sys/fs/cgroup/a/b/c/d
> > > > > # echo 0 > /sys/fs/cgroup/a/b/c/d/cgroup.procs
> > > > > # ./netserver -6
> > > > >
> > > > > # echo 0 > /sys/fs/cgroup/a/b/c/d/cgroup.procs
> > > > > # for i in $(seq 10); do ./netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K; done
> > > >
> > > > You are missing '&' at the end. Use something like below:
> > > >
> > > > #!/bin/bash
> > > > for i in {1..22}
> > > > do
> > > >     /data/tmp/netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K &
> > > > done
> > > > wait
> > > >
> > >
> > > Oh sorry I missed the fact that you are running instances in parallel, my bad.
> > >
> > > So I ran 36 instances on a machine with 72 cpus. I did this 10 times
> > > and got an average from all instances for all runs to reduce noise:
> > >
> > > #!/bin/bash
> > >
> > > ITER=10
> > > NR_INSTANCES=36
> > >
> > > for i in $(seq $ITER); do
> > >     echo "iteration $i"
> > >     for j in $(seq $NR_INSTANCES); do
> > >         echo "iteration $i" >> "out$j"
> > >         ./netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K >> "out$j" &
> > >     done
> > >     wait
> > > done
> > >
> > > cat out* | grep 540000 | awk '{sum += $5} END {print sum/NR}'
> > >
> > > Base: 22169 mbps
> > > Patched: 21331.9 mbps
> > >
> > > The difference is ~3.7% in my runs. I am not sure what's different.
> > > Perhaps it's the number of runs?
> >
> > My base kernel is next-20231009 and I am running experiments with
> > hyperthreading disabled.
>
> Using next-20231009 and a similar 44 core machine with hyperthreading
> disabled, I ran 22 instances of netperf in parallel and got the
> following numbers from averaging 20 runs:
>
> Base: 33076.5 mbps
> Patched: 31410.1 mbps
>
> That's about a 5% difference. I guess the number of iterations helps
> reduce the noise? I am not sure.
>
> Please also keep in mind that in this case all netperf instances are
> in the same cgroup and at a 4-level depth. I imagine that in a
> practical setup processes would be a little more spread out, which
> means fewer common ancestors and hence less contention on the shared
> atomic counters.
I was curious, so I ran the same test in a cgroup 2 levels deep
(i.e. /sys/fs/cgroup/a/b), which is a much more common setup in my
experience. Here are the numbers:

Base: 40198.0 mbps
Patched: 38629.7 mbps

The regression is reduced to ~3.9%.
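
For reference, the 2-level setup is the same sequence as the 4-level
one above, just stopped at /sys/fs/cgroup/a/b, with netserver and
netperf both run from there:

# echo "+memory" > /sys/fs/cgroup/cgroup.subtree_control
# mkdir /sys/fs/cgroup/a
# echo "+memory" > /sys/fs/cgroup/a/cgroup.subtree_control
# mkdir /sys/fs/cgroup/a/b
# echo 0 > /sys/fs/cgroup/a/b/cgroup.procs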

What's more interesting is that going from a level-2 cgroup to a
level-4 cgroup is already a big hit, with or without this patch:

Base: 40198.0 -> 33076.5 mbps (~17.7% regression)
Patched: 38629.7 -> 31410.1 mbps (~18.7% regression)

So going from level 2 to level 4 is already a significant regression
for other reasons (e.g. hierarchical charging), and this patch only
makes it marginally worse. To me, that puts the numbers into better
perspective than comparing absolute values at level 4. What do you
think?
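
In case it is useful for reproduction, below is a rough (untested)
sketch of the kind of script I would use to sweep depths. The l1..lN
directory names and the default DEPTH/NR_INSTANCES values are just
placeholders:

#!/bin/bash
# Rough sketch: build a memcg hierarchy DEPTH levels deep and run the
# parallel netperf workload from the leaf cgroup.
DEPTH=${1:-4}
NR_INSTANCES=22

path=/sys/fs/cgroup
for i in $(seq "$DEPTH"); do
    # Enable the memory controller for the children of the current
    # level, then create and descend into the next level.
    echo "+memory" > "$path/cgroup.subtree_control"
    path="$path/l$i"
    mkdir "$path"
done

# Writing "0" moves the calling process (this shell) into the leaf
# cgroup; netserver and the netperf instances inherit it.
echo 0 > "$path/cgroup.procs"
./netserver -6

for j in $(seq "$NR_INSTANCES"); do
    ./netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K &
done
wait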