Hi Ingo,

On Wed, Oct 4, 2023 at 12:26 AM Ingo Molnar <mingo@xxxxxxxxxx> wrote:
>
>
> * Namhyung Kim <namhyung@xxxxxxxxxx> wrote:
>
> > AFAIK we don't have a tool to measure the context switch overhead
> > directly. (I think I should add one to perf ftrace latency). But I can
> > see it with a simple perf bench command like this.
> >
> >   $ perf bench sched pipe -l 100000
> >   # Running 'sched/pipe' benchmark:
> >   # Executed 100000 pipe operations between two processes
> >
> >        Total time: 0.650 [sec]
> >
> >        6.505740 usecs/op
> >          153710 ops/sec
> >
> > It runs two tasks that communicate with each other using a pipe, so it
> > should stress the context switch code. These are the normal numbers on
> > my system. But after I run these two perf stat commands in the
> > background, the numbers vary a lot.
> >
> >   $ sudo perf stat -a -e cycles -G user.slice -- sleep 100000 &
> >   $ sudo perf stat -a -e uncore_imc/cas_count_read/ -- sleep 10000 &
> >
> > I will show the last two lines of perf bench sched pipe output for
> > three runs.
> >
> >   58.597060 usecs/op    # run 1
> >       17065 ops/sec
> >
> >   11.329240 usecs/op    # run 2
> >       88267 ops/sec
> >
> >   88.481920 usecs/op    # run 3
> >       11301 ops/sec
> >
> > I think the deviation comes from the fact that uncore events are
> > managed by a certain number of CPUs only. If the target process runs
> > on a CPU that manages the uncore PMU, it'd take longer. Otherwise it
> > won't affect the performance much.
>
> The numbers for pipe-message context switching will vary a lot depending
> on CPU migration patterns as well.
>
> The best way to measure context-switch overhead is to pin that task
> to a single CPU with something like:
>
>   $ taskset 1 perf stat --null --repeat 10 perf bench sched pipe -l 10000 >/dev/null
>
>    Performance counter stats for 'perf bench sched pipe -l 10000' (10 runs):
>
>           0.049798 +- 0.000102 seconds time elapsed  ( +-  0.21% )
>
> as you can see the 0.21% stddev is pretty low.
>
> If we allow 2 CPUs, both runtime and stddev are much higher:
>
>   $ taskset 3 perf stat --null --repeat 10 perf bench sched pipe -l 10000 >/dev/null
>
>    Performance counter stats for 'perf bench sched pipe -l 10000' (10 runs):
>
>             1.4835 +- 0.0383 seconds time elapsed  ( +-  2.58% )

Thanks for taking the time. I should have said I also tried this. But
the problem is not pure context-switch cost: the overhead is only
triggered when a switch crosses into a different cgroup. For example, I
counted the number of context switches.

  $ perf stat -e context-switches,cgroup-switches \
  > perf bench sched pipe -l 10000 > /dev/null

   Performance counter stats for 'perf bench sched pipe -l 10000':

              20,001      context-switches
              20,001      cgroup-switches

But if I use taskset,

  $ perf stat -e context-switches,cgroup-switches \
  > taskset -c 0 perf bench sched pipe -l 10000 > /dev/null

   Performance counter stats for 'taskset -c 0 perf bench sched pipe -l 10000':

              20,003      context-switches
                   2      cgroup-switches

So the regression didn't show up when I used taskset, because the two
tasks switch back and forth on the same CPU without ever changing
cgroups. Maybe I can add an option to perf bench sched to place the
sender and receiver in different cgroups.

Thanks,
Namhyung
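
P.S. In the meantime, the cross-cgroup switching can be reproduced by
hand: pin two shells to one CPU, move each into its own cgroup, and let
them ping-pong over a pair of fifos. A rough sketch, assuming cgroup v2
mounted at /sys/fs/cgroup and root privileges (the group names "pipe-a"
and "pipe-b" and the fifo paths are just made up for illustration):

  $ sudo mkdir /sys/fs/cgroup/pipe-a /sys/fs/cgroup/pipe-b
  $ mkfifo /tmp/f1 /tmp/f2

  # task A: move itself into pipe-a, then ping-pong on CPU 0
  $ sudo taskset -c 0 sh -c 'echo $$ > /sys/fs/cgroup/pipe-a/cgroup.procs;
      for i in $(seq 10000); do echo x > /tmp/f1; read y < /tmp/f2; done' &

  # task B: the same, but in pipe-b
  $ sudo taskset -c 0 sh -c 'echo $$ > /sys/fs/cgroup/pipe-b/cgroup.procs;
      for i in $(seq 10000); do read y < /tmp/f1; echo x > /tmp/f2; done' &

  # now cgroup-switches should track context-switches on CPU 0
  $ sudo perf stat -a -C 0 -e context-switches,cgroup-switches sleep 5

It's not a precise benchmark (the fifo open/close on every iteration
adds its own overhead), but every switch between the two tasks now
crosses a cgroup boundary, which is what the perf bench option would do
properly.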