Hi Ingo,

On Wed, Oct 4, 2023 at 12:26 AM Ingo Molnar <mingo@xxxxxxxxxx> wrote:
>
>
> * Namhyung Kim <namhyung@xxxxxxxxxx> wrote:
>
> > AFAIK we don't have a tool to measure the context switch overhead
> > directly. (I think I should add one to perf ftrace latency). But I can
> > see it with a simple perf bench command like this.
> >
> >   $ perf bench sched pipe -l 100000
> >   # Running 'sched/pipe' benchmark:
> >   # Executed 100000 pipe operations between two processes
> >
> >        Total time: 0.650 [sec]
> >
> >        6.505740 usecs/op
> >          153710 ops/sec
> >
> > It runs two tasks that communicate with each other using a pipe, so it
> > should stress the context switch code. These are the normal numbers on
> > my system. But after I run these two perf stat commands in the
> > background, the numbers vary a lot.
> >
> >   $ sudo perf stat -a -e cycles -G user.slice -- sleep 100000 &
> >   $ sudo perf stat -a -e uncore_imc/cas_count_read/ -- sleep 10000 &
> >
> > I will show the last two lines of perf bench sched pipe output for
> > three runs.
> >
> >   58.597060 usecs/op    # run 1
> >       17065 ops/sec
> >
> >   11.329240 usecs/op    # run 2
> >       88267 ops/sec
> >
> >   88.481920 usecs/op    # run 3
> >       11301 ops/sec
> >
> > I think the deviation comes from the fact that uncore events are
> > managed by a certain number of CPUs only. If the target process runs
> > on a CPU that manages the uncore PMU, it'd take longer. Otherwise it
> > won't affect the performance much.
>
> The numbers for pipe-message context switching will vary a lot depending
> on CPU migration patterns as well.
>
> The best way to measure context-switch overhead is to pin that task
> to a single CPU with something like:
>
>   $ taskset 1 perf stat --null --repeat 10 perf bench sched pipe -l 10000 >/dev/null
>
>    Performance counter stats for 'perf bench sched pipe -l 10000' (10 runs):
>
>           0.049798 +- 0.000102 seconds time elapsed  ( +-  0.21% )
>
> as you can see the 0.21% stddev is pretty low.
>
> If we allow 2 CPUs, both runtime and stddev are much higher:
>
>   $ taskset 3 perf stat --null --repeat 10 perf bench sched pipe -l 10000 >/dev/null
>
>    Performance counter stats for 'perf bench sched pipe -l 10000' (10 runs):
>
>             1.4835 +- 0.0383 seconds time elapsed  ( +-  2.58% )

Thanks for taking the time. I should have said I also tried this. But
the problem is not pure context-switch cost: the overhead is only
triggered when a switch crosses into a different cgroup. For example, I
counted the number of context switches.

  $ perf stat -e context-switches,cgroup-switches \
  > perf bench sched pipe -l 10000 > /dev/null

   Performance counter stats for 'perf bench sched pipe -l 10000':

              20,001      context-switches
              20,001      cgroup-switches

But if I use taskset,

  $ perf stat -e context-switches,cgroup-switches \
  > taskset -c 0 perf bench sched pipe -l 10000 > /dev/null

   Performance counter stats for 'taskset -c 0 perf bench sched pipe -l 10000':

              20,003      context-switches
                   2      cgroup-switches

So the regression didn't show up when I used taskset, because the two
tasks switch back and forth on the same CPU without ever changing
cgroups. Maybe I can add an option to perf bench sched to place the
sender and receiver in different cgroups.

Thanks,
Namhyung
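
P.S. In the meantime, the cross-cgroup switching can be reproduced by
hand: pin two shells to one CPU, move each into its own cgroup, and let
them ping-pong over a pair of fifos. A rough sketch, assuming cgroup v2
mounted at /sys/fs/cgroup and root privileges (the group names "pipe-a"
and "pipe-b" and the fifo paths are just made up for illustration):

  $ sudo mkdir /sys/fs/cgroup/pipe-a /sys/fs/cgroup/pipe-b
  $ mkfifo /tmp/f1 /tmp/f2

  # task A: move itself into pipe-a, then ping-pong on CPU 0
  $ sudo taskset -c 0 sh -c 'echo $$ > /sys/fs/cgroup/pipe-a/cgroup.procs;
      for i in $(seq 10000); do echo x > /tmp/f1; read y < /tmp/f2; done' &

  # task B: the same, but in pipe-b
  $ sudo taskset -c 0 sh -c 'echo $$ > /sys/fs/cgroup/pipe-b/cgroup.procs;
      for i in $(seq 10000); do read y < /tmp/f1; echo x > /tmp/f2; done' &

  # now cgroup-switches should track context-switches on CPU 0
  $ sudo perf stat -a -C 0 -e context-switches,cgroup-switches sleep 5

It's not a precise benchmark (the fifo open/close on every iteration
adds its own overhead), but every switch between the two tasks now
crosses a cgroup boundary, which is what the perf bench option would do
properly.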