* kernel test robot <oliver.sang@xxxxxxxxx> wrote:

> Hello,
>
> kernel test robot noticed a -19.0% regression of
> stress-ng.filename.ops_per_sec on:

Thanks for the testing, this is useful! So I've tabulated the results
into a much easier to read format:

> | testcase: change | stress-ng: stress-ng.filename.ops_per_sec -19.0% regression
> | testcase: change | stress-ng: stress-ng.lockbus.ops_per_sec -6.0% regression
> | testcase: change | stress-ng: stress-ng.sigfd.ops_per_sec 17.6% improvement
> | testcase: change | phoronix-test-suite: phoronix-test-suite.darktable.Masskrug.CPU-only.seconds -5.3% improvement
> | testcase: change | lmbench3: lmbench3.TCP.socket.bandwidth.64B.MB/sec 11.5% improvement
> | testcase: change | phoronix-test-suite: phoronix-test-suite.darktable.Boat.CPU-only.seconds -3.5% improvement
> | testcase: change | stress-ng: stress-ng.sigrt.ops_per_sec 100.2% improvement
> | testcase: change | stress-ng: stress-ng.sigsuspend.ops_per_sec -93.9% regression
> | testcase: change | stress-ng: stress-ng.sigsuspend.ops_per_sec -82.1% regression
> | testcase: change | stress-ng: stress-ng.sock.ops_per_sec 59.4% improvement
> | testcase: change | blogbench: blogbench.write_score -35.9% regression
> | testcase: change | hackbench: hackbench.throughput -4.8% regression
> | testcase: change | blogbench: blogbench.write_score -59.3% regression
> | testcase: change | stress-ng: stress-ng.exec.ops_per_sec -34.6% regression
> | testcase: change | netperf: netperf.Throughput_Mbps 60.6% improvement
> | testcase: change | hackbench: hackbench.throughput 19.1% improvement
> | testcase: change | stress-ng: stress-ng.dnotify.ops_per_sec -15.7% regression

And then sorted them along the regression/improvement axis:

> | testcase: change | stress-ng: stress-ng.sigsuspend.ops_per_sec -93.9% regression
> | testcase: change | stress-ng: stress-ng.sigsuspend.ops_per_sec -82.1% regression
> | testcase: change | blogbench: blogbench.write_score -59.3% regression
> | testcase: change | blogbench: blogbench.write_score -35.9% regression
> | testcase: change | stress-ng: stress-ng.exec.ops_per_sec -34.6% regression
> | testcase: change | stress-ng: stress-ng.filename.ops_per_sec -19.0% regression
> | testcase: change | stress-ng: stress-ng.dnotify.ops_per_sec -15.7% regression
> | testcase: change | stress-ng: stress-ng.lockbus.ops_per_sec -6.0% regression
> | testcase: change | hackbench: hackbench.throughput -4.8% regression
> | testcase: change | phoronix-test-suite: phoronix-test-suite.darktable.Masskrug.CPU-only.seconds +5.3% improvement
> | testcase: change | phoronix-test-suite: phoronix-test-suite.darktable.Boat.CPU-only.seconds +3.5% improvement
> | testcase: change | lmbench3: lmbench3.TCP.socket.bandwidth.64B.MB/sec 11.5% improvement
> | testcase: change | stress-ng: stress-ng.sigfd.ops_per_sec 17.6% improvement
> | testcase: change | hackbench: hackbench.throughput 19.1% improvement
> | testcase: change | stress-ng: stress-ng.sock.ops_per_sec 59.4% improvement
> | testcase: change | netperf: netperf.Throughput_Mbps 60.6% improvement
> | testcase: change | stress-ng: stress-ng.sigrt.ops_per_sec 100.2% improvement

Testing results notes:

 - The '+' denotes an inverted improvement: the darktable results are
   measured in seconds, where a reduction is an improvement, so I
   flipped the sign. The mixing of signs in the output of the ktest
   robot is arguably confusing.

 - Any hope of getting a similar summary format by default? It's much
   more informative than just picking out the biggest regression, which
   wasn't even done correctly AFAICT.

Summary: while there are a lot of improvements, it is primarily the
nature of the performance regressions that dictates the way forward:

 - stress-ng.sigsuspend.ops_per_sec regressions, -93%:

   Clearly signal delivery performance hurts from delayed preemption,
   but that should be straightforward to resolve, if we are willing to
   commit to adding a high-prio insta-wakeup variant API (a sketch of
   what that could look like is further below) ...

 - stress-ng.exec.ops_per_sec -34% regression:

   Likewise this possibly expresses that it's better to immediately
   reschedule during exec() - but maybe it's more than that and
   reflects some unfavorable migration, as suggested by the NUMA
   locality figures:

                %change       %stddev
                    |             \
      79317172    -34.2%   52217838 ±  3%  numa-numastat.node0.local_node
      79360983    -34.2%   52240348 ±  3%  numa-numastat.node0.numa_hit
      77971050    -33.2%   52068168 ±  3%  numa-numastat.node1.local_node
      78009071    -33.2%   52089987 ±  3%  numa-numastat.node1.numa_hit
         88287    -45.7%      47970 ±  2%  vmstat.system.cs

 - 'blogbench' regression of -59%:

   It too has a very large reduction in context switches:

       %stddev        %change      %stddev
           \              |            \
         30035          -49.7%      15097 ±  3%  vmstat.system.cs
       2243545 ±  2%     -4.1%    2152228        blogbench.read_score
      52412617          -28.3%   37571769        blogbench.time.file_system_outputs
       2682930          -74.1%     694136        blogbench.time.involuntary_context_switches
       2369329          -50.0%    1184098 ±  5%  blogbench.time.voluntary_context_switches
          5851          -35.9%       3752 ±  2%  blogbench.write_score

   It's unclear to me what's happening with this one, just from these
   stats, but it's "write_score" that hurts most.

 - 'stress-ng.filename.ops_per_sec' regression of -19%:

   This test suffered from an *increase* in context-switching, and a
   large increase in CPU-idle:

       %stddev        %change      %stddev
           \              |            \
       4641666          +19.5%    5545394 ±  2%  cpuidle..usage
         90589 ±  2%    +70.5%     154471 ±  2%  vmstat.system.cs
        628439          -19.2%     507711        stress-ng.filename.ops
         10317          -19.0%       8355        stress-ng.filename.ops_per_sec
        171981          -59.7%      69333 ±  3%  stress-ng.time.involuntary_context_switches
        770691 ±  3%   +200.9%    2319214        stress-ng.time.voluntary_context_switches

Anyway, it's clear from these results that while several workloads
benefit from our notion of wakeup-preemption, many others are hurt by
it - especially generic ones like the phoronix-test-suite, which have
no good way to turn off wakeup preemption (SCHED_BATCH might help
though - see the PS below).

One way to approach this: instead of always doing wakeup-preemption
(our current default), we could turn it around and only use it where
it is clearly beneficial - such as signal delivery, or exec().

The canonical way to solve this would be to give *userspace* a way to
signal that it's beneficial to preempt immediately, ie. yield(), but
right now that interface is hurting tasks that only want to give other
tasks a chance to run, without necessarily giving up their own right
to run:

	se->deadline += calc_delta_fair(se->slice, se);
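To illustrate the direction - purely a sketch, not even compile-tested,
and yield_task_fair_weak() is a made-up name: a 'weak yield' would ask
for a reschedule without pushing out the yielding task's deadline:

	/*
	 * Hypothetical "weak yield": offer the CPU to other runnable
	 * tasks, but don't inflate our own deadline the way
	 * yield_task_fair() does. If we still have the earliest
	 * eligible deadline, we simply keep running.
	 */
	static void yield_task_fair_weak(struct rq *rq)
	{
		/*
		 * Just set NEED_RESCHED and let pick_next_task()
		 * re-evaluate who should be running:
		 */
		resched_curr(rq);
	}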
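And the high-prio insta-wakeup variant mentioned above for the
sigsuspend regression could be little more than a new wakeup flag that
the wakeup-preemption path honors unconditionally. Again just a sketch:
WF_INSTA and wake_up_process_insta() do not exist, and the flag value
assumes the next free WF_ bit:

	/* Hypothetical wakeup flag: preempt the target immediately: */
	#define WF_INSTA	0x80

	int wake_up_process_insta(struct task_struct *p)
	{
		return try_to_wake_up(p, TASK_NORMAL, WF_INSTA);
	}

... which the fair class' wakeup-preemption check would then honor
ahead of the usual heuristics:

	/* In check_preempt_wakeup_fair(): */
	if (wake_flags & WF_INSTA)
		goto preempt;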
Anyway, my patch is obviously a no-go as-is, and this clearly needs
more work.

Thanks,

	Ingo
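PS. The SCHED_BATCH opt-in mentioned above is something workloads can
already do today - a minimal (untested) userspace example:

	#define _GNU_SOURCE
	#include <sched.h>
	#include <stdio.h>

	int main(void)
	{
		/* SCHED_BATCH requires a static priority of 0: */
		struct sched_param param = { .sched_priority = 0 };

		/*
		 * Mark the calling task as batch work: the scheduler
		 * then assumes it is CPU-intensive and mildly
		 * disfavors it in wakeup-preemption decisions.
		 */
		if (sched_setscheduler(0, SCHED_BATCH, &param) == -1) {
			perror("sched_setscheduler");
			return 1;
		}

		/* ... run the actual workload here ... */

		return 0;
	}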