Re: [PATCH] sched/fair: Do not wakeup-preempt same-prio SCHED_OTHER tasks

* kernel test robot <oliver.sang@xxxxxxxxx> wrote:

> Hello,
> 
> kernel test robot noticed a -19.0% regression of stress-ng.filename.ops_per_sec on:

Thanks for the testing - this is useful!

So I've tabulated the results into a much easier-to-read format:

> | testcase: change | stress-ng: stress-ng.filename.ops_per_sec                                      -19.0% regression
> | testcase: change | stress-ng: stress-ng.lockbus.ops_per_sec                                        -6.0% regression 
> | testcase: change | stress-ng: stress-ng.sigfd.ops_per_sec                                          17.6% improvement
> | testcase: change | phoronix-test-suite: phoronix-test-suite.darktable.Masskrug.CPU-only.seconds    -5.3% improvement
> | testcase: change | lmbench3: lmbench3.TCP.socket.bandwidth.64B.MB/sec                              11.5% improvement
> | testcase: change | phoronix-test-suite: phoronix-test-suite.darktable.Boat.CPU-only.seconds        -3.5% improvement
> | testcase: change | stress-ng: stress-ng.sigrt.ops_per_sec                                         100.2% improvement
> | testcase: change | stress-ng: stress-ng.sigsuspend.ops_per_sec                                    -93.9% regression
> | testcase: change | stress-ng: stress-ng.sigsuspend.ops_per_sec                                    -82.1% regression
> | testcase: change | stress-ng: stress-ng.sock.ops_per_sec                                           59.4% improvement
> | testcase: change | blogbench: blogbench.write_score                                               -35.9% regression
> | testcase: change | hackbench: hackbench.throughput                                                 -4.8% regression
> | testcase: change | blogbench: blogbench.write_score                                               -59.3% regression
> | testcase: change | stress-ng: stress-ng.exec.ops_per_sec                                          -34.6% regression
> | testcase: change | netperf: netperf.Throughput_Mbps                                                60.6% improvement
> | testcase: change | hackbench: hackbench.throughput                                                 19.1% improvement
> | testcase: change | stress-ng: stress-ng.dnotify.ops_per_sec                                       -15.7% regression

And then sorted them along the regression/improvement axis:

> | testcase: change | stress-ng: stress-ng.sigsuspend.ops_per_sec                                    -93.9% regression
> | testcase: change | stress-ng: stress-ng.sigsuspend.ops_per_sec                                    -82.1% regression
> | testcase: change | blogbench: blogbench.write_score                                               -59.3% regression
> | testcase: change | blogbench: blogbench.write_score                                               -35.9% regression
> | testcase: change | stress-ng: stress-ng.exec.ops_per_sec                                          -34.6% regression
> | testcase: change | stress-ng: stress-ng.filename.ops_per_sec                                      -19.0% regression
> | testcase: change | stress-ng: stress-ng.dnotify.ops_per_sec                                       -15.7% regression
> | testcase: change | stress-ng: stress-ng.lockbus.ops_per_sec                                        -6.0% regression
> | testcase: change | hackbench: hackbench.throughput                                                 -4.8% regression
> | testcase: change | phoronix-test-suite: phoronix-test-suite.darktable.Masskrug.CPU-only.seconds    +5.3% improvement
> | testcase: change | phoronix-test-suite: phoronix-test-suite.darktable.Boat.CPU-only.seconds        +3.5% improvement
> | testcase: change | lmbench3: lmbench3.TCP.socket.bandwidth.64B.MB/sec                              11.5% improvement
> | testcase: change | stress-ng: stress-ng.sigfd.ops_per_sec                                          17.6% improvement
> | testcase: change | hackbench: hackbench.throughput                                                 19.1% improvement
> | testcase: change | stress-ng: stress-ng.sock.ops_per_sec                                           59.4% improvement
> | testcase: change | netperf: netperf.Throughput_Mbps                                                60.6% improvement
> | testcase: change | stress-ng: stress-ng.sigrt.ops_per_sec                                         100.2% improvement

Notes on the testing results:

    - The '+' denotes a sign-inverted improvement (the 'seconds' metrics,
      where lower is better). The mixing of signs in the kernel test robot's
      output is arguably confusing.

    - Any hope of getting a similar summary format by default? It's much more
      informative than just picking out the biggest regression, which wasn't
      even done correctly AFAICT.

Summary:

While there are a lot of improvements, it is primarily the nature of the
performance regressions that dictates the way forward:

 - stress-ng.sigsuspend.ops_per_sec regressions, -93%:

    Clearly, signal delivery performance hurts from delayed preemption, but
    that should be straightforward to resolve if we are willing to commit
    to adding a high-prio insta-wakeup variant API ...

 - stress-ng.exec.ops_per_sec -34% regression:

    Likewise, this possibly indicates that it's better to reschedule
    immediately during exec() - but maybe there's more to it, and it reflects
    some unfavorable migration, as suggested by the NUMA locality figures:

                     %change         %stddev
                        |                \                                         
  79317172           -34.2%   52217838 ±  3%  numa-numastat.node0.local_node
  79360983           -34.2%   52240348 ±  3%  numa-numastat.node0.numa_hit                            
  77971050           -33.2%   52068168 ±  3%  numa-numastat.node1.local_node
  78009071           -33.2%   52089987 ±  3%  numa-numastat.node1.numa_hit
     88287           -45.7%      47970 ±  2%  vmstat.system.cs

 - 'blogbench' regression of -59%:

    It too has a very large reduction in context switches:

         %stddev     %change         %stddev
             \          |                \  
     30035           -49.7%      15097 ±  3%  vmstat.system.cs
   2243545 ±  2%      -4.1%    2152228        blogbench.read_score
  52412617           -28.3%   37571769        blogbench.time.file_system_outputs
   2682930           -74.1%     694136        blogbench.time.involuntary_context_switches
   2369329           -50.0%    1184098 ±  5%  blogbench.time.voluntary_context_switches
      5851           -35.9%       3752 ±  2%  blogbench.write_score

    It's unclear to me what's happening with this one just from these stats,
    but it's "write_score" that hurts the most.
 
 - 'stress-ng.filename.ops_per_sec' regression of -19%:

    This test suffered from an *increase* in context-switching, and a large
    increase in CPU-idle:

         %stddev     %change         %stddev
             \          |                \  
   4641666           +19.5%    5545394 ±  2%  cpuidle..usage
     90589 ±  2%     +70.5%     154471 ±  2%  vmstat.system.cs
    628439           -19.2%     507711        stress-ng.filename.ops
     10317           -19.0%       8355        stress-ng.filename.ops_per_sec

    171981           -59.7%      69333 ±  3%  stress-ng.time.involuntary_context_switches
    770691 ±  3%    +200.9%    2319214        stress-ng.time.voluntary_context_switches

Anyway, it's clear from these results that while many workloads are hurt by
our notion of wakeup-preemption - especially generic ones like the
phoronix-test-suite, which have no good way to turn off wakeup preemption
(SCHED_BATCH might help though; see the sketch below) - there are also
several workloads that clearly benefit from it.
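
For completeness, opting a workload's tasks into SCHED_BATCH only needs the
standard sched_setscheduler() call (or 'chrt -b 0 <cmd>'). A minimal sketch,
nothing here is specific to the patch:

        /*
         * Minimal sketch: switch the calling task to SCHED_BATCH, which
         * (among other effects) makes the scheduler disfavor it in
         * wakeup-preemption decisions when it wakes up.
         */
        #define _GNU_SOURCE     /* SCHED_BATCH is a Linux extension */
        #include <sched.h>
        #include <stdio.h>

        int main(void)
        {
                struct sched_param sp = { .sched_priority = 0 }; /* must be 0 */

                if (sched_setscheduler(0 /* self */, SCHED_BATCH, &sp))
                        perror("sched_setscheduler");

                /* ... run the batch-y workload from here on ... */
                return 0;
        }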

One way to approach this would be, instead of always doing wakeup-preemption
(our current default), to turn it around and only use it when it is clearly
beneficial - such as for signal delivery or exec().
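
Roughly along these lines - purely illustrative pseudo-code, the
WF_WAKEUP_PREEMPT flag and the helper name are made up for this sketch (the
real decision point is the fair class's wakeup-preemption check), and this
is not what the current patch does:

        /*
         * Illustrative sketch only: make wakeup preemption opt-in via a
         * (hypothetical) wake flag that callers like signal delivery or
         * the exec() path would pass, instead of preempting on every
         * same-prio SCHED_OTHER wakeup.
         */
        static void check_preempt_wakeup_sketch(struct rq *rq,
                                                struct task_struct *p,
                                                int wake_flags)
        {
                /* Default: do not preempt rq->curr on a same-prio wakeup ... */
                if (!(wake_flags & WF_WAKEUP_PREEMPT))
                        return;

                /* ... unless the waker explicitly asked for it: */
                resched_curr(rq);
        }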

The canonical way to solve this would be to give *userspace* a way to signal
that it's beneficial to preempt immediately, i.e. yield() - but right now
that interface hurts tasks that only want to give other tasks a chance to
run, without necessarily giving up their own right to run:

        se->deadline += calc_delta_fair(se->slice, se);
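
(That line is from yield_task_fair(): calc_delta_fair() scales the slice by
the task's weight, so the yielding task's virtual deadline gets pushed out
by a full slice - the caller doesn't just step aside, it defers its own
right to run as well.)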

Anyway, my patch is obviously a no-go as-is, and this clearly needs more work.

Thanks,

	Ingo


