Hello Peter,

Sorry for being late to the party, but a couple of benchmarks are (very!)
unhappy with EEVDF, even with this optimization. I'll leave the results of
testing on a dual-socket 3rd Generation EPYC system (2 x 64C/128T) running in
NPS1 mode below.

tl;dr

- Hackbench with medium load, tbench when overloaded, and DeathStarBench are
  not fans of EEVDF so far :(

- schbench, when the system is overloaded, sees a great benefit in 99th
  percentile latency, but that is expected since the deadline is fixed to
  (vruntime + base_slice) and base_slice_ns is kept equal to the legacy
  min_granularity_ns in all cases (a rough sketch of this calculation follows
  the list). Some cases of unixbench see a good benefit too.

- Others seem perf neutral.
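To make that expectation concrete, below is a minimal, self-contained sketch
(not the kernel implementation; the helper name, the 3 ms request value, and
the plain nice-0 weight scaling are illustrative assumptions) of the EEVDF
virtual-deadline arithmetic, vd_i = ve_i + r_i / w_i. The only point it makes
is that, with base_slice_ns left at the legacy min_granularity_ns value, the
deadline a waking task has to beat is never more than one weight-scaled base
slice past the current task's vruntime.

/*
 * Illustrative sketch only -- not kernel code. It mirrors the EEVDF rule
 * vd_i = ve_i + r_i / w_i: the virtual deadline is the virtual runtime plus
 * the request (base slice) scaled inversely by the entity's weight.
 */
#include <stdio.h>
#include <stdint.h>

#define NICE_0_WEIGHT 1024ULL   /* reference weight of a nice-0 task */

/* Hypothetical helper: vd_i from ve_i, the request r_i (ns), and weight w_i. */
static uint64_t virtual_deadline(uint64_t vruntime_ns, uint64_t base_slice_ns,
                                 uint64_t weight)
{
        /* Heavier entities get a closer deadline for the same request size. */
        return vruntime_ns + (base_slice_ns * NICE_0_WEIGHT) / weight;
}

int main(void)
{
        /* 3 ms picked purely for illustration; the knob is base_slice_ns. */
        uint64_t base_slice_ns = 3000000ULL;

        printf("nice-0 task:        vd = %llu ns\n", (unsigned long long)
               virtual_deadline(1000000ULL, base_slice_ns, 1024ULL));
        printf("double-weight task: vd = %llu ns\n", (unsigned long long)
               virtual_deadline(1000000ULL, base_slice_ns, 2048ULL));
        return 0;
}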
On 8/17/2023 8:40 PM, tip-bot2 for Peter Zijlstra wrote:
> The following commit has been merged into the sched/core branch of tip:
>
> Commit-ID:     63304558ba5dcaaff9e052ee43cfdcc7f9c29e85
> Gitweb:        https://git.kernel.org/tip/63304558ba5dcaaff9e052ee43cfdcc7f9c29e85
> Author:        Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> AuthorDate:    Wed, 16 Aug 2023 15:40:59 +02:00
> Committer:     Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> CommitterDate: Thu, 17 Aug 2023 17:07:07 +02:00
>
> sched/eevdf: Curb wakeup-preemption
>
> Mike and others noticed that EEVDF does like to over-schedule quite a
> bit -- which does hurt performance of a number of benchmarks /
> workloads.
>
> In particular, what seems to cause over-scheduling is that when lag is
> of the same order (or larger) than the request / slice then placement
> will not only cause the task to be placed left of current, but also
> with a smaller deadline than current, which causes immediate
> preemption.
>
> [ notably, lag bounds are relative to HZ ]
>
> Mike suggested we stick to picking 'current' for as long as it's
> eligible to run, giving it uninterrupted runtime until it reaches
> parity with the pack.
>
> Augment Mike's suggestion by only allowing it to exhaust it's initial
> request.
>
> One random data point:
>
> echo NO_RUN_TO_PARITY > /debug/sched/features
> perf stat -a -e context-switches --repeat 10 -- perf bench sched messaging -g 20 -t -l 5000
>
> 3,723,554 context-switches ( +- 0.56% )
> 9.5136 +- 0.0394 seconds time elapsed ( +- 0.41% )
>
> echo RUN_TO_PARITY > /debug/sched/features
> perf stat -a -e context-switches --repeat 10 -- perf bench sched messaging -g 20 -t -l 5000
>
> 2,556,535 context-switches ( +- 0.51% )
> 9.2427 +- 0.0302 seconds time elapsed ( +- 0.33% )

o System Details

- 3rd Generation EPYC System
- 2 x 64C/128T
- NPS1 mode

o Kernels

base:       tip:sched/core at commit 752182b24bf4
            ("Merge tag 'v6.5-rc2' into sched/core, to pick up fixes")

eevdf:      tip:sched/core at commit c1fc6484e1fb
            ("sched/rt: sysctl_sched_rr_timeslice show default timeslice after reset")

eevdf_curb: tip:sched/core at commit 63304558ba5d
            ("sched/eevdf: Curb wakeup-preemption")

o Benchmark Results

* - Regression
^ - Improvement

==================================================================
Test          : hackbench
Units         : Normalized time in seconds
Interpretation: Lower is better
Statistic     : AMean
==================================================================
Case:           base[pct imp](CV)      eevdf[pct imp](CV)    eevdf-curb[pct imp](CV)
 1-groups     1.00 [ -0.00]( 2.51)    1.02 [ -1.69]( 1.89)    1.03 [ -2.54]( 2.42)
 2-groups     1.00 [ -0.00]( 1.63)    1.05 [ -4.68]( 2.04)    1.04 [ -3.75]( 1.25)  *
 4-groups     1.00 [ -0.00]( 1.80)    1.07 [ -7.47]( 2.38)    1.07 [ -6.81]( 1.68)  *
 8-groups     1.00 [ -0.00]( 1.43)    1.06 [ -6.22]( 1.52)    1.06 [ -6.43]( 1.32)  *
16-groups     1.00 [ -0.00]( 1.04)    1.01 [ -1.27]( 3.44)    1.02 [ -1.55]( 2.58)

==================================================================
Test          : tbench
Units         : Normalized throughput
Interpretation: Higher is better
Statistic     : AMean
==================================================================
Clients:        base[pct imp](CV)      eevdf[pct imp](CV)    eevdf-curb[pct imp](CV)
    1         1.00 [  0.00]( 0.49)    1.01 [  0.97]( 0.18)    1.01 [  0.52]( 0.06)
    2         1.00 [  0.00]( 1.94)    1.02 [  2.36]( 0.63)    1.02 [  1.62]( 0.63)
    4         1.00 [  0.00]( 1.07)    1.00 [ -0.19]( 0.86)    1.01 [  0.76]( 1.19)
    8         1.00 [  0.00]( 1.41)    1.02 [  1.69]( 0.22)    1.01 [  1.48]( 0.73)
   16         1.00 [  0.00]( 1.31)    1.04 [  3.72]( 1.99)    1.05 [  4.67]( 1.36)
   32         1.00 [  0.00]( 5.31)    1.04 [  3.53]( 4.29)    1.05 [  4.52]( 2.21)
   64         1.00 [  0.00]( 3.08)    1.12 [ 12.12]( 1.71)    1.10 [ 10.19]( 3.06)
  128         1.00 [  0.00]( 1.54)    1.01 [  1.02]( 0.65)    0.98 [ -2.23]( 0.62)
  256         1.00 [  0.00]( 1.09)    0.95 [ -5.42]( 0.19)    0.92 [ -7.86]( 0.50)  *
  512         1.00 [  0.00]( 0.20)    0.91 [ -9.03]( 0.20)    0.90 [-10.25]( 0.29)  *
 1024         1.00 [  0.00]( 0.22)    0.88 [-12.47]( 0.29)    0.87 [-13.46]( 0.49)  *

==================================================================
Test          : stream-10
Units         : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic     : HMean
==================================================================
Test:           base[pct imp](CV)      eevdf[pct imp](CV)    eevdf-curb[pct imp](CV)
 Copy         1.00 [  0.00]( 3.95)    1.00 [  0.03]( 4.32)    1.02 [  2.26]( 2.73)
Scale         1.00 [  0.00]( 8.33)    1.05 [  5.17]( 5.21)    1.05 [  4.80]( 5.48)
  Add         1.00 [  0.00]( 8.15)    1.05 [  4.50]( 6.25)    1.04 [  4.44]( 5.53)
Triad         1.00 [  0.00]( 3.11)    0.93 [ -6.55](10.74)    0.97 [ -2.86]( 7.14)

==================================================================
Test          : stream-100
Units         : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic     : HMean
==================================================================
Test:           base[pct imp](CV)      eevdf[pct imp](CV)    eevdf-curb[pct imp](CV)
 Copy         1.00 [  0.00]( 0.95)    1.00 [  0.30]( 0.70)    1.00 [  0.30]( 1.08)
Scale         1.00 [  0.00]( 0.73)    0.97 [ -2.93]( 6.55)    1.00 [  0.15]( 0.82)
  Add         1.00 [  0.00]( 1.69)    0.98 [ -2.19]( 6.53)    1.01 [  0.88]( 1.08)
Triad         1.00 [  0.00]( 7.49)    1.02 [  2.02]( 6.66)    1.05 [  4.88]( 4.56)

==================================================================
Test          : netperf
Units         : Normalized Throughput
Interpretation: Higher is better
Statistic     : AMean
==================================================================
Clients:        base[pct imp](CV)      eevdf[pct imp](CV)    eevdf-curb[pct imp](CV)
 1-clients    1.00 [  0.00]( 1.07)    1.00 [  0.42]( 0.46)    1.01 [  1.02]( 0.70)
 2-clients    1.00 [  0.00]( 0.78)    1.00 [ -0.26]( 0.38)    1.00 [  0.40]( 0.92)
 4-clients    1.00 [  0.00]( 0.96)    1.01 [  0.77]( 0.72)    1.01 [  1.07]( 0.83)
 8-clients    1.00 [  0.00]( 0.53)    1.00 [ -0.30]( 0.98)    1.00 [  0.15]( 0.82)
16-clients    1.00 [  0.00]( 1.05)    1.00 [  0.22]( 0.70)    1.01 [  0.54]( 1.26)
32-clients    1.00 [  0.00]( 1.29)    1.00 [  0.12]( 0.74)    1.00 [  0.16]( 1.24)
64-clients    1.00 [  0.00]( 2.80)    1.00 [ -0.27]( 2.24)    1.00 [  0.32]( 3.06)
128-clients   1.00 [  0.00]( 1.57)    1.00 [ -0.42]( 1.72)    0.99 [ -0.63]( 1.64)
256-clients   1.00 [  0.00]( 3.85)    1.02 [  2.40]( 4.44)    1.00 [  0.45]( 3.71)
512-clients   1.00 [  0.00](45.83)    1.00 [  0.12](52.42)    0.97 [ -2.75](57.69)

==================================================================
Test          : schbench (old)
Units         : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic     : Median
==================================================================
#workers:       base[pct imp](CV)      eevdf[pct imp](CV)    eevdf-curb[pct imp](CV)
  1           1.00 [ -0.00]( 2.28)    1.00 [ -0.00]( 2.28)    1.00 [ -0.00]( 2.28)
  2           1.00 [ -0.00](11.27)    1.27 [-27.27]( 6.42)    1.14 [-13.64](11.02)  *
  4           1.00 [ -0.00]( 1.95)    1.00 [ -0.00]( 3.77)    0.93 [  6.67]( 4.22)
  8           1.00 [ -0.00]( 4.17)    1.03 [ -2.70](13.83)    0.95 [  5.41]( 1.63)
 16           1.00 [ -0.00]( 4.17)    0.98 [  2.08]( 4.37)    1.04 [ -4.17]( 3.53)
 32           1.00 [ -0.00]( 1.89)    1.00 [ -0.00]( 8.69)    0.96 [  3.70]( 5.14)
 64           1.00 [ -0.00]( 3.66)    1.03 [ -3.31]( 2.30)    1.06 [ -5.96]( 2.56)
128           1.00 [ -0.00]( 5.79)    0.85 [ 14.77](12.12)    0.97 [  3.15]( 6.76)  ^
256           1.00 [ -0.00]( 8.50)    0.15 [ 84.84](26.04)    0.17 [ 83.43]( 8.04)  ^
512           1.00 [ -0.00]( 2.01)    0.28 [ 72.09]( 5.62)    0.28 [ 72.35]( 3.48)  ^

==================================================================
Test          : Unixbench
Units         : Various, Throughput
Interpretation: Higher is better
Statistic     : AMean, Hmean (Specified)
==================================================================
                                           tip                     eevdf                eevdf-curb
Hmean unixbench-dhry2reg-1       41333812.04 (  0.00%)    41248390.97 ( -0.21%)    41576959.80 (  0.59%)
Hmean unixbench-dhry2reg-512   6244993319.97 (  0.00%)  6239969914.15 ( -0.08%)  6223263669.12 ( -0.35%)
Amean unixbench-syscall-1         2932426.17 (  0.00%)     2968518.27 * -1.23%*     2923093.63 *  0.32%*
Amean unixbench-syscall-512       7670057.70 (  0.00%)     7790656.20 * -1.57%*     8300980.77 *  8.23%*  ^
Hmean unixbench-pipe-1            2571551.92 (  0.00%)     2535689.01 * -1.39%*     2472718.52 * -3.84%*
Hmean unixbench-pipe-512        366469338.93 (  0.00%)   361385055.25 * -1.39%*   363215893.62 * -0.89%*
Hmean unixbench-spawn-1              4263.51 (  0.00%)        4506.26 *  5.69%*        4520.53 *  6.03%*  ^
Hmean unixbench-spawn-512           67782.44 (  0.00%)       69380.09 *  2.36%*       69709.04 *  2.84%*
Hmean unixbench-execl-1              3829.47 (  0.00%)        3824.57 ( -0.13%)        3835.20 (  0.15%)
Hmean unixbench-execl-512           11929.77 (  0.00%)       12288.64 (  3.01%)       13096.25 *  9.78%*  ^

==================================================================
Test          : ycsb-mongodb
Units         : Throughput
Interpretation: Higher is better
Statistic     : AMean
==================================================================

base:        303129.00 (var: 0.68%)
eevdf:       309589.33 (var: 1.41%)    (+2.13%)
eevdf-curb:  303940.00 (var: 1.09%)    (+0.27%)

==================================================================
Test          : DeathStarBench
Units         : %diff of Throughput
Interpretation: Higher is better
Statistic     : AMean
==================================================================
         base      eevdf      eevdf_curb
1CCD       0%    -15.15%      -16.55%
2CCD       0%    -13.80%      -16.23%
4CCD       0%     -7.50%      -10.11%
8CCD       0%     -3.42%       -3.68%

--

I'll go back and profile hackbench, tbench, and DeathStarBench, and will keep
the thread updated with any findings. Let me know if you have any pointers for
debugging. I plan on using Chenyu's schedstats extension unless IBS or
idle-info show some obvious problems - thank you Chenyu for sharing the
schedstats patch :)

>
> Suggested-by: Mike Galbraith <umgwanakikbuti@xxxxxxxxx>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
> Link: https://lkml.kernel.org/r/20230816134059.GC982867@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
> ---
>  kernel/sched/fair.c     | 12 ++++++++++++
>  kernel/sched/features.h |  1 +
>  2 files changed, 13 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f496cef..0b7445c 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -873,6 +873,13 @@ static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
>  	if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
>  		curr = NULL;
>
> +	/*
> +	 * Once selected, run a task until it either becomes non-eligible or
> +	 * until it gets a new slice. See the HACK in set_next_entity().
> +	 */
> +	if (sched_feat(RUN_TO_PARITY) && curr && curr->vlag == curr->deadline)
> +		return curr;
> +
>  	while (node) {
>  		struct sched_entity *se = __node_2_se(node);
>
> @@ -5167,6 +5174,11 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  		update_stats_wait_end_fair(cfs_rq, se);
>  		__dequeue_entity(cfs_rq, se);
>  		update_load_avg(cfs_rq, se, UPDATE_TG);
> +		/*
> +		 * HACK, stash a copy of deadline at the point of pick in vlag,
> +		 * which isn't used until dequeue.
> +		 */
> +		se->vlag = se->deadline;
>  	}
>
>  	update_stats_curr_start(cfs_rq, se);
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index 61bcbf5..f770168 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -6,6 +6,7 @@
>   */
>  SCHED_FEAT(PLACE_LAG, true)
>  SCHED_FEAT(PLACE_DEADLINE_INITIAL, true)
> +SCHED_FEAT(RUN_TO_PARITY, true)
>
>  /*
>   * Prefer to schedule the task we woke last (assuming it failed

--
Thanks and Regards,
Prateek