Hello Peter,

Sorry for being late to the party, but a couple of benchmarks are (very!)
unhappy with EEVDF, even with this optimization. I'll leave the results of
testing on a dual-socket 3rd Generation EPYC system (2 x 64C/128T) running in
NPS1 mode below.

tl;dr

- Hackbench with medium load, tbench when overloaded, and DeathStarBench are
  not fans of EEVDF so far :(

- schbench, when the system is overloaded, sees a great benefit in 99th
  percentile latency, but that is expected since the deadline is fixed to
  (vruntime + base_slice) and base_slice_ns is kept equal to the legacy
  min_granularity_ns in all cases (a rough sketch of this calculation follows
  the list). Some cases of unixbench see a good benefit too.

- Others seem perf neutral.
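To make that expectation concrete, below is a minimal, self-contained sketch
(not the kernel implementation; the helper name, the 3 ms request value, and
the plain nice-0 weight scaling are illustrative assumptions) of the EEVDF
virtual-deadline arithmetic, vd_i = ve_i + r_i / w_i. The only point it makes
is that, with base_slice_ns left at the legacy min_granularity_ns value, the
deadline a waking task has to beat is never more than one weight-scaled base
slice past the current task's vruntime.

/*
 * Illustrative sketch only -- not kernel code. It mirrors the EEVDF rule
 * vd_i = ve_i + r_i / w_i: the virtual deadline is the virtual runtime plus
 * the request (base slice) scaled inversely by the entity's weight.
 */
#include <stdio.h>
#include <stdint.h>

#define NICE_0_WEIGHT 1024ULL   /* reference weight of a nice-0 task */

/* Hypothetical helper: vd_i from ve_i, the request r_i (ns), and weight w_i. */
static uint64_t virtual_deadline(uint64_t vruntime_ns, uint64_t base_slice_ns,
                                 uint64_t weight)
{
        /* Heavier entities get a closer deadline for the same request size. */
        return vruntime_ns + (base_slice_ns * NICE_0_WEIGHT) / weight;
}

int main(void)
{
        /* 3 ms picked purely for illustration; the knob is base_slice_ns. */
        uint64_t base_slice_ns = 3000000ULL;

        printf("nice-0 task:        vd = %llu ns\n", (unsigned long long)
               virtual_deadline(1000000ULL, base_slice_ns, 1024ULL));
        printf("double-weight task: vd = %llu ns\n", (unsigned long long)
               virtual_deadline(1000000ULL, base_slice_ns, 2048ULL));
        return 0;
}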
On 8/17/2023 8:40 PM, tip-bot2 for Peter Zijlstra wrote:
> The following commit has been merged into the sched/core branch of tip:
>
> Commit-ID:     63304558ba5dcaaff9e052ee43cfdcc7f9c29e85
> Gitweb:        https://git.kernel.org/tip/63304558ba5dcaaff9e052ee43cfdcc7f9c29e85
> Author:        Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> AuthorDate:    Wed, 16 Aug 2023 15:40:59 +02:00
> Committer:     Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> CommitterDate: Thu, 17 Aug 2023 17:07:07 +02:00
>
> sched/eevdf: Curb wakeup-preemption
>
> Mike and others noticed that EEVDF does like to over-schedule quite a
> bit -- which does hurt performance of a number of benchmarks /
> workloads.
>
> In particular, what seems to cause over-scheduling is that when lag is
> of the same order (or larger) than the request / slice then placement
> will not only cause the task to be placed left of current, but also
> with a smaller deadline than current, which causes immediate
> preemption.
>
> [ notably, lag bounds are relative to HZ ]
>
> Mike suggested we stick to picking 'current' for as long as it's
> eligible to run, giving it uninterrupted runtime until it reaches
> parity with the pack.
>
> Augment Mike's suggestion by only allowing it to exhaust it's initial
> request.
>
> One random data point:
>
> echo NO_RUN_TO_PARITY > /debug/sched/features
> perf stat -a -e context-switches --repeat 10 -- perf bench sched messaging -g 20 -t -l 5000
>
> 3,723,554 context-switches ( +- 0.56% )
> 9.5136 +- 0.0394 seconds time elapsed ( +- 0.41% )
>
> echo RUN_TO_PARITY > /debug/sched/features
> perf stat -a -e context-switches --repeat 10 -- perf bench sched messaging -g 20 -t -l 5000
>
> 2,556,535 context-switches ( +- 0.51% )
> 9.2427 +- 0.0302 seconds time elapsed ( +- 0.33% )

o System Details

- 3rd Generation EPYC System
- 2 x 64C/128T
- NPS1 mode

o Kernels

base:       tip:sched/core at commit 752182b24bf4
            ("Merge tag 'v6.5-rc2' into sched/core, to pick up fixes")

eevdf:      tip:sched/core at commit c1fc6484e1fb
            ("sched/rt: sysctl_sched_rr_timeslice show default timeslice after reset")

eevdf_curb: tip:sched/core at commit 63304558ba5d
            ("sched/eevdf: Curb wakeup-preemption")

o Benchmark Results

* - Regression
^ - Improvement

==================================================================
Test          : hackbench
Units         : Normalized time in seconds
Interpretation: Lower is better
Statistic     : AMean
==================================================================
Case:           base[pct imp](CV)      eevdf[pct imp](CV)    eevdf-curb[pct imp](CV)
 1-groups     1.00 [ -0.00]( 2.51)    1.02 [ -1.69]( 1.89)    1.03 [ -2.54]( 2.42)
 2-groups     1.00 [ -0.00]( 1.63)    1.05 [ -4.68]( 2.04)    1.04 [ -3.75]( 1.25)  *
 4-groups     1.00 [ -0.00]( 1.80)    1.07 [ -7.47]( 2.38)    1.07 [ -6.81]( 1.68)  *
 8-groups     1.00 [ -0.00]( 1.43)    1.06 [ -6.22]( 1.52)    1.06 [ -6.43]( 1.32)  *
16-groups     1.00 [ -0.00]( 1.04)    1.01 [ -1.27]( 3.44)    1.02 [ -1.55]( 2.58)

==================================================================
Test          : tbench
Units         : Normalized throughput
Interpretation: Higher is better
Statistic     : AMean
==================================================================
Clients:        base[pct imp](CV)      eevdf[pct imp](CV)    eevdf-curb[pct imp](CV)
    1         1.00 [  0.00]( 0.49)    1.01 [  0.97]( 0.18)    1.01 [  0.52]( 0.06)
    2         1.00 [  0.00]( 1.94)    1.02 [  2.36]( 0.63)    1.02 [  1.62]( 0.63)
    4         1.00 [  0.00]( 1.07)    1.00 [ -0.19]( 0.86)    1.01 [  0.76]( 1.19)
    8         1.00 [  0.00]( 1.41)    1.02 [  1.69]( 0.22)    1.01 [  1.48]( 0.73)
   16         1.00 [  0.00]( 1.31)    1.04 [  3.72]( 1.99)    1.05 [  4.67]( 1.36)
   32         1.00 [  0.00]( 5.31)    1.04 [  3.53]( 4.29)    1.05 [  4.52]( 2.21)
   64         1.00 [  0.00]( 3.08)    1.12 [ 12.12]( 1.71)    1.10 [ 10.19]( 3.06)
  128         1.00 [  0.00]( 1.54)    1.01 [  1.02]( 0.65)    0.98 [ -2.23]( 0.62)
  256         1.00 [  0.00]( 1.09)    0.95 [ -5.42]( 0.19)    0.92 [ -7.86]( 0.50)  *
  512         1.00 [  0.00]( 0.20)    0.91 [ -9.03]( 0.20)    0.90 [-10.25]( 0.29)  *
 1024         1.00 [  0.00]( 0.22)    0.88 [-12.47]( 0.29)    0.87 [-13.46]( 0.49)  *

==================================================================
Test          : stream-10
Units         : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic     : HMean
==================================================================
Test:           base[pct imp](CV)      eevdf[pct imp](CV)    eevdf-curb[pct imp](CV)
 Copy         1.00 [  0.00]( 3.95)    1.00 [  0.03]( 4.32)    1.02 [  2.26]( 2.73)
Scale         1.00 [  0.00]( 8.33)    1.05 [  5.17]( 5.21)    1.05 [  4.80]( 5.48)
  Add         1.00 [  0.00]( 8.15)    1.05 [  4.50]( 6.25)    1.04 [  4.44]( 5.53)
Triad         1.00 [  0.00]( 3.11)    0.93 [ -6.55](10.74)    0.97 [ -2.86]( 7.14)

==================================================================
Test          : stream-100
Units         : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic     : HMean
==================================================================
Test:           base[pct imp](CV)      eevdf[pct imp](CV)    eevdf-curb[pct imp](CV)
 Copy         1.00 [  0.00]( 0.95)    1.00 [  0.30]( 0.70)    1.00 [  0.30]( 1.08)
Scale         1.00 [  0.00]( 0.73)    0.97 [ -2.93]( 6.55)    1.00 [  0.15]( 0.82)
  Add         1.00 [  0.00]( 1.69)    0.98 [ -2.19]( 6.53)    1.01 [  0.88]( 1.08)
Triad         1.00 [  0.00]( 7.49)    1.02 [  2.02]( 6.66)    1.05 [  4.88]( 4.56)

==================================================================
Test          : netperf
Units         : Normalized Throughput
Interpretation: Higher is better
Statistic     : AMean
==================================================================
Clients:        base[pct imp](CV)      eevdf[pct imp](CV)    eevdf-curb[pct imp](CV)
 1-clients    1.00 [  0.00]( 1.07)    1.00 [  0.42]( 0.46)    1.01 [  1.02]( 0.70)
 2-clients    1.00 [  0.00]( 0.78)    1.00 [ -0.26]( 0.38)    1.00 [  0.40]( 0.92)
 4-clients    1.00 [  0.00]( 0.96)    1.01 [  0.77]( 0.72)    1.01 [  1.07]( 0.83)
 8-clients    1.00 [  0.00]( 0.53)    1.00 [ -0.30]( 0.98)    1.00 [  0.15]( 0.82)
16-clients    1.00 [  0.00]( 1.05)    1.00 [  0.22]( 0.70)    1.01 [  0.54]( 1.26)
32-clients    1.00 [  0.00]( 1.29)    1.00 [  0.12]( 0.74)    1.00 [  0.16]( 1.24)
64-clients    1.00 [  0.00]( 2.80)    1.00 [ -0.27]( 2.24)    1.00 [  0.32]( 3.06)
128-clients   1.00 [  0.00]( 1.57)    1.00 [ -0.42]( 1.72)    0.99 [ -0.63]( 1.64)
256-clients   1.00 [  0.00]( 3.85)    1.02 [  2.40]( 4.44)    1.00 [  0.45]( 3.71)
512-clients   1.00 [  0.00](45.83)    1.00 [  0.12](52.42)    0.97 [ -2.75](57.69)

==================================================================
Test          : schbench (old)
Units         : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic     : Median
==================================================================
#workers:       base[pct imp](CV)      eevdf[pct imp](CV)    eevdf-curb[pct imp](CV)
  1           1.00 [ -0.00]( 2.28)    1.00 [ -0.00]( 2.28)    1.00 [ -0.00]( 2.28)
  2           1.00 [ -0.00](11.27)    1.27 [-27.27]( 6.42)    1.14 [-13.64](11.02)  *
  4           1.00 [ -0.00]( 1.95)    1.00 [ -0.00]( 3.77)    0.93 [  6.67]( 4.22)
  8           1.00 [ -0.00]( 4.17)    1.03 [ -2.70](13.83)    0.95 [  5.41]( 1.63)
 16           1.00 [ -0.00]( 4.17)    0.98 [  2.08]( 4.37)    1.04 [ -4.17]( 3.53)
 32           1.00 [ -0.00]( 1.89)    1.00 [ -0.00]( 8.69)    0.96 [  3.70]( 5.14)
 64           1.00 [ -0.00]( 3.66)    1.03 [ -3.31]( 2.30)    1.06 [ -5.96]( 2.56)
128           1.00 [ -0.00]( 5.79)    0.85 [ 14.77](12.12)    0.97 [  3.15]( 6.76)  ^
256           1.00 [ -0.00]( 8.50)    0.15 [ 84.84](26.04)    0.17 [ 83.43]( 8.04)  ^
512           1.00 [ -0.00]( 2.01)    0.28 [ 72.09]( 5.62)    0.28 [ 72.35]( 3.48)  ^

==================================================================
Test          : Unixbench
Units         : Various, Throughput
Interpretation: Higher is better
Statistic     : AMean, Hmean (Specified)
==================================================================
                                           tip                     eevdf                eevdf-curb
Hmean unixbench-dhry2reg-1       41333812.04 (  0.00%)    41248390.97 ( -0.21%)    41576959.80 (  0.59%)
Hmean unixbench-dhry2reg-512   6244993319.97 (  0.00%)  6239969914.15 ( -0.08%)  6223263669.12 ( -0.35%)
Amean unixbench-syscall-1         2932426.17 (  0.00%)     2968518.27 * -1.23%*     2923093.63 *  0.32%*
Amean unixbench-syscall-512       7670057.70 (  0.00%)     7790656.20 * -1.57%*     8300980.77 *  8.23%*  ^
Hmean unixbench-pipe-1            2571551.92 (  0.00%)     2535689.01 * -1.39%*     2472718.52 * -3.84%*
Hmean unixbench-pipe-512        366469338.93 (  0.00%)   361385055.25 * -1.39%*   363215893.62 * -0.89%*
Hmean unixbench-spawn-1              4263.51 (  0.00%)        4506.26 *  5.69%*        4520.53 *  6.03%*  ^
Hmean unixbench-spawn-512           67782.44 (  0.00%)       69380.09 *  2.36%*       69709.04 *  2.84%*
Hmean unixbench-execl-1              3829.47 (  0.00%)        3824.57 ( -0.13%)        3835.20 (  0.15%)
Hmean unixbench-execl-512           11929.77 (  0.00%)       12288.64 (  3.01%)       13096.25 *  9.78%*  ^

==================================================================
Test          : ycsb-mongodb
Units         : Throughput
Interpretation: Higher is better
Statistic     : AMean
==================================================================

base:        303129.00 (var: 0.68%)
eevdf:       309589.33 (var: 1.41%)    (+2.13%)
eevdf-curb:  303940.00 (var: 1.09%)    (+0.27%)

==================================================================
Test          : DeathStarBench
Units         : %diff of Throughput
Interpretation: Higher is better
Statistic     : AMean
==================================================================
         base      eevdf      eevdf_curb
1CCD       0%    -15.15%      -16.55%
2CCD       0%    -13.80%      -16.23%
4CCD       0%     -7.50%      -10.11%
8CCD       0%     -3.42%       -3.68%

--

I'll go back and profile hackbench, tbench, and DeathStarBench, and will keep
the thread updated with any findings. Let me know if you have any pointers for
debugging. I plan on using Chenyu's schedstats extension unless IBS or
idle-info show some obvious problems - thank you Chenyu for sharing the
schedstats patch :)

>
> Suggested-by: Mike Galbraith <umgwanakikbuti@xxxxxxxxx>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
> Link: https://lkml.kernel.org/r/20230816134059.GC982867@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
> ---
>  kernel/sched/fair.c     | 12 ++++++++++++
>  kernel/sched/features.h |  1 +
>  2 files changed, 13 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f496cef..0b7445c 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -873,6 +873,13 @@ static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
>  	if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
>  		curr = NULL;
>
> +	/*
> +	 * Once selected, run a task until it either becomes non-eligible or
> +	 * until it gets a new slice. See the HACK in set_next_entity().
> +	 */
> +	if (sched_feat(RUN_TO_PARITY) && curr && curr->vlag == curr->deadline)
> +		return curr;
> +
>  	while (node) {
>  		struct sched_entity *se = __node_2_se(node);
>
> @@ -5167,6 +5174,11 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  		update_stats_wait_end_fair(cfs_rq, se);
>  		__dequeue_entity(cfs_rq, se);
>  		update_load_avg(cfs_rq, se, UPDATE_TG);
> +		/*
> +		 * HACK, stash a copy of deadline at the point of pick in vlag,
> +		 * which isn't used until dequeue.
> +		 */
> +		se->vlag = se->deadline;
>  	}
>
>  	update_stats_curr_start(cfs_rq, se);
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index 61bcbf5..f770168 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -6,6 +6,7 @@
>   */
>  SCHED_FEAT(PLACE_LAG, true)
>  SCHED_FEAT(PLACE_DEADLINE_INITIAL, true)
> +SCHED_FEAT(RUN_TO_PARITY, true)
>
>  /*
>   * Prefer to schedule the task we woke last (assuming it failed

--
Thanks and Regards,
Prateek