On 9 Apr 2023 22:13:50 -0500 David Vernet <void@xxxxxxxxxxxxx>
> Hi Peter,
>
> I used the EEVDF scheduler to run workloads on one of Meta's largest
> services (our main HHVM web server), and I wanted to share my
> observations with you.

Thanks for your testing.

> 3. Low latency + long slice are not mutually exclusive for us
>
> An interesting quality of web workloads running JIT engines is that they
> require both low latency and long slices on the CPU. The reason we need
> the tasks to be low latency is that they're on the critical path for
> servicing web requests (for most of their runtime, at least), and the
> reasons we need them to have long slices are enumerated above -- they
> thrash the icache / DSB / iTLB, more aggressive context switching causes
> us to thrash on paging from disk, and in general, these tasks are on the
> critical path for servicing web requests and we want to encourage them
> to run to completion.
>
> This causes EEVDF to perform poorly for workloads with these
> characteristics. If we decrease latency nice for our web workers then
> they'll have lower latency, but only because their slices are smaller.
> This in turn causes the increase in context switches, which causes the
> thrashing described above.

Take a look at the diff below.

> Worth noting -- I did try to increase the default base slice length by
> setting sysctl_sched_base_slice to 35ms, and these were the results:
>
> With EEVDF slice 35ms and latency_nice 0
> ----------------------------------------
> - 0.5 - 2.25% drop in throughput
> - 2.5 - 4.5% increase in p95 latencies
> - 2.5 - 5.25% increase in p99 latencies
> - Context switch per minute increase: 9.5 - 12.4%
> - Involuntary context switch increase: ~320 - 330%
> - Major fault delta: -3.6% to 37.6%
> - IPC decrease: 0.5 - 0.9%
>
> With EEVDF slice 35ms and latency_nice -8 for web workers
> ---------------------------------------------------------
> - 0.5 - 2.5% drop in throughput
> - 1.7 - 4.75% increase in p95 latencies
> - 2.5 - 5% increase in p99 latencies
> - Context switch per minute increase: 10.5 - 15%
> - Involuntary context switch increase: ~327 - 350%
> - Major fault delta: -1% to 45%
> - IPC decrease: 0.4 - 1.1%
>
> I was expecting the increase in context switches and involuntary context
> switches to be lower than what they ended up being with the increased
> default slice length. Regardless, it still seems to tell a relatively
> consistent story with the numbers from above. The improvement in IPC is
> expected, though smaller than I was anticipating (presumably due to the
> still-high context switch rate). There were also fewer major faults per
> minute compared to runs with a shorter default slice.
>
> Note that even if increasing the slice length did cause fewer context
> switches and major faults, I still expect that it would hurt throughput
> and latency for HHVM, given that when latency-nicer tasks are eventually
> given the CPU, the web workers will have to wait around for longer than
> we'd like while those tasks burn through their longer slices.
>
> In summary, I must admit that this patch set makes me a bit nervous.
> Speaking for Meta at least, the patch set in its current form causes
> performance regressions beyond what we're able to tolerate in production
> (generally < 0.5% at the very most). More broadly, it will certainly
> cause us to have to carefully consider how it affects our model for
> server capacity.
>
> Thanks,
> David
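For reference, here is a rough userspace sketch of how the two knobs used
in the runs above might be set. It is only a sketch under a couple of
assumptions that should be checked against the tree actually tested: that
the base slice is exposed as /sys/kernel/debug/sched/base_slice_ns (the
renamed min_granularity_ns knob), and that latency nice is set through
sched_setattr() with a sched_latency_nice field and a
SCHED_FLAG_LATENCY_NICE flag as in the latency-nice proposal.

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Local copy of the sched_attr uapi layout. The sched_latency_nice field
 * and the SCHED_FLAG_LATENCY_NICE value are assumptions taken from the
 * latency-nice proposal and only exist on a patched kernel.
 */
struct sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;
	uint64_t sched_deadline;
	uint64_t sched_period;
	uint32_t sched_util_min;
	uint32_t sched_util_max;
	int32_t  sched_latency_nice;	/* assumed new field */
};

#define SCHED_FLAG_LATENCY_NICE	0x80	/* assumed value, check the uapi header */

static int set_base_slice_ns(unsigned long long ns)
{
	/* assumed debugfs name for the renamed min_granularity_ns knob */
	int fd = open("/sys/kernel/debug/sched/base_slice_ns", O_WRONLY);
	char buf[32];
	int n, ret = -1;

	if (fd < 0)
		return -1;
	n = snprintf(buf, sizeof(buf), "%llu\n", ns);
	if (write(fd, buf, n) == n)
		ret = 0;
	close(fd);
	return ret;
}

static int set_latency_nice(pid_t pid, int latency_nice)
{
	struct sched_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.sched_flags = SCHED_FLAG_LATENCY_NICE;
	attr.sched_latency_nice = latency_nice;
	/* note: without KEEP flags this also resets policy/nice to SCHED_OTHER/0 */
	return syscall(SYS_sched_setattr, pid, &attr, 0);
}

int main(void)
{
	if (set_base_slice_ns(35ULL * 1000 * 1000))	/* 35ms, as in the runs above */
		perror("base_slice_ns");
	if (set_latency_nice(0, -8))			/* pid 0 == calling thread */
		perror("sched_setattr");
	return 0;
}

Build with gcc and run as root so the debugfs write succeeds.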
In order to narrow down the poor performance reported, make a tradeoff
between runtime and latency simply by restoring a
sysctl_sched_min_granularity-style check at tick preemption, given the
known order on the runqueue: with the diff below, curr is not preempted
at the tick until it has run for at least 1ms since it was last picked.

--- x/kernel/sched/fair.c
+++ y/kernel/sched/fair.c
@@ -5172,6 +5172,12 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
 static void
 check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 {
+	unsigned int sysctl_sched_latency = 1000000ULL;
+	unsigned long delta_exec;
+
+	delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
+	if (delta_exec < sysctl_sched_latency)
+		return;
 	if (pick_eevdf(cfs_rq) != curr) {
 		resched_curr(rq_of(cfs_rq));
 		/*
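As a quick way to see whether the 1ms gate above moves the involuntary
context switch numbers quoted in the report, before rerunning the full
workload, something like the sketch below (not part of the diff, just an
illustration) can be run as a stand-in worker. It only relies on
getrusage(RUSAGE_THREAD) and the ru_nivcsw counter, and the one-minute
busy loop stands in for real web-worker work.

#define _GNU_SOURCE		/* for RUSAGE_THREAD */
#include <stdio.h>
#include <sys/resource.h>
#include <time.h>

/* Involuntary context switches accumulated by the calling thread so far. */
static long invol_ctxsw(void)
{
	struct rusage ru;

	if (getrusage(RUSAGE_THREAD, &ru))
		return -1;
	return ru.ru_nivcsw;
}

int main(void)
{
	long before = invol_ctxsw();
	time_t end = time(NULL) + 60;
	volatile unsigned long spin = 0;

	/* stand-in for one minute of CPU-bound web-worker work */
	while (time(NULL) < end)
		spin++;

	printf("involuntary context switches/min: %ld\n",
	       invol_ctxsw() - before);
	return 0;
}

Running one copy per CPU with and without the diff applied should be
enough to see whether the floor actually changes the tick-driven
preemption rate.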