On 9 Apr 2023 22:13:50 -0500 David Vernet <void@xxxxxxxxxxxxx>
> Hi Peter,
>
> I used the EEVDF scheduler to run workloads on one of Meta's largest
> services (our main HHVM web server), and I wanted to share my
> observations with you.

Thanks for your testing.

> 3. Low latency + long slice are not mutually exclusive for us
>
> An interesting quality of web workloads running JIT engines is that they
> require both low latency and long slices on the CPU. The reason we need
> the tasks to be low latency is that they're on the critical path for
> servicing web requests (for most of their runtime, at least), and the
> reasons we need them to have long slices are enumerated above -- they
> thrash the icache / DSB / iTLB, more aggressive context switching causes
> us to thrash on paging from disk, and in general, these tasks are on the
> critical path for servicing web requests and we want to encourage them
> to run to completion.
>
> This causes EEVDF to perform poorly for workloads with these
> characteristics. If we decrease latency nice for our web workers then
> they'll have lower latency, but only because their slices are smaller.
> This in turn causes the increase in context switches, which causes the
> thrashing described above.

Take a look at the diff below.

> Worth noting -- I did try to increase the default base slice length by
> setting sysctl_sched_base_slice to 35ms, and these were the results:
>
> With EEVDF slice 35ms and latency_nice 0
> ----------------------------------------
> - 0.5 - 2.25% drop in throughput
> - 2.5 - 4.5% increase in p95 latencies
> - 2.5 - 5.25% increase in p99 latencies
> - Context switch per minute increase: 9.5 - 12.4%
> - Involuntary context switch increase: ~320 - 330%
> - Major fault delta: -3.6% to 37.6%
> - IPC decrease: 0.5 - 0.9%
>
> With EEVDF slice 35ms and latency_nice -8 for web workers
> ---------------------------------------------------------
> - 0.5 - 2.5% drop in throughput
> - 1.7 - 4.75% increase in p95 latencies
> - 2.5 - 5% increase in p99 latencies
> - Context switch per minute increase: 10.5 - 15%
> - Involuntary context switch increase: ~327 - 350%
> - Major fault delta: -1% to 45%
> - IPC decrease: 0.4 - 1.1%
>
> I was expecting the increase in context switches and involuntary context
> switches to be lower than what they ended up being with the increased
> default slice length. Regardless, it still seems to tell a relatively
> consistent story with the numbers from above. The improvement in IPC is
> expected, though smaller than I was anticipating (presumably due to the
> still-high context switch rate). There were also fewer major faults per
> minute compared to runs with a shorter default slice.
>
> Note that even if increasing the slice length did cause fewer context
> switches and major faults, I still expect that it would hurt throughput
> and latency for HHVM, given that when latency-nicer tasks are eventually
> given the CPU, the web workers will have to wait around for longer than
> we'd like while those tasks burn through their longer slices.
>
> In summary, I must admit that this patch set makes me a bit nervous.
> Speaking for Meta at least, the patch set in its current form causes
> performance regressions beyond what we're able to tolerate in production
> (generally < 0.5% at the very most). More broadly, it will certainly
> cause us to have to carefully consider how it affects our model for
> server capacity.
>
> Thanks,
> David
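For reference, here is a rough userspace sketch of how the two knobs used
in the runs above might be set. It is only a sketch under a couple of
assumptions that should be checked against the tree actually tested: that
the base slice is exposed as /sys/kernel/debug/sched/base_slice_ns (the
renamed min_granularity_ns knob), and that latency nice is set through
sched_setattr() with a sched_latency_nice field and a
SCHED_FLAG_LATENCY_NICE flag as in the latency-nice proposal.

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Local copy of the sched_attr uapi layout. The sched_latency_nice field
 * and the SCHED_FLAG_LATENCY_NICE value are assumptions taken from the
 * latency-nice proposal and only exist on a patched kernel.
 */
struct sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;
	uint64_t sched_deadline;
	uint64_t sched_period;
	uint32_t sched_util_min;
	uint32_t sched_util_max;
	int32_t  sched_latency_nice;	/* assumed new field */
};

#define SCHED_FLAG_LATENCY_NICE	0x80	/* assumed value, check the uapi header */

static int set_base_slice_ns(unsigned long long ns)
{
	/* assumed debugfs name for the renamed min_granularity_ns knob */
	int fd = open("/sys/kernel/debug/sched/base_slice_ns", O_WRONLY);
	char buf[32];
	int n, ret = -1;

	if (fd < 0)
		return -1;
	n = snprintf(buf, sizeof(buf), "%llu\n", ns);
	if (write(fd, buf, n) == n)
		ret = 0;
	close(fd);
	return ret;
}

static int set_latency_nice(pid_t pid, int latency_nice)
{
	struct sched_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.sched_flags = SCHED_FLAG_LATENCY_NICE;
	attr.sched_latency_nice = latency_nice;
	/* note: without KEEP flags this also resets policy/nice to SCHED_OTHER/0 */
	return syscall(SYS_sched_setattr, pid, &attr, 0);
}

int main(void)
{
	if (set_base_slice_ns(35ULL * 1000 * 1000))	/* 35ms, as in the runs above */
		perror("base_slice_ns");
	if (set_latency_nice(0, -8))			/* pid 0 == calling thread */
		perror("sched_setattr");
	return 0;
}

Build with gcc and run as root so the debugfs write succeeds.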
In order to narrow down the poor performance reported, make a tradeoff
between runtime and latency simply by restoring a
sysctl_sched_min_granularity-style check at tick preemption, given the
known order on the runqueue: with the diff below, curr is not preempted
at the tick until it has run for at least 1ms since it was last picked.

--- x/kernel/sched/fair.c
+++ y/kernel/sched/fair.c
@@ -5172,6 +5172,12 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
 static void
 check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 {
+	unsigned int sysctl_sched_latency = 1000000ULL;
+	unsigned long delta_exec;
+
+	delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
+	if (delta_exec < sysctl_sched_latency)
+		return;
 	if (pick_eevdf(cfs_rq) != curr) {
 		resched_curr(rq_of(cfs_rq));
 		/*
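As a quick way to see whether the 1ms gate above moves the involuntary
context switch numbers quoted in the report, before rerunning the full
workload, something like the sketch below (not part of the diff, just an
illustration) can be run as a stand-in worker. It only relies on
getrusage(RUSAGE_THREAD) and the ru_nivcsw counter, and the one-minute
busy loop stands in for real web-worker work.

#define _GNU_SOURCE		/* for RUSAGE_THREAD */
#include <stdio.h>
#include <sys/resource.h>
#include <time.h>

/* Involuntary context switches accumulated by the calling thread so far. */
static long invol_ctxsw(void)
{
	struct rusage ru;

	if (getrusage(RUSAGE_THREAD, &ru))
		return -1;
	return ru.ru_nivcsw;
}

int main(void)
{
	long before = invol_ctxsw();
	time_t end = time(NULL) + 60;
	volatile unsigned long spin = 0;

	/* stand-in for one minute of CPU-bound web-worker work */
	while (time(NULL) < end)
		spin++;

	printf("involuntary context switches/min: %ld\n",
	       invol_ctxsw() - before);
	return 0;
}

Running one copy per CPU with and without the diff applied should be
enough to see whether the floor actually changes the tick-driven
preemption rate.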