This patch series proposes a tweak in the thread placement strategy of CFS to make a better use of cores running at a high frequency. It is the result of a published work at ATC'20, available here: https://www.usenix.org/conference/atc20/presentation/gouicern We address the frequency inversion problem that stems from new DVFS capabilities of CPUs where frequency can be different among cores of the same chip. On applications with a large number of fork/wait patterns, we see that CFS tends to place new threads on idle cores running at a low frequency, while cores where the parent is running are at a high frequency and become idle immediately after. This means that a low frequency core is used while a high frequency core becomes idle. A more detailed analysis of this behavior is available in the paper but here is small example with 2 cores c0 and c1 and 2 threads A and B. Legend: - means high frequency . means low frequency nothing means nothing is running. c0: A------fork()-wait() \ c1: B................-------- c0 becomes idle while at a high frequency and c1 becomes busy and executes B at a low frequency for a long time. The fix we propose is to delay the migration of threads in order to see if the parent's core can be used quickly before using another core. When choosing a core to place a new or waking up thread (in select_task_rq_fair()), we check if the destination core is idle or busy frequency-wise. If it is busy, we keep the current behavior and move the thread to this new core. If it is idle, we place the thread locally (parent's core on fork, waker's core on wakeup). We also arm a high-resolution timer that will trigger a migration of this thread to the new core chosen by the original algorithm of CFS. If the thread is able to run quickly on the local core, we cancel the timer (e.g. the parent thread calls the wait() syscall). Else, the thread will be moved to the core decided by CFS and run quickly. When the parent thread waits, the previous example becomes: c0: A------fork()-wait()-B--------------- \/ c1: If the parent thread keeps running, the example becomes: c0: A------fork()---------timer------------ \/ \ c1: B.............-------- In the first case, we use a high frequency core (c0) and let c1 be idle. In the second case, we loose a few us of execution time for B. There are two configurable parameters in this patch: - the frequency threshold above which we consider a core busy frequency-wise. By default, the threshold is set to 0, which disables the delayed migrations. On our test server (an Intel Xeon E7-8870 v4), an efficient value was the minimal frequency of the CPU. On another test machine (AMD Ryzen 5 3400G), it was a frequency a little bit higher than the minimal one. - the delay of the high-resolution timer. We choose a default value of 50 us since it is the mean delay we observed between a fork and a wait during a kernel build. Both parameters can be modified through files in /proc/sys/kernel/. Since this is kind of work in progress, we are still looking to find better ways to choose the best threshold and delay, as both values probably depend on the machine. In terms of performance, we benchmarked 60 applications on a 5.4 kernel at the time of the paper's publication. The detailed results are available in the paper but overall, we improve the performance of 23 applications, with a best improvement of 56%. Only 3 applications suffer large performance degradations (more than 5%), but the degradation is "small" (at most 8%). We also think there might be energy savings since we try to avoid waking idle cores. We did see small improvements on our machines, but nothing as good as in terms of performance (at most 5% improvement). On the current development kernel (on the sched/core branch, commit f166b111e049), we run a few of these benchmarks to confirm our results on an up-to-date kernel. We run the build of the kernel (complete and scheduler only), hackbench, 2 applications from the NAS benchmark suite and openssl. We select these applications because they had either very good or very bad results on the 5.4 kernel. We present the mean of 10 runs, with the powersave scaling governor and the intel_pstate driver in active mode. Benchmark | unit | 5.9-f166b111e049 | patched -------------+------+------------------+------------------ LOWER IS BETTER -------------+------+------------------+------------------ hackbench | s | 5.661 | 5.612 (- 0.9%) kbuild-sched | s | 7.370 | 6.309 (-14.4%) kbuild | s | 34.798 | 34.788 (- 0.0%) nas_ft.C | s | 10.080 | 10.012 (- 0.7%) nas_sp.B | s | 6.653 | 7.033 (+ 5.7%) -------------+------+------------------+------------------ HIGHER IS BETTER -------------+------+------------------+------------------ openssl |sign/s| 15086.89 | 15071.08 (- 0.1%) On this small subset, the only application negatively impacted application is an HPC workload, so it is less of a problem since HPC people usually bypass the scheduler altogether. We still have a large beneficial impact on the scheduler build that proeminently has the fork/wait pattern and frequency inversions. The first patch of the series is not specific to scheduling. It allows us (or anyone else) to use the cpufreq infrastructure at a different sampling rate without compromising the cpufreq subsystem and applications that depend on it. The main idea behind this patch series is to bring to light the frequency inversion problem that will become more and more prominent with new CPUs that feature per-core DVFS. The solution proposed is a first idea for solving this problem that still needs to be tested across more CPUs and with more applications. Redha Gouicem (3): cpufreq: x86: allow external frequency measures sched: core: x86: query frequency at each tick sched/fair: delay thread migration on fork/wakeup/exec arch/x86/kernel/cpu/aperfmperf.c | 31 ++++++++++++++++++++--- drivers/cpufreq/cpufreq.c | 5 ++++ include/linux/cpufreq.h | 1 + include/linux/sched.h | 4 +++ include/linux/sched/sysctl.h | 3 +++ kernel/sched/core.c | 43 ++++++++++++++++++++++++++++++++ kernel/sched/fair.c | 42 ++++++++++++++++++++++++++++++- kernel/sched/sched.h | 7 ++++++ kernel/sysctl.c | 14 +++++++++++ 9 files changed, 145 insertions(+), 5 deletions(-) -- 2.28.0