The following commit has been merged into the sched/core branch of tip: Commit-ID: 1e17fb8edc5ad6587e9303ccdebce853bc8cf30c Gitweb: https://git.kernel.org/tip/1e17fb8edc5ad6587e9303ccdebce853bc8cf30c Author: Clement Courbet <courbet@xxxxxxxxxx> AuthorDate: Wed, 03 Mar 2021 14:46:53 -08:00 Committer: Peter Zijlstra <peterz@xxxxxxxxxxxxx> CommitterDate: Wed, 10 Mar 2021 09:51:49 +01:00 sched: Optimize __calc_delta() A significant portion of __calc_delta() time is spent in the loop shifting a u64 by 32 bits. Use `fls` instead of iterating. This is ~7x faster on benchmarks. The generic `fls` implementation (`generic_fls`) is still ~4x faster than the loop. Architectures that have a better implementation will make use of it. For example, on x86 we get an additional factor 2 in speed without dedicated implementation. On GCC, the asm versions of `fls` are about the same speed as the builtin. On Clang, the versions that use fls are more than twice as slow as the builtin. This is because the way the `fls` function is written, clang puts the value in memory: https://godbolt.org/z/EfMbYe. This bug is filed at https://bugs.llvm.org/show_bug.cgi?idI406. ``` name cpu/op BM_Calc<__calc_delta_loop> 9.57ms Â=B112% BM_Calc<__calc_delta_generic_fls> 2.36ms Â=B113% BM_Calc<__calc_delta_asm_fls> 2.45ms Â=B113% BM_Calc<__calc_delta_asm_fls_nomem> 1.66ms Â=B112% BM_Calc<__calc_delta_asm_fls64> 2.46ms Â=B113% BM_Calc<__calc_delta_asm_fls64_nomem> 1.34ms Â=B115% BM_Calc<__calc_delta_builtin> 1.32ms Â=B111% ``` Signed-off-by: Clement Courbet <courbet@xxxxxxxxxx> Signed-off-by: Josh Don <joshdon@xxxxxxxxxx> Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx> Link: https://lkml.kernel.org/r/20210303224653.2579656-1-joshdon@xxxxxxxxxx --- kernel/sched/fair.c | 19 +++++++++++-------- kernel/sched/sched.h | 1 + 2 files changed, 12 insertions(+), 8 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index f5d6541..2e2ab1e 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -229,22 +229,25 @@ static void __update_inv_weight(struct load_weight *lw) static u64 __calc_delta(u64 delta_exec, unsigned long weight, struct load_weight *lw) { u64 fact = scale_load_down(weight); + u32 fact_hi = (u32)(fact >> 32); int shift = WMULT_SHIFT; + int fs; __update_inv_weight(lw); - if (unlikely(fact >> 32)) { - while (fact >> 32) { - fact >>= 1; - shift--; - } + if (unlikely(fact_hi)) { + fs = fls(fact_hi); + shift -= fs; + fact >>= fs; } fact = mul_u32_u32(fact, lw->inv_weight); - while (fact >> 32) { - fact >>= 1; - shift--; + fact_hi = (u32)(fact >> 32); + if (fact_hi) { + fs = fls(fact_hi); + shift -= fs; + fact >>= fs; } return mul_u64_u32_shr(delta_exec, fact, shift); diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index bb8bb06..d2e09a6 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -36,6 +36,7 @@ #include <uapi/linux/sched/types.h> #include <linux/binfmts.h> +#include <linux/bitops.h> #include <linux/blkdev.h> #include <linux/compat.h> #include <linux/context_tracking.h>