Paul E. McKenney <paulmck@xxxxxxxxxx> writes: > On Sat, Sep 23, 2023 at 03:11:05AM +0200, Thomas Gleixner wrote: >> On Fri, Sep 22 2023 at 00:55, Thomas Gleixner wrote: >> > On Thu, Sep 21 2023 at 09:00, Linus Torvalds wrote: >> >> That said - I think as a proof of concept and "look, with this we get >> >> the expected scheduling event counts", that patch is perfect. I think >> >> you more than proved the concept. >> > >> > There is certainly quite some analyis work to do to make this a one to >> > one replacement. >> > >> > With a handful of benchmarks the PoC (tweaked with some obvious fixes) >> > is pretty much on par with the current mainline variants (NONE/FULL), >> > but the memtier benchmark makes a massive dent. >> > >> > It sports a whopping 10% regression with the LAZY mode versus the mainline >> > NONE model. Non-LAZY and FULL behave unsurprisingly in the same way. >> > >> > That benchmark is really sensitive to the preemption model. With current >> > mainline (DYNAMIC_PREEMPT enabled) the preempt=FULL model has ~20% >> > performance drop versus preempt=NONE. >> >> That 20% was a tired pilot error. The real number is in the 5% ballpark. >> >> > I have no clue what's going on there yet, but that shows that there is >> > obviously quite some work ahead to get this sorted. >> >> It took some head scratching to figure that out. The initial fix broke >> the handling of the hog issue, i.e. the problem that Ankur tried to >> solve, but I hacked up a "solution" for that too. >> >> With that the memtier benchmark is roughly back to the mainline numbers, >> but my throughput benchmark know how is pretty close to zero, so that >> should be looked at by people who actually understand these things. >> >> Likewise the hog prevention is just at the PoC level and clearly beyond >> my knowledge of scheduler details: It unconditionally forces a >> reschedule when the looping task is not responding to a lazy reschedule >> request before the next tick. IOW it forces a reschedule on the second >> tick, which is obviously different from the cond_resched()/might_sleep() >> behaviour. >> >> The changes vs. the original PoC aside of the bug and thinko fixes: >> >> 1) A hack to utilize the TRACE_FLAG_IRQS_NOSUPPORT flag to trace the >> lazy preempt bit as the trace_entry::flags field is full already. >> >> That obviously breaks the tracer ABI, but if we go there then >> this needs to be fixed. Steven? >> >> 2) debugfs file to validate that loops can be force preempted w/o >> cond_resched() >> >> The usage is: >> >> # taskset -c 1 bash >> # echo 1 > /sys/kernel/debug/sched/hog & >> # echo 1 > /sys/kernel/debug/sched/hog & >> # echo 1 > /sys/kernel/debug/sched/hog & >> >> top shows ~33% CPU for each of the hogs and tracing confirms that >> the crude hack in the scheduler tick works: >> >> bash-4559 [001] dlh2. 2253.331202: resched_curr <-__update_curr >> bash-4560 [001] dlh2. 2253.340199: resched_curr <-__update_curr >> bash-4561 [001] dlh2. 2253.346199: resched_curr <-__update_curr >> bash-4559 [001] dlh2. 2253.353199: resched_curr <-__update_curr >> bash-4561 [001] dlh2. 2253.358199: resched_curr <-__update_curr >> bash-4560 [001] dlh2. 2253.370202: resched_curr <-__update_curr >> bash-4559 [001] dlh2. 2253.378198: resched_curr <-__update_curr >> bash-4561 [001] dlh2. 2253.389199: resched_curr <-__update_curr >> >> The 'l' instead of the usual 'N' reflects that the lazy resched >> bit is set. That makes __update_curr() invoke resched_curr() >> instead of the lazy variant. resched_curr() sets TIF_NEED_RESCHED >> and folds it into preempt_count so that preemption happens at the >> next possible point, i.e. either in return from interrupt or at >> the next preempt_enable(). > > Belatedly calling out some RCU issues. Nothing fatal, just a > (surprisingly) few adjustments that will need to be made. The key thing > to note is that from RCU's viewpoint, with this change, all kernels > are preemptible, though rcu_read_lock() readers remain non-preemptible. Yeah, in Thomas' patch CONFIG_PREEMPTION=y and preemption models none/voluntary/full are just scheduler tweaks on top of that. And, so this would always have PREEMPT_RCU=y. So, shouldn't rcu_read_lock() readers be preemptible? (An alternate configuration might be: config PREEMPT_NONE select PREEMPT_COUNT config PREEMPT_FULL select PREEMPTION This probably allows for more configuration flexibility across archs? Would allow for TREE_RCU=y, for instance. That said, so far I've only been working with PREEMPT_RCU=y.) > With that: > > 1. As an optimization, given that preempt_count() would always give > good information, the scheduling-clock interrupt could sense RCU > readers for new-age CONFIG_PREEMPT_NONE=y kernels. As might the > IPI handlers for expedited grace periods. A nice optimization. > Except that... > > 2. The quiescent-state-forcing code currently relies on the presence > of cond_resched() in CONFIG_PREEMPT_RCU=n kernels. One fix > would be to do resched_cpu() more quickly, but some workloads > might not love the additional IPIs. Another approach to do #1 > above to replace the quiescent states from cond_resched() with > scheduler-tick-interrupt-sensed quiescent states. Right, the call to rcu_all_qs(). Just to see if I have it straight, something like this for PREEMPT_RCU=n kernels? if ((preempt_count() & ~PREEMPT_NEED_RESCHED) == 0) rcu_all_qs(); (Masked because PREEMPT_NONE might not do any folding for NEED_RESCHED_LAZY in the tick.) Though the comment around rcu_all_qs() mentions that rcu_all_qs() reports a quiescent state only if urgently needed. Given that the tick executes less frequently than calls to cond_resched(), could we just always report instead? Or I'm completely on the wrong track? if ((preempt_count() & ~PREEMPT_NEED_RESCHED) == 0) { preempt_disable(); rcu_qs(); preempt_enable(); } On your point about the preempt_count() being dependable, there's a wrinkle. As Linus mentions in https://lore.kernel.org/lkml/CAHk-=wgUimqtF7PqFfRw4Ju5H1KYkp6+8F=hBz7amGQ8GaGKkA@xxxxxxxxxxxxxx/, that might not be true for architectures that define ARCH_NO_PREEMPT. My plan was to limit those archs to do preemption only at user space boundary but there are almost certainly RCU implications that I missed. > Plus... > > 3. For nohz_full CPUs that run for a long time in the kernel, > there are no scheduling-clock interrupts. RCU reaches for > the resched_cpu() hammer a few jiffies into the grace period. > And it sets the ->rcu_urgent_qs flag so that the holdout CPU's > interrupt-entry code will re-enable its scheduling-clock interrupt > upon receiving the resched_cpu() IPI. > > So nohz_full CPUs should be OK as far as RCU is concerned. > Other subsystems might have other opinions. Ah, that's what I thought from my reading of the RCU comments. Good to have that confirmed. Thanks. > 4. As another optimization, kvfree_rcu() could unconditionally > check preempt_count() to sense a clean environment suitable for > memory allocation. Had missed this completely. Could you elaborate? > 5. Kconfig files with "select TASKS_RCU if PREEMPTION" must > instead say "select TASKS_RCU". This means that the #else > in include/linux/rcupdate.h that defines TASKS_RCU in terms of > vanilla RCU must go. There might be be some fallout if something > fails to select TASKS_RCU, builds only with CONFIG_PREEMPT_NONE=y, > and expects call_rcu_tasks(), synchronize_rcu_tasks(), or > rcu_tasks_classic_qs() do do something useful. Ack. > 6. You might think that RCU Tasks (as opposed to RCU Tasks Trace > or RCU Tasks Rude) would need those pesky cond_resched() calls > to stick around. The reason is that RCU Tasks readers are ended > only by voluntary context switches. This means that although a > preemptible infinite loop in the kernel won't inconvenience a > real-time task (nor an non-real-time task for all that long), > and won't delay grace periods for the other flavors of RCU, > it would indefinitely delay an RCU Tasks grace period. > > However, RCU Tasks grace periods seem to be finite in preemptible > kernels today, so they should remain finite in limited-preemptible > kernels tomorrow. Famous last words... > > 7. RCU Tasks Trace, RCU Tasks Rude, and SRCU shouldn't notice > any algorithmic difference from this change. So, essentially, as long as RCU tasks eventually, in the fullness of time, call schedule(), removing cond_resched() shouldn't have any effect :). > 8. As has been noted elsewhere, in this new limited-preemption > mode of operation, rcu_read_lock() readers remain preemptible. > This means that most of the CONFIG_PREEMPT_RCU #ifdefs remain. Ack. > 9. The rcu_preempt_depth() macro could do something useful in > limited-preemption kernels. Its current lack of ability in > CONFIG_PREEMPT_NONE=y kernels has caused trouble in the past. > > 10. The cond_resched_rcu() function must remain because we still > have non-preemptible rcu_read_lock() readers. For configurations with PREEMPT_RCU=n? Yes, agreed. Though it need only be this, right?: static inline void cond_resched_rcu(void) { #if defined(CONFIG_DEBUG_ATOMIC_SLEEP) || !defined(CONFIG_PREEMPT_RCU) rcu_read_unlock(); rcu_read_lock(); #endif } > 11. My guess is that the IPVS_EST_TICK_CHAINS heuristic remains > unchanged, but I must defer to the include/net/ip_vs.h people. > > 12. I need to check with the BPF folks on the BPF verifier's > definition of BTF_ID(func, rcu_read_unlock_strict). > > 13. The kernel/locking/rtmutex.c file's rtmutex_spin_on_owner() > function might have some redundancy across the board instead > of just on CONFIG_PREEMPT_RCU=y. Or might not. I don't think I understand any of these well enough to comment. Will Cc the relevant folks when I send out the RFC. > 14. The kernel/trace/trace_osnoise.c file's run_osnoise() function > might need to do something for non-preemptible RCU to make > up for the lack of cond_resched() calls. Maybe just drop the > "IS_ENABLED()" and execute the body of the current "if" statement > unconditionally. Aah, yes this is a good idea. Thanks. > 15. I must defer to others on the mm/pgtable-generic.c file's > #ifdef that depends on CONFIG_PREEMPT_RCU. > > While in the area, I noted that KLP seems to depend on cond_resched(), > but on this I must defer to the KLP people. Yeah, as part of this work, I ended up unhooking most of the KLP hooks in cond_resched() and of course, cond_resched() itself. Will poke the livepatching people. > I am sure that I am missing something, but I have not yet seen any > show-stoppers. Just some needed adjustments. Appreciate this detailed list. Makes me think that everything might not go up in smoke after all! Thanks Ankur > Thoughts? > > Thanx, Paul > >> That's as much as I wanted to demonstrate and I'm not going to spend >> more cycles on it as I have already too many other things on flight and >> the resulting scheduler woes are clearly outside of my expertice. >> >> Though definitely I'm putting a permanent NAK in place for any attempts >> to duct tape the preempt=NONE model any further by sprinkling more >> cond*() and whatever warts around. >> >> Thanks, >> >> tglx >> --- >> arch/x86/Kconfig | 1 >> arch/x86/include/asm/thread_info.h | 6 ++-- >> drivers/acpi/processor_idle.c | 2 - >> include/linux/entry-common.h | 2 - >> include/linux/entry-kvm.h | 2 - >> include/linux/sched.h | 12 +++++--- >> include/linux/sched/idle.h | 8 ++--- >> include/linux/thread_info.h | 24 +++++++++++++++++ >> include/linux/trace_events.h | 8 ++--- >> kernel/Kconfig.preempt | 17 +++++++++++- >> kernel/entry/common.c | 4 +- >> kernel/entry/kvm.c | 2 - >> kernel/sched/core.c | 51 +++++++++++++++++++++++++------------ >> kernel/sched/debug.c | 19 +++++++++++++ >> kernel/sched/fair.c | 46 ++++++++++++++++++++++----------- >> kernel/sched/features.h | 2 + >> kernel/sched/idle.c | 3 -- >> kernel/sched/sched.h | 1 >> kernel/trace/trace.c | 2 + >> kernel/trace/trace_output.c | 16 ++++++++++- >> 20 files changed, 171 insertions(+), 57 deletions(-) >> >> --- a/kernel/sched/core.c >> +++ b/kernel/sched/core.c >> @@ -898,14 +898,15 @@ static inline void hrtick_rq_init(struct >> >> #if defined(CONFIG_SMP) && defined(TIF_POLLING_NRFLAG) >> /* >> - * Atomically set TIF_NEED_RESCHED and test for TIF_POLLING_NRFLAG, >> + * Atomically set TIF_NEED_RESCHED[_LAZY] and test for TIF_POLLING_NRFLAG, >> * this avoids any races wrt polling state changes and thereby avoids >> * spurious IPIs. >> */ >> -static inline bool set_nr_and_not_polling(struct task_struct *p) >> +static inline bool set_nr_and_not_polling(struct task_struct *p, int tif_bit) >> { >> struct thread_info *ti = task_thread_info(p); >> - return !(fetch_or(&ti->flags, _TIF_NEED_RESCHED) & _TIF_POLLING_NRFLAG); >> + >> + return !(fetch_or(&ti->flags, 1 << tif_bit) & _TIF_POLLING_NRFLAG); >> } >> >> /* >> @@ -922,7 +923,7 @@ static bool set_nr_if_polling(struct tas >> for (;;) { >> if (!(val & _TIF_POLLING_NRFLAG)) >> return false; >> - if (val & _TIF_NEED_RESCHED) >> + if (val & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY)) >> return true; >> if (try_cmpxchg(&ti->flags, &val, val | _TIF_NEED_RESCHED)) >> break; >> @@ -931,9 +932,9 @@ static bool set_nr_if_polling(struct tas >> } >> >> #else >> -static inline bool set_nr_and_not_polling(struct task_struct *p) >> +static inline bool set_nr_and_not_polling(struct task_struct *p, int tif_bit) >> { >> - set_tsk_need_resched(p); >> + set_tsk_thread_flag(p, tif_bit); >> return true; >> } >> >> @@ -1038,28 +1039,47 @@ void wake_up_q(struct wake_q_head *head) >> * might also involve a cross-CPU call to trigger the scheduler on >> * the target CPU. >> */ >> -void resched_curr(struct rq *rq) >> +static void __resched_curr(struct rq *rq, int lazy) >> { >> + int cpu, tif_bit = TIF_NEED_RESCHED + lazy; >> struct task_struct *curr = rq->curr; >> - int cpu; >> >> lockdep_assert_rq_held(rq); >> >> - if (test_tsk_need_resched(curr)) >> + if (unlikely(test_tsk_thread_flag(curr, tif_bit))) >> return; >> >> cpu = cpu_of(rq); >> >> if (cpu == smp_processor_id()) { >> - set_tsk_need_resched(curr); >> - set_preempt_need_resched(); >> + set_tsk_thread_flag(curr, tif_bit); >> + if (!lazy) >> + set_preempt_need_resched(); >> return; >> } >> >> - if (set_nr_and_not_polling(curr)) >> - smp_send_reschedule(cpu); >> - else >> + if (set_nr_and_not_polling(curr, tif_bit)) { >> + if (!lazy) >> + smp_send_reschedule(cpu); >> + } else { >> trace_sched_wake_idle_without_ipi(cpu); >> + } >> +} >> + >> +void resched_curr(struct rq *rq) >> +{ >> + __resched_curr(rq, 0); >> +} >> + >> +void resched_curr_lazy(struct rq *rq) >> +{ >> + int lazy = IS_ENABLED(CONFIG_PREEMPT_AUTO) && !sched_feat(FORCE_NEED_RESCHED) ? >> + TIF_NEED_RESCHED_LAZY_OFFSET : 0; >> + >> + if (lazy && unlikely(test_tsk_thread_flag(rq->curr, TIF_NEED_RESCHED))) >> + return; >> + >> + __resched_curr(rq, lazy); >> } >> >> void resched_cpu(int cpu) >> @@ -1132,7 +1152,7 @@ static void wake_up_idle_cpu(int cpu) >> if (cpu == smp_processor_id()) >> return; >> >> - if (set_nr_and_not_polling(rq->idle)) >> + if (set_nr_and_not_polling(rq->idle, TIF_NEED_RESCHED)) >> smp_send_reschedule(cpu); >> else >> trace_sched_wake_idle_without_ipi(cpu); >> @@ -8872,7 +8892,6 @@ static void __init preempt_dynamic_init( >> WARN_ON_ONCE(preempt_dynamic_mode == preempt_dynamic_undefined); \ >> return preempt_dynamic_mode == preempt_dynamic_##mode; \ >> } \ >> - EXPORT_SYMBOL_GPL(preempt_model_##mode) >> >> PREEMPT_MODEL_ACCESSOR(none); >> PREEMPT_MODEL_ACCESSOR(voluntary); >> --- a/include/linux/thread_info.h >> +++ b/include/linux/thread_info.h >> @@ -59,6 +59,16 @@ enum syscall_work_bit { >> >> #include <asm/thread_info.h> >> >> +#ifdef CONFIG_PREEMPT_AUTO >> +# define TIF_NEED_RESCHED_LAZY TIF_ARCH_RESCHED_LAZY >> +# define _TIF_NEED_RESCHED_LAZY _TIF_ARCH_RESCHED_LAZY >> +# define TIF_NEED_RESCHED_LAZY_OFFSET (TIF_NEED_RESCHED_LAZY - TIF_NEED_RESCHED) >> +#else >> +# define TIF_NEED_RESCHED_LAZY TIF_NEED_RESCHED >> +# define _TIF_NEED_RESCHED_LAZY _TIF_NEED_RESCHED >> +# define TIF_NEED_RESCHED_LAZY_OFFSET 0 >> +#endif >> + >> #ifdef __KERNEL__ >> >> #ifndef arch_set_restart_data >> @@ -185,6 +195,13 @@ static __always_inline bool tif_need_res >> (unsigned long *)(¤t_thread_info()->flags)); >> } >> >> +static __always_inline bool tif_need_resched_lazy(void) >> +{ >> + return IS_ENABLED(CONFIG_PREEMPT_AUTO) && >> + arch_test_bit(TIF_NEED_RESCHED_LAZY, >> + (unsigned long *)(¤t_thread_info()->flags)); >> +} >> + >> #else >> >> static __always_inline bool tif_need_resched(void) >> @@ -193,6 +210,13 @@ static __always_inline bool tif_need_res >> (unsigned long *)(¤t_thread_info()->flags)); >> } >> >> +static __always_inline bool tif_need_resched_lazy(void) >> +{ >> + return IS_ENABLED(CONFIG_PREEMPT_AUTO) && >> + test_bit(TIF_NEED_RESCHED_LAZY, >> + (unsigned long *)(¤t_thread_info()->flags)); >> +} >> + >> #endif /* _ASM_GENERIC_BITOPS_INSTRUMENTED_NON_ATOMIC_H */ >> >> #ifndef CONFIG_HAVE_ARCH_WITHIN_STACK_FRAMES >> --- a/kernel/Kconfig.preempt >> +++ b/kernel/Kconfig.preempt >> @@ -11,6 +11,13 @@ config PREEMPT_BUILD >> select PREEMPTION >> select UNINLINE_SPIN_UNLOCK if !ARCH_INLINE_SPIN_UNLOCK >> >> +config PREEMPT_BUILD_AUTO >> + bool >> + select PREEMPT_BUILD >> + >> +config HAVE_PREEMPT_AUTO >> + bool >> + >> choice >> prompt "Preemption Model" >> default PREEMPT_NONE >> @@ -67,9 +74,17 @@ config PREEMPT >> embedded system with latency requirements in the milliseconds >> range. >> >> +config PREEMPT_AUTO >> + bool "Automagic preemption mode with runtime tweaking support" >> + depends on HAVE_PREEMPT_AUTO >> + select PREEMPT_BUILD_AUTO >> + help >> + Add some sensible blurb here >> + >> config PREEMPT_RT >> bool "Fully Preemptible Kernel (Real-Time)" >> depends on EXPERT && ARCH_SUPPORTS_RT >> + select PREEMPT_BUILD_AUTO if HAVE_PREEMPT_AUTO >> select PREEMPTION >> help >> This option turns the kernel into a real-time kernel by replacing >> @@ -95,7 +110,7 @@ config PREEMPTION >> >> config PREEMPT_DYNAMIC >> bool "Preemption behaviour defined on boot" >> - depends on HAVE_PREEMPT_DYNAMIC && !PREEMPT_RT >> + depends on HAVE_PREEMPT_DYNAMIC && !PREEMPT_RT && !PREEMPT_AUTO >> select JUMP_LABEL if HAVE_PREEMPT_DYNAMIC_KEY >> select PREEMPT_BUILD >> default y if HAVE_PREEMPT_DYNAMIC_CALL >> --- a/include/linux/entry-common.h >> +++ b/include/linux/entry-common.h >> @@ -60,7 +60,7 @@ >> #define EXIT_TO_USER_MODE_WORK \ >> (_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE | \ >> _TIF_NEED_RESCHED | _TIF_PATCH_PENDING | _TIF_NOTIFY_SIGNAL | \ >> - ARCH_EXIT_TO_USER_MODE_WORK) >> + _TIF_NEED_RESCHED_LAZY | ARCH_EXIT_TO_USER_MODE_WORK) >> >> /** >> * arch_enter_from_user_mode - Architecture specific sanity check for user mode regs >> --- a/include/linux/entry-kvm.h >> +++ b/include/linux/entry-kvm.h >> @@ -18,7 +18,7 @@ >> >> #define XFER_TO_GUEST_MODE_WORK \ >> (_TIF_NEED_RESCHED | _TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL | \ >> - _TIF_NOTIFY_RESUME | ARCH_XFER_TO_GUEST_MODE_WORK) >> + _TIF_NOTIFY_RESUME | _TIF_NEED_RESCHED_LAZY | ARCH_XFER_TO_GUEST_MODE_WORK) >> >> struct kvm_vcpu; >> >> --- a/kernel/entry/common.c >> +++ b/kernel/entry/common.c >> @@ -155,7 +155,7 @@ static unsigned long exit_to_user_mode_l >> >> local_irq_enable_exit_to_user(ti_work); >> >> - if (ti_work & _TIF_NEED_RESCHED) >> + if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY)) >> schedule(); >> >> if (ti_work & _TIF_UPROBE) >> @@ -385,7 +385,7 @@ void raw_irqentry_exit_cond_resched(void >> rcu_irq_exit_check_preempt(); >> if (IS_ENABLED(CONFIG_DEBUG_ENTRY)) >> WARN_ON_ONCE(!on_thread_stack()); >> - if (need_resched()) >> + if (test_tsk_need_resched(current)) >> preempt_schedule_irq(); >> } >> } >> --- a/kernel/sched/features.h >> +++ b/kernel/sched/features.h >> @@ -89,3 +89,5 @@ SCHED_FEAT(UTIL_EST_FASTUP, true) >> SCHED_FEAT(LATENCY_WARN, false) >> >> SCHED_FEAT(HZ_BW, true) >> + >> +SCHED_FEAT(FORCE_NEED_RESCHED, false) >> --- a/kernel/sched/sched.h >> +++ b/kernel/sched/sched.h >> @@ -2435,6 +2435,7 @@ extern void init_sched_fair_class(void); >> extern void reweight_task(struct task_struct *p, int prio); >> >> extern void resched_curr(struct rq *rq); >> +extern void resched_curr_lazy(struct rq *rq); >> extern void resched_cpu(int cpu); >> >> extern struct rt_bandwidth def_rt_bandwidth; >> --- a/include/linux/sched.h >> +++ b/include/linux/sched.h >> @@ -2046,17 +2046,17 @@ static inline void update_tsk_thread_fla >> update_ti_thread_flag(task_thread_info(tsk), flag, value); >> } >> >> -static inline int test_and_set_tsk_thread_flag(struct task_struct *tsk, int flag) >> +static inline bool test_and_set_tsk_thread_flag(struct task_struct *tsk, int flag) >> { >> return test_and_set_ti_thread_flag(task_thread_info(tsk), flag); >> } >> >> -static inline int test_and_clear_tsk_thread_flag(struct task_struct *tsk, int flag) >> +static inline bool test_and_clear_tsk_thread_flag(struct task_struct *tsk, int flag) >> { >> return test_and_clear_ti_thread_flag(task_thread_info(tsk), flag); >> } >> >> -static inline int test_tsk_thread_flag(struct task_struct *tsk, int flag) >> +static inline bool test_tsk_thread_flag(struct task_struct *tsk, int flag) >> { >> return test_ti_thread_flag(task_thread_info(tsk), flag); >> } >> @@ -2069,9 +2069,11 @@ static inline void set_tsk_need_resched( >> static inline void clear_tsk_need_resched(struct task_struct *tsk) >> { >> clear_tsk_thread_flag(tsk,TIF_NEED_RESCHED); >> + if (IS_ENABLED(CONFIG_PREEMPT_AUTO)) >> + clear_tsk_thread_flag(tsk, TIF_NEED_RESCHED_LAZY); >> } >> >> -static inline int test_tsk_need_resched(struct task_struct *tsk) >> +static inline bool test_tsk_need_resched(struct task_struct *tsk) >> { >> return unlikely(test_tsk_thread_flag(tsk,TIF_NEED_RESCHED)); >> } >> @@ -2252,7 +2254,7 @@ static inline int rwlock_needbreak(rwloc >> >> static __always_inline bool need_resched(void) >> { >> - return unlikely(tif_need_resched()); >> + return unlikely(tif_need_resched_lazy() || tif_need_resched()); >> } >> >> /* >> --- a/kernel/sched/fair.c >> +++ b/kernel/sched/fair.c >> @@ -964,8 +964,10 @@ static void clear_buddies(struct cfs_rq >> * XXX: strictly: vd_i += N*r_i/w_i such that: vd_i > ve_i >> * this is probably good enough. >> */ >> -static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se) >> +static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se, bool tick) >> { >> + struct rq *rq = rq_of(cfs_rq); >> + >> if ((s64)(se->vruntime - se->deadline) < 0) >> return; >> >> @@ -984,10 +986,19 @@ static void update_deadline(struct cfs_r >> /* >> * The task has consumed its request, reschedule. >> */ >> - if (cfs_rq->nr_running > 1) { >> - resched_curr(rq_of(cfs_rq)); >> - clear_buddies(cfs_rq, se); >> + if (cfs_rq->nr_running < 2) >> + return; >> + >> + if (!IS_ENABLED(CONFIG_PREEMPT_AUTO) || sched_feat(FORCE_NEED_RESCHED)) { >> + resched_curr(rq); >> + } else { >> + /* Did the task ignore the lazy reschedule request? */ >> + if (tick && test_tsk_thread_flag(rq->curr, TIF_NEED_RESCHED_LAZY)) >> + resched_curr(rq); >> + else >> + resched_curr_lazy(rq); >> } >> + clear_buddies(cfs_rq, se); >> } >> >> #include "pelt.h" >> @@ -1095,7 +1106,7 @@ static void update_tg_load_avg(struct cf >> /* >> * Update the current task's runtime statistics. >> */ >> -static void update_curr(struct cfs_rq *cfs_rq) >> +static void __update_curr(struct cfs_rq *cfs_rq, bool tick) >> { >> struct sched_entity *curr = cfs_rq->curr; >> u64 now = rq_clock_task(rq_of(cfs_rq)); >> @@ -1122,7 +1133,7 @@ static void update_curr(struct cfs_rq *c >> schedstat_add(cfs_rq->exec_clock, delta_exec); >> >> curr->vruntime += calc_delta_fair(delta_exec, curr); >> - update_deadline(cfs_rq, curr); >> + update_deadline(cfs_rq, curr, tick); >> update_min_vruntime(cfs_rq); >> >> if (entity_is_task(curr)) { >> @@ -1136,6 +1147,11 @@ static void update_curr(struct cfs_rq *c >> account_cfs_rq_runtime(cfs_rq, delta_exec); >> } >> >> +static inline void update_curr(struct cfs_rq *cfs_rq) >> +{ >> + __update_curr(cfs_rq, false); >> +} >> + >> static void update_curr_fair(struct rq *rq) >> { >> update_curr(cfs_rq_of(&rq->curr->se)); >> @@ -5253,7 +5269,7 @@ entity_tick(struct cfs_rq *cfs_rq, struc >> /* >> * Update run-time statistics of the 'current'. >> */ >> - update_curr(cfs_rq); >> + __update_curr(cfs_rq, true); >> >> /* >> * Ensure that runnable average is periodically updated. >> @@ -5267,7 +5283,7 @@ entity_tick(struct cfs_rq *cfs_rq, struc >> * validating it and just reschedule. >> */ >> if (queued) { >> - resched_curr(rq_of(cfs_rq)); >> + resched_curr_lazy(rq_of(cfs_rq)); >> return; >> } >> /* >> @@ -5413,7 +5429,7 @@ static void __account_cfs_rq_runtime(str >> * hierarchy can be throttled >> */ >> if (!assign_cfs_rq_runtime(cfs_rq) && likely(cfs_rq->curr)) >> - resched_curr(rq_of(cfs_rq)); >> + resched_curr_lazy(rq_of(cfs_rq)); >> } >> >> static __always_inline >> @@ -5673,7 +5689,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cf >> >> /* Determine whether we need to wake up potentially idle CPU: */ >> if (rq->curr == rq->idle && rq->cfs.nr_running) >> - resched_curr(rq); >> + resched_curr_lazy(rq); >> } >> >> #ifdef CONFIG_SMP >> @@ -6378,7 +6394,7 @@ static void hrtick_start_fair(struct rq >> >> if (delta < 0) { >> if (task_current(rq, p)) >> - resched_curr(rq); >> + resched_curr_lazy(rq); >> return; >> } >> hrtick_start(rq, delta); >> @@ -8031,7 +8047,7 @@ static void check_preempt_wakeup(struct >> * prevents us from potentially nominating it as a false LAST_BUDDY >> * below. >> */ >> - if (test_tsk_need_resched(curr)) >> + if (need_resched()) >> return; >> >> /* Idle tasks are by definition preempted by non-idle tasks. */ >> @@ -8073,7 +8089,7 @@ static void check_preempt_wakeup(struct >> return; >> >> preempt: >> - resched_curr(rq); >> + resched_curr_lazy(rq); >> } >> >> #ifdef CONFIG_SMP >> @@ -12224,7 +12240,7 @@ static inline void task_tick_core(struct >> */ >> if (rq->core->core_forceidle_count && rq->cfs.nr_running == 1 && >> __entity_slice_used(&curr->se, MIN_NR_TASKS_DURING_FORCEIDLE)) >> - resched_curr(rq); >> + resched_curr_lazy(rq); >> } >> >> /* >> @@ -12389,7 +12405,7 @@ prio_changed_fair(struct rq *rq, struct >> */ >> if (task_current(rq, p)) { >> if (p->prio > oldprio) >> - resched_curr(rq); >> + resched_curr_lazy(rq); >> } else >> check_preempt_curr(rq, p, 0); >> } >> --- a/drivers/acpi/processor_idle.c >> +++ b/drivers/acpi/processor_idle.c >> @@ -108,7 +108,7 @@ static const struct dmi_system_id proces >> */ >> static void __cpuidle acpi_safe_halt(void) >> { >> - if (!tif_need_resched()) { >> + if (!need_resched()) { >> raw_safe_halt(); >> raw_local_irq_disable(); >> } >> --- a/include/linux/sched/idle.h >> +++ b/include/linux/sched/idle.h >> @@ -63,7 +63,7 @@ static __always_inline bool __must_check >> */ >> smp_mb__after_atomic(); >> >> - return unlikely(tif_need_resched()); >> + return unlikely(need_resched()); >> } >> >> static __always_inline bool __must_check current_clr_polling_and_test(void) >> @@ -76,7 +76,7 @@ static __always_inline bool __must_check >> */ >> smp_mb__after_atomic(); >> >> - return unlikely(tif_need_resched()); >> + return unlikely(need_resched()); >> } >> >> #else >> @@ -85,11 +85,11 @@ static inline void __current_clr_polling >> >> static inline bool __must_check current_set_polling_and_test(void) >> { >> - return unlikely(tif_need_resched()); >> + return unlikely(need_resched()); >> } >> static inline bool __must_check current_clr_polling_and_test(void) >> { >> - return unlikely(tif_need_resched()); >> + return unlikely(need_resched()); >> } >> #endif >> >> --- a/kernel/sched/idle.c >> +++ b/kernel/sched/idle.c >> @@ -57,8 +57,7 @@ static noinline int __cpuidle cpu_idle_p >> ct_cpuidle_enter(); >> >> raw_local_irq_enable(); >> - while (!tif_need_resched() && >> - (cpu_idle_force_poll || tick_check_broadcast_expired())) >> + while (!need_resched() && (cpu_idle_force_poll || tick_check_broadcast_expired())) >> cpu_relax(); >> raw_local_irq_disable(); >> >> --- a/kernel/trace/trace.c >> +++ b/kernel/trace/trace.c >> @@ -2722,6 +2722,8 @@ unsigned int tracing_gen_ctx_irq_test(un >> >> if (tif_need_resched()) >> trace_flags |= TRACE_FLAG_NEED_RESCHED; >> + if (tif_need_resched_lazy()) >> + trace_flags |= TRACE_FLAG_NEED_RESCHED_LAZY; >> if (test_preempt_need_resched()) >> trace_flags |= TRACE_FLAG_PREEMPT_RESCHED; >> return (trace_flags << 16) | (min_t(unsigned int, pc & 0xff, 0xf)) | >> --- a/arch/x86/Kconfig >> +++ b/arch/x86/Kconfig >> @@ -271,6 +271,7 @@ config X86 >> select HAVE_STATIC_CALL >> select HAVE_STATIC_CALL_INLINE if HAVE_OBJTOOL >> select HAVE_PREEMPT_DYNAMIC_CALL >> + select HAVE_PREEMPT_AUTO >> select HAVE_RSEQ >> select HAVE_RUST if X86_64 >> select HAVE_SYSCALL_TRACEPOINTS >> --- a/arch/x86/include/asm/thread_info.h >> +++ b/arch/x86/include/asm/thread_info.h >> @@ -81,8 +81,9 @@ struct thread_info { >> #define TIF_NOTIFY_RESUME 1 /* callback before returning to user */ >> #define TIF_SIGPENDING 2 /* signal pending */ >> #define TIF_NEED_RESCHED 3 /* rescheduling necessary */ >> -#define TIF_SINGLESTEP 4 /* reenable singlestep on user return*/ >> -#define TIF_SSBD 5 /* Speculative store bypass disable */ >> +#define TIF_ARCH_RESCHED_LAZY 4 /* Lazy rescheduling */ >> +#define TIF_SINGLESTEP 5 /* reenable singlestep on user return*/ >> +#define TIF_SSBD 6 /* Speculative store bypass disable */ >> #define TIF_SPEC_IB 9 /* Indirect branch speculation mitigation */ >> #define TIF_SPEC_L1D_FLUSH 10 /* Flush L1D on mm switches (processes) */ >> #define TIF_USER_RETURN_NOTIFY 11 /* notify kernel of userspace return */ >> @@ -104,6 +105,7 @@ struct thread_info { >> #define _TIF_NOTIFY_RESUME (1 << TIF_NOTIFY_RESUME) >> #define _TIF_SIGPENDING (1 << TIF_SIGPENDING) >> #define _TIF_NEED_RESCHED (1 << TIF_NEED_RESCHED) >> +#define _TIF_ARCH_RESCHED_LAZY (1 << TIF_ARCH_RESCHED_LAZY) >> #define _TIF_SINGLESTEP (1 << TIF_SINGLESTEP) >> #define _TIF_SSBD (1 << TIF_SSBD) >> #define _TIF_SPEC_IB (1 << TIF_SPEC_IB) >> --- a/kernel/entry/kvm.c >> +++ b/kernel/entry/kvm.c >> @@ -13,7 +13,7 @@ static int xfer_to_guest_mode_work(struc >> return -EINTR; >> } >> >> - if (ti_work & _TIF_NEED_RESCHED) >> + if (ti_work & (_TIF_NEED_RESCHED | TIF_NEED_RESCHED_LAZY)) >> schedule(); >> >> if (ti_work & _TIF_NOTIFY_RESUME) >> --- a/include/linux/trace_events.h >> +++ b/include/linux/trace_events.h >> @@ -178,8 +178,8 @@ unsigned int tracing_gen_ctx_irq_test(un >> >> enum trace_flag_type { >> TRACE_FLAG_IRQS_OFF = 0x01, >> - TRACE_FLAG_IRQS_NOSUPPORT = 0x02, >> - TRACE_FLAG_NEED_RESCHED = 0x04, >> + TRACE_FLAG_NEED_RESCHED = 0x02, >> + TRACE_FLAG_NEED_RESCHED_LAZY = 0x04, >> TRACE_FLAG_HARDIRQ = 0x08, >> TRACE_FLAG_SOFTIRQ = 0x10, >> TRACE_FLAG_PREEMPT_RESCHED = 0x20, >> @@ -205,11 +205,11 @@ static inline unsigned int tracing_gen_c >> >> static inline unsigned int tracing_gen_ctx_flags(unsigned long irqflags) >> { >> - return tracing_gen_ctx_irq_test(TRACE_FLAG_IRQS_NOSUPPORT); >> + return tracing_gen_ctx_irq_test(0); >> } >> static inline unsigned int tracing_gen_ctx(void) >> { >> - return tracing_gen_ctx_irq_test(TRACE_FLAG_IRQS_NOSUPPORT); >> + return tracing_gen_ctx_irq_test(0); >> } >> #endif >> >> --- a/kernel/trace/trace_output.c >> +++ b/kernel/trace/trace_output.c >> @@ -460,17 +460,29 @@ int trace_print_lat_fmt(struct trace_seq >> (entry->flags & TRACE_FLAG_IRQS_OFF && bh_off) ? 'D' : >> (entry->flags & TRACE_FLAG_IRQS_OFF) ? 'd' : >> bh_off ? 'b' : >> - (entry->flags & TRACE_FLAG_IRQS_NOSUPPORT) ? 'X' : >> + !IS_ENABLED(CONFIG_TRACE_IRQFLAGS_SUPPORT) ? 'X' : >> '.'; >> >> - switch (entry->flags & (TRACE_FLAG_NEED_RESCHED | >> + switch (entry->flags & (TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_NEED_RESCHED_LAZY | >> TRACE_FLAG_PREEMPT_RESCHED)) { >> + case TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_NEED_RESCHED_LAZY | TRACE_FLAG_PREEMPT_RESCHED: >> + need_resched = 'B'; >> + break; >> case TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_PREEMPT_RESCHED: >> need_resched = 'N'; >> break; >> + case TRACE_FLAG_NEED_RESCHED_LAZY | TRACE_FLAG_PREEMPT_RESCHED: >> + need_resched = 'L'; >> + break; >> + case TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_NEED_RESCHED_LAZY: >> + need_resched = 'b'; >> + break; >> case TRACE_FLAG_NEED_RESCHED: >> need_resched = 'n'; >> break; >> + case TRACE_FLAG_NEED_RESCHED_LAZY: >> + need_resched = 'l'; >> + break; >> case TRACE_FLAG_PREEMPT_RESCHED: >> need_resched = 'p'; >> break; >> --- a/kernel/sched/debug.c >> +++ b/kernel/sched/debug.c >> @@ -333,6 +333,23 @@ static const struct file_operations sche >> .release = seq_release, >> }; >> >> +static ssize_t sched_hog_write(struct file *filp, const char __user *ubuf, >> + size_t cnt, loff_t *ppos) >> +{ >> + unsigned long end = jiffies + 60 * HZ; >> + >> + for (; time_before(jiffies, end) && !signal_pending(current);) >> + cpu_relax(); >> + >> + return cnt; >> +} >> + >> +static const struct file_operations sched_hog_fops = { >> + .write = sched_hog_write, >> + .open = simple_open, >> + .llseek = default_llseek, >> +}; >> + >> static struct dentry *debugfs_sched; >> >> static __init int sched_init_debug(void) >> @@ -374,6 +391,8 @@ static __init int sched_init_debug(void) >> >> debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops); >> >> + debugfs_create_file("hog", 0200, debugfs_sched, NULL, &sched_hog_fops); >> + >> return 0; >> } >> late_initcall(sched_init_debug); >> -- ankur