On Sat, Jan 18, 2025 at 12:24:33AM +0100, Frederic Weisbecker wrote:
> hrtimers are migrated away from the dying CPU to any online target at
> the CPUHP_AP_HRTIMERS_DYING stage in order not to delay bandwidth timers
> that handle tasks involved in the CPU hotplug forward progress.
>
> However, wake-ups can still be performed by the outgoing CPU after
> CPUHP_AP_HRTIMERS_DYING, and those can again result in bandwidth timers
> being armed. Depending on several considerations (crystal ball
> power management based election, earliest timer already enqueued, timer
> migration enabled or not), the target may eventually be the current
> CPU even if offline. If that happens, the timer is eventually ignored.
>
> The most notable example is RCU, which had to handle each and every one
> of those wake-ups by deferring them to an online CPU, along with related
> workarounds:
>
> _ e787644caf76 (rcu: Defer RCU kthreads wakeup when CPU is dying)
> _ 9139f93209d1 (rcu/nocb: Fix RT throttling hrtimer armed from offline CPU)
> _ f7345ccc62a4 (rcu/nocb: Fix rcuog wake-up from offline softirq)
>
> The problem isn't confined to RCU, though, as the stop machine kthread
> (which runs CPUHP_AP_HRTIMERS_DYING) reports its completion at the end
> of its work through cpu_stop_signal_done() and performs a wake-up that
> eventually arms the deadline server timer:
>
>    WARNING: CPU: 94 PID: 588 at kernel/time/hrtimer.c:1086 hrtimer_start_range_ns+0x289/0x2d0
>    CPU: 94 UID: 0 PID: 588 Comm: migration/94 Not tainted
>    Stopper: multi_cpu_stop+0x0/0x120 <- stop_machine_cpuslocked+0x66/0xc0
>    RIP: 0010:hrtimer_start_range_ns+0x289/0x2d0
>    Call Trace:
>    <TASK>
>     ? hrtimer_start_range_ns
>     start_dl_timer
>     enqueue_dl_entity
>     dl_server_start
>     enqueue_task_fair
>     enqueue_task
>     ttwu_do_activate
>     try_to_wake_up
>     complete
>     cpu_stopper_thread
>     smpboot_thread_fn
>     kthread
>     ret_from_fork
>     ret_from_fork_asm
>    </TASK>
>
> Instead of providing yet another band-aid to work around the situation,
> fix it in the hrtimers infrastructure instead: always migrate a timer
> away to an online target whenever it is enqueued from an offline CPU.
>
> This will also allow reverting all of the above disgraceful RCU hacks.
>
> Reported-by: Vlad Poenaru <vlad.wing@xxxxxxxxx>
> Reported-by: Usama Arif <usamaarif642@xxxxxxxxx>
> Fixes: 5c0930ccaad5 ("hrtimers: Push pending hrtimers away from outgoing CPU earlier")
> Closes: 20241213203739.1519801-1-usamaarif642@xxxxxxxxx
> Signed-off-by: Frederic Weisbecker <frederic@xxxxxxxxxx>
> Signed-off-by: Paul E. McKenney <paulmck@xxxxxxxxxx>

This passes over-holiday rcutorture testing, so, perhaps redundantly:

Tested-by: Paul E. McKenney <paulmck@xxxxxxxxxx>
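For anyone skimming the patch, a minimal sketch of the target-selection
rule it establishes follows. The types and helper below are illustrative
stand-ins rather than the kernel code; only the decision order mirrors
the new hrtimer_suitable_target():

#include <stdbool.h>

struct base {
	bool online;		/* cleared by hrtimer_cpu_dying() */
	long long expires_next;	/* next programmed event, in ns */
};

/*
 * Return true when "elected" (the base chosen by target election) may
 * keep the timer, false when the caller should fall back to the local
 * base.
 */
static bool suitable_target(struct base *this_base, struct base *elected,
			    long long expires)
{
	/* The local CPU can always reprogram its own clockevent. */
	if (elected == this_base)
		return true;

	/*
	 * Offline local CPU: it must not become the fallback target, so
	 * the elected online base is kept. If the timer turns out to be
	 * the first expiring one there, an async IPI reprograms the
	 * remote clockevent as a last resort.
	 */
	if (!this_base->online)
		return true;

	/*
	 * A remote target whose next event fires after this timer would
	 * need an IPI that may miss a high resolution deadline, so the
	 * local CPU is preferred in that case.
	 */
	return expires >= elected->expires_next;
}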
> ---
>  include/linux/hrtimer_defs.h |  1 +
>  kernel/time/hrtimer.c        | 92 +++++++++++++++++++++++++++++-------
>  2 files changed, 75 insertions(+), 18 deletions(-)
>
> diff --git a/include/linux/hrtimer_defs.h b/include/linux/hrtimer_defs.h
> index c3b4b7ed7c16..84a5045f80f3 100644
> --- a/include/linux/hrtimer_defs.h
> +++ b/include/linux/hrtimer_defs.h
> @@ -125,6 +125,7 @@ struct hrtimer_cpu_base {
>  	ktime_t				softirq_expires_next;
>  	struct hrtimer			*softirq_next_timer;
>  	struct hrtimer_clock_base	clock_base[HRTIMER_MAX_CLOCK_BASES];
> +	call_single_data_t		csd;
>  } ____cacheline_aligned;
>
>
> diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
> index 030426c8c944..082d4b687fa1 100644
> --- a/kernel/time/hrtimer.c
> +++ b/kernel/time/hrtimer.c
> @@ -58,6 +58,8 @@
>  #define HRTIMER_ACTIVE_SOFT	(HRTIMER_ACTIVE_HARD << MASK_SHIFT)
>  #define HRTIMER_ACTIVE_ALL	(HRTIMER_ACTIVE_SOFT | HRTIMER_ACTIVE_HARD)
>
> +static void retrigger_next_event(void *arg);
> +
>  /*
>   * The timer bases:
>   *
> @@ -111,7 +113,8 @@ DEFINE_PER_CPU(struct hrtimer_cpu_base, hrtimer_bases) =
>  			.clockid = CLOCK_TAI,
>  			.get_time = &ktime_get_clocktai,
>  		},
> -	}
> +	},
> +	.csd = CSD_INIT(retrigger_next_event, NULL)
>  };
>
>  static const int hrtimer_clock_to_base_table[MAX_CLOCKS] = {
> @@ -124,6 +127,14 @@ static const int hrtimer_clock_to_base_table[MAX_CLOCKS] = {
>  	[CLOCK_TAI]	= HRTIMER_BASE_TAI,
>  };
>
> +static inline bool hrtimer_base_is_online(struct hrtimer_cpu_base *base)
> +{
> +	if (!IS_ENABLED(CONFIG_HOTPLUG_CPU))
> +		return true;
> +	else
> +		return likely(base->online);
> +}
> +
>  /*
>   * Functions and macros which are different for UP/SMP systems are kept in a
>   * single place
> @@ -183,27 +194,58 @@ struct hrtimer_clock_base *lock_hrtimer_base(const struct hrtimer *timer,
>  }
>
>  /*
> - * We do not migrate the timer when it is expiring before the next
> - * event on the target cpu. When high resolution is enabled, we cannot
> - * reprogram the target cpu hardware and we would cause it to fire
> - * late. To keep it simple, we handle the high resolution enabled and
> - * disabled case similar.
> + * Check if the elected target is suitable considering its next
> + * event and the hotplug state of the current CPU.
> + *
> + * If the elected target is remote and its next event is after the timer
> + * to queue, then a remote reprogram is necessary. However there is no
> + * guarantee the IPI handling the operation would arrive in time to meet
> + * the high resolution deadline. In this case the local CPU becomes a
> + * preferred target, unless it is offline.
> + *
> + * High and low resolution modes are handled the same way for simplicity.
>   *
>   * Called with cpu_base->lock of target cpu held.
>   */
> -static int
> -hrtimer_check_target(struct hrtimer *timer, struct hrtimer_clock_base *new_base)
> +static bool
> +hrtimer_suitable_target(struct hrtimer *timer,
> +			struct hrtimer_clock_base *new_base,
> +			struct hrtimer_cpu_base *new_cpu_base,
> +			struct hrtimer_cpu_base *this_cpu_base)
>  {
>  	ktime_t expires;
>
> +	/*
> +	 * The local CPU clockevent can be reprogrammed. Also get_target_base()
> +	 * guarantees it is online.
> +	 */
> +	if (new_cpu_base == this_cpu_base)
> +		return true;
> +
> +	/*
> +	 * The offline local CPU can't be the default target if the
> +	 * next remote target event is after this timer. Keep the
> +	 * elected new base. An IPI will be issued to reprogram
> +	 * it as a last resort.
> +	 */
> +	if (!hrtimer_base_is_online(this_cpu_base))
> +		return true;
> +
>  	expires = ktime_sub(hrtimer_get_expires(timer), new_base->offset);
> -	return expires < new_base->cpu_base->expires_next;
> +
> +	return expires >= new_base->cpu_base->expires_next;
>  }
>
>  static inline
>  struct hrtimer_cpu_base *get_target_base(struct hrtimer_cpu_base *base,
>  					 int pinned)
>  {
> +	if (!hrtimer_base_is_online(base)) {
> +		int cpu = cpumask_any_and(cpu_online_mask, housekeeping_cpumask(HK_TYPE_TIMER));
> +
> +		return &per_cpu(hrtimer_bases, cpu);
> +	}
> +
>  #if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON)
>  	if (static_branch_likely(&timers_migration_enabled) && !pinned)
>  		return &per_cpu(hrtimer_bases, get_nohz_timer_target());
> @@ -254,8 +296,8 @@ switch_hrtimer_base(struct hrtimer *timer, struct hrtimer_clock_base *base,
>  		raw_spin_unlock(&base->cpu_base->lock);
>  		raw_spin_lock(&new_base->cpu_base->lock);
>
> -		if (new_cpu_base != this_cpu_base &&
> -		    hrtimer_check_target(timer, new_base)) {
> +		if (!hrtimer_suitable_target(timer, new_base, new_cpu_base,
> +					     this_cpu_base)) {
>  			raw_spin_unlock(&new_base->cpu_base->lock);
>  			raw_spin_lock(&base->cpu_base->lock);
>  			new_cpu_base = this_cpu_base;
> @@ -264,8 +306,8 @@ switch_hrtimer_base(struct hrtimer *timer, struct hrtimer_clock_base *base,
>  		}
>  		WRITE_ONCE(timer->base, new_base);
>  	} else {
> -		if (new_cpu_base != this_cpu_base &&
> -		    hrtimer_check_target(timer, new_base)) {
> +		if (!hrtimer_suitable_target(timer, new_base, new_cpu_base,
> +					     this_cpu_base)) {
>  			new_cpu_base = this_cpu_base;
>  			goto again;
>  		}
> @@ -716,8 +758,6 @@ static inline int hrtimer_is_hres_enabled(void)
>  	return hrtimer_hres_enabled;
>  }
>
> -static void retrigger_next_event(void *arg);
> -
>  /*
>   * Switch to high resolution mode
>   */
> @@ -1206,6 +1246,7 @@ static int __hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim,
>  				    u64 delta_ns, const enum hrtimer_mode mode,
>  				    struct hrtimer_clock_base *base)
>  {
> +	struct hrtimer_cpu_base *this_cpu_base = this_cpu_ptr(&hrtimer_bases);
>  	struct hrtimer_clock_base *new_base;
>  	bool force_local, first;
>
> @@ -1217,9 +1258,15 @@ static int __hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim,
>  	 * and enforce reprogramming after it is queued no matter whether
>  	 * it is the new first expiring timer again or not.
>  	 */
> -	force_local = base->cpu_base == this_cpu_ptr(&hrtimer_bases);
> +	force_local = base->cpu_base == this_cpu_base;
>  	force_local &= base->cpu_base->next_timer == timer;
>
> +	/*
> +	 * Don't force local queuing if this enqueue happens on an unplugged
> +	 * CPU after hrtimer_cpu_dying() has been invoked.
> +	 */
> +	force_local &= this_cpu_base->online;
> +
>  	/*
>  	 * Remove an active timer from the queue. In case it is not queued
>  	 * on the current CPU, make sure that remove_hrtimer() updates the
> @@ -1249,8 +1296,17 @@ static int __hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim,
>  	}
>
>  	first = enqueue_hrtimer(timer, new_base, mode);
> -	if (!force_local)
> -		return first;
> +	if (!force_local) {
> +		if (hrtimer_base_is_online(this_cpu_base))
> +			return first;
> +
> +		if (first) {
> +			struct hrtimer_cpu_base *new_cpu_base = new_base->cpu_base;
> +
> +			smp_call_function_single_async(new_cpu_base->cpu, &new_cpu_base->csd);
> +		}
> +		return 0;
> +	}
>
>  	/*
>  	 * Timer was forced to stay on the current CPU to avoid
> --
> 2.46.0
>
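Also worth spelling out is the async-kick machinery the patch leans on
when the enqueue happens from an offline CPU. CSD_INIT() and
smp_call_function_single_async() are the real kernel APIs the patch
uses; the structure and helper below are hypothetical stand-ins that
only sketch the pattern:

#include <linux/smp.h>
#include <linux/percpu.h>

/* Runs on the target CPU from IPI context; in the patch this is the
 * existing retrigger_next_event(), which re-evaluates expires_next. */
static void retrigger_next_event(void *arg);

struct demo_cpu_base {			/* stand-in for hrtimer_cpu_base */
	unsigned int		cpu;
	call_single_data_t	csd;
};

/* One csd per CPU, pre-bound to the handler, as the patch does for
 * hrtimer_bases via CSD_INIT(). */
static DEFINE_PER_CPU(struct demo_cpu_base, demo_bases) = {
	.csd = CSD_INIT(retrigger_next_event, NULL),
};

/*
 * Called after enqueue_hrtimer() queued the new first-expiring timer on
 * a remote base while the local CPU is offline: fire-and-forget, the
 * target CPU reprograms its own clockevent from the IPI handler, and
 * the offline caller never waits.
 */
static void kick_remote_reprogram(struct demo_cpu_base *new_cpu_base)
{
	smp_call_function_single_async(new_cpu_base->cpu, &new_cpu_base->csd);
}

This is also why the forward declaration of retrigger_next_event()
moves to the top of hrtimer.c: the hrtimer_bases initializer now
references it through CSD_INIT().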