On Thu, Aug 12 2021 at 22:32, Thomas Gleixner wrote:
> Since the recent consolidation of reprogramming functions,
> hrtimer_force_reprogram() is affected by a check whether the new expiry
> time is past the current expiry time.
>
> This breaks the NOHZ logic as that relies on the fact that the tick hrtimer
> is moved into the future. That means cpu_base->expires_next becomes stale
> and subsequent reprogramming attempts fail as well until the situation is
> cleaned up by an hrtimer interrupt.
>
> For some yet unknown reason this leads to a complete stall, so for now
> partially revert the offending commit to a known working state. The root
> cause for the stall is still being investigated and will be fixed in a
> subsequent commit.

So with brain more awake I actually managed to decode the problem. It's
definitely the expires > cpu_base->expires_next check.

It not only prevents the NOHZ idle case from moving the next timer
interrupt into the future, it also causes the stall when switching into
high resolution / NOHZ mode. At that point the initial base value can be
smaller than the next event, which prevents reprogramming. As the base
value stays stale, it prevents any further reprogramming until there is a
full update of the base, which makes the problem go away.

TBH, that optimization logic to prevent reprogramming the timer hardware
for nothing is a bit fragile and non-obvious. I'll have a look to make
this more robust and less obscure.

Thanks,

        tglx
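To make the failure mode concrete, here is a minimal user-space sketch of the check discussed above. It is not the kernel code: `struct model_base`, `model_reprogram_checked()` and `model_reprogram_forced()` are hypothetical names invented for this illustration, and the timer hardware is reduced to a single field. It only shows how a cached expiry that is smaller than the next event makes every later attempt get skipped until something updates the cached value unconditionally.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical model of the per-CPU state; NOT the kernel's hrtimer_cpu_base. */
struct model_base {
	uint64_t expires_next;	/* cached "next programmed expiry" */
	uint64_t hw_expiry;	/* what the simulated hardware is armed for */
};

/*
 * Reprogram with the optimization: skip the hardware write when the new
 * expiry is later than the cached one. Returns true if hardware was written.
 */
static bool model_reprogram_checked(struct model_base *b, uint64_t expires)
{
	assert(b != NULL);

	if (expires > b->expires_next)
		return false;	/* skipped, cached value is left untouched */

	b->expires_next = expires;
	b->hw_expiry = expires;
	return true;
}

/*
 * Forced reprogram (assumed behaviour for this sketch): always update the
 * cached value and the hardware, so the timer can move into the future.
 */
static bool model_reprogram_forced(struct model_base *b, uint64_t expires)
{
	assert(b != NULL);

	b->expires_next = expires;
	b->hw_expiry = expires;
	return true;
}
```

With a cached value of 100 and a next event at 500, the checked variant skips the write and leaves the cached value at 100, so a later attempt at 600 is skipped as well. Only the forced variant moves the cached value to 500, after which an earlier expiry such as 400 goes through the checked path again.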