On Mon, Jan 29, 2018 at 03:20:32PM +0100, Sebastian Andrzej Siewior wrote:
> From: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
>
> commit d5421ea43d30701e03cadc56a38854c36a8b4433 upstream.
>
> The hrtimer interrupt code contains a hang detection and mitigation
> mechanism, which prevents that a long delayed hrtimer interrupt causes a
> continous retriggering of interrupts which prevent the system from making
> progress. If a hang is detected then the timer hardware is programmed with
> a certain delay into the future and a flag is set in the hrtimer cpu base
> which prevents newly enqueued timers from reprogramming the timer hardware
> prior to the chosen delay. The subsequent hrtimer interrupt after the delay
> clears the flag and resumes normal operation.
>
> If such a hang happens in the last hrtimer interrupt before a CPU is
> unplugged then the hang_detected flag is set and stays that way when the
> CPU is plugged in again. At that point the timer hardware is not armed and
> it cannot be armed because the hang_detected flag is still active, so
> nothing clears that flag. As a consequence the CPU does not receive hrtimer
> interrupts and no timers expire on that CPU which results in RCU stalls and
> other malfunctions.
>
> Clear the flag along with some other less critical members of the hrtimer
> cpu base to ensure starting from a clean state when a CPU is plugged in.
>
> Thanks to Paul, Sebastian and Anna-Maria for their help to get down to the
> root cause of that hard to reproduce heisenbug. Once understood it's
> trivial and certainly justifies a brown paperbag.
>
> Fixes: 41d2e4949377 ("hrtimer: Tune hrtimer_interrupt hang logic")
> Reported-by: Paul E. McKenney <paulmck@xxxxxxxxxxxxxxxxxx>
> Signed-off-by: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
> Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> Cc: Sebastian Sewior <bigeasy@xxxxxxxxxxxxx>
> Cc: Anna-Maria Gleixner <anna-maria@xxxxxxxxxxxxx>
> Cc: stable@xxxxxxxxxxxxxxx
> Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1801261447590.2067@nanos
> [bigeasy: backport to v3.18, drop ->next_timer it was introduced later]

Thanks for the backport, now queued up.

greg k-h
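
For readers following the quoted changelog, here is a rough user-space sketch of the failure mode it describes. This is a toy model, not the kernel code and not the patch itself: struct toy_cpu_base and every toy_* function are hypothetical names made up for illustration, and only the hang_detected flag name comes from the changelog. It assumes the simplest possible behaviour, namely that enqueueing refuses to arm the hardware while the flag is set, and that unplugging the CPU disarms the hardware without touching the flag.

```c
#include <stdbool.h>

/* Toy stand-in for the per-CPU hrtimer base (hypothetical, not the kernel struct). */
struct toy_cpu_base {
	bool hang_detected;	/* set when the interrupt path detects a hang */
	bool hw_armed;		/* true while the timer hardware has an event pending */
};

/* Enqueue a timer: the hardware may only be programmed if no hang delay is pending. */
static bool toy_enqueue(struct toy_cpu_base *b)
{
	if (b->hang_detected)
		return false;	/* reprogramming is blocked */
	b->hw_armed = true;
	return true;
}

/*
 * Timer interrupt: if a hang is detected, set the flag and arm the hardware
 * with a delay; otherwise clear the flag and resume normal operation.
 */
static void toy_interrupt(struct toy_cpu_base *b, bool hang)
{
	b->hang_detected = hang;
	b->hw_armed = hang;
}

/* CPU unplug: the hardware is no longer armed, the flag is left as it was. */
static void toy_cpu_offline(struct toy_cpu_base *b)
{
	b->hw_armed = false;
}

/* CPU plug-in: optionally reset the stale flag, as the changelog proposes. */
static void toy_cpu_online(struct toy_cpu_base *b, bool reset_state)
{
	if (reset_state)
		b->hang_detected = false;
}

/*
 * Replay the scenario: hang in the last interrupt, unplug, plug in again,
 * then try to enqueue a timer. Returns true if the hardware could be armed.
 */
static bool toy_hotplug_after_hang(bool reset_state)
{
	struct toy_cpu_base b = { false, false };

	toy_interrupt(&b, true);
	toy_cpu_offline(&b);
	toy_cpu_online(&b, reset_state);
	return toy_enqueue(&b);
}
```

In this model toy_hotplug_after_hang(false) leaves the CPU unable to arm its timer, because no interrupt is pending that could clear the flag, while toy_hotplug_after_hang(true) starts from a clean state and the enqueue succeeds.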