Hi,
In our multi-core x86-based system running kernel version 3.4.19, hrtimer_interrupt (called from apic_timer_interrupt) keeps looping in hardirq context for at least 1.6 seconds. We use the TSC as our clock source. The issue happens very rarely and is hard to reproduce.
Problem:
Inside the hrtimer_interrupt function, basenow.tv64 on CPU-3 is 1.6 seconds ahead of the other CPUs (we have 4 cores), whereas hrtimer->_softexpires.tv64 is in sync with the remaining CPUs. Because of this, the condition inside hrtimer_interrupt that checks whether basenow.tv64 < hrtimer_get_softexpires_tv64(timer) stays false for 1.6 seconds, which prevents the while loop inside hrtimer_interrupt from exiting. Below is the ftrace output captured during the problem.
<idle>-0 [002] d.h. 800364.533632: hrtimer_expire_entry: hrtimer=ffff88017fd0c960 function=tick_sched_timer now=801616439840902
ksoftirqd/3-19 [003] dNh. 800364.539178: hrtimer_expire_entry: hrtimer=ffff88017fd8c960 function=tick_sched_timer now=801618042768641
ksoftirqd/3-19 [003] dNh. 800364.539185: hrtimer_start: hrtimer=ffff88017fd8c960 function=tick_sched_timer expires=801616446505014 softexpires=801616446505014
As we can see, the difference in the now value between CPU-2 and CPU-3 (where the time jump is seen) is significant. The trace shows that now on CPU-3 is ahead of CPU-2 by about 1603 milliseconds, even though the trace timestamps of the two events are only about 6 milliseconds apart. Also, since the hrtimer expiry time is in the past relative to CPU-3's now, we end up spending a lot of time in hardirq. From my understanding of the code, basenow.tv64 is computed in hrtimer_update_base() -> ktime_get_update_offsets() as timekeeper.xtime - offs_real. Both timekeeper.xtime and offs_real are always updated under a lock, so I am still unsure how only one core can be seeing the time incorrectly.
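For reference, here is a small Python sketch that redoes the arithmetic on the numbers in the ftrace lines above. It is only a check of the figures, not kernel code: the helper names (ts_to_us, now_skew_ns, trace_gap_us, timer_already_expired) are made up for illustration, and the last one merely mirrors the basenow.tv64 < softexpires comparison described earlier.

```python
# Values copied from the ftrace lines above (nanoseconds).
NOW_CPU2_NS = 801616439840902     # now= reported on CPU 2
NOW_CPU3_NS = 801618042768641     # now= reported on CPU 3
SOFTEXPIRES_NS = 801616446505014  # softexpires= of the timer re-armed on CPU 3

# ftrace timestamps (seconds), kept as strings to avoid float rounding.
TS_CPU2 = "800364.533632"
TS_CPU3 = "800364.539178"


def ts_to_us(ts):
    """Convert a 'seconds.microseconds' trace timestamp to integer microseconds."""
    sec, usec = ts.split(".")
    return int(sec) * 1_000_000 + int(usec)


def now_skew_ns():
    """How far CPU 3's 'now' is ahead of CPU 2's 'now'."""
    return NOW_CPU3_NS - NOW_CPU2_NS


def trace_gap_us():
    """Time between the two trace events according to the trace clock."""
    return ts_to_us(TS_CPU3) - ts_to_us(TS_CPU2)


def timer_already_expired(basenow_ns, softexpires_ns):
    """Hypothetical helper: the loop only stops when basenow < softexpires."""
    return not (basenow_ns < softexpires_ns)


if __name__ == "__main__":
    print("now skew (ms):", now_skew_ns() / 1e6)
    print("trace gap (ms):", trace_gap_us() / 1e3)
    print("expired as seen from CPU 3:",
          timer_already_expired(NOW_CPU3_NS, SOFTEXPIRES_NS))
    print("expired as seen from CPU 2:",
          timer_already_expired(NOW_CPU2_NS, SOFTEXPIRES_NS))
    print("expiry is in CPU 3's past by (ms):",
          (NOW_CPU3_NS - SOFTEXPIRES_NS) / 1e6)
```

With these inputs, the re-armed timer's softexpires is still in the future relative to CPU-2's now but already in the past relative to CPU-3's now, which is consistent with the timer being treated as expired again immediately on CPU-3.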
Any input would be a great help.
Thanks,
Raj