Re: [PATCH] timekeeping: Align tick_sched_timer() with the HZ tick. -- regression report

Mathias Krause <minipli@xxxxxxxxxxxxxx> · Wed, 14 Jun 2023 00:59:46 +0200

Hi Sebastian,

On 18.04.23 14:26, Sebastian Andrzej Siewior wrote:
> [...]. The timer still fires
> every 4ms with HZ=250 but timer is no longer aligned with
> CLOCK_MONOTONIC with 0 as it origin but has an offset in the us/ns part
> of the timestamp. The offset differs with every boot and makes it
> impossible for user land to align with the tick.

I can observe these per-boot offsets too, but...

> Align the tick timer with CLOCK_MONOTONIC ensuring that it is always a
> multiple of 1000/CONFIG_HZ ms.

this change doesn't seem to achieve that goal, unfortunately. Quite the 
opposite. It makes the (boot) clock run faster and, because of the per-
boot different offset, differently fast for each boot. Up to the point 
where it's running too fast to make any progress at all.

This patch causes VM boot hangs for us. It took a while to identify as 
the boot hangs were only ~1 out of 30 but it's clearly it. Reverting 
the commit got me 100 boots in a row without any issue.

Instrumenting the kernel a little gave me a clue what the bug is. When 
switching from the boot timer tick device (which is 'hpet' in my setup) 
to 'lapic-deadline', the mode of the timer isn't changed and kept at 
TICKDEV_MODE_PERIODIC. As that device doesn't support this mode, 
tick_setup_periodic() will switch over to CLOCK_EVT_STATE_ONESHOT mode 
and program the next expire event based on tick_next_period.

clockevents_program_event() will calculate the delta of that timestamp 
and ktime_get() and pass that value on to dev->set_next_event() (which 
is lapic_next_deadline()) which will write it to the IA32_TSC_DEADLINE 
MSR.

That delta value -- which is still the per-boot different offset to 
ktime_get() your patch introduces -- now gets stuck and is taken as the 
new *jiffies tick time*. That's because tick_handle_periodic() -> 
tick_periodic() will advance tick_next_period by TICK_NSEC, make 
do_timer() increment jiffies_64 by one and then program the next event 
to be in TICK_NSEC ns based on the device's old expiry time, i.e. keep 
the offset intact. This is followed by re-arming the event by a call to 
clockevents_program_event() which does the already-know delta 
calculation and writes, again, the too little value to 
IA32_TSC_DEADLINE.

This effectively makes the jiffies based clock go too fast as the timer 
IRQ comes too early (less than TICK_NSEC ns). Sometimes it's barely 
noticeable, but sometimes it's so fast that the kernel is overloaded 
with only handling the local timer IRQ without making any further 
progress, especially in (nested) VM setups.

Without commit e9523a0d8189 ("tick/common: Align tick period with the 
HZ tick."), which was backported to many stable and LTS kernels (v6.3.2 
(571c3b46c9b3), v6.2.15 (f0cb827199ec), v6.1.28 (290e26ec0d01), 
v5.15.111 (a55050c7989c), v5.10.180 (c4013689269d) and v5.4.243 
(a3e7a3d472c2)) this clock drift is gone and my VMs boot again.

Before that commit, the delta between tick_next_period and ktime_get() 
was initially zero, so tick_handle_period() had to loop, as 
clockevents_program_event() will return with -ETIME. The next attempt 
would be done with a delta of TICK_NSEC which will make 
clockevents_program_event() succeed and ensure that future events don't 
need the additional loop iteration, as the delta got stuck at TICK_NSEC 
-- exactly where it should be.

We observed the bug first on the v6.3, v6.1 and v5.15 stable branch 
updates from May 11th and then, a week later, on v5.4 too. All first 
occurrences were coinciding with the bad commit going into the 
corresponding stable and LTS kernel releases.

The issue manifests itself as a fast running clock only during boot, 
when the clock source is still jiffies based. That'll eventually lead 
to a boot hang as the timer IRQs are firing too fast.

To reproduce this you can either boot loop a VM and try to get "lucky" 
to hit a big enough 'rem' value or just apply this little diff instead:

---8<---

diff --git a/kernel/time/tick-common.c b/kernel/time/tick-common.c
index 65b8658da829..b01cf18a5d42 100644
--- a/kernel/time/tick-common.c
+++ b/kernel/time/tick-common.c
@@ -225,6 +225,7 @@ static void tick_setup_device(struct tick_device *td,
 
 			next_p = ktime_get();
 			div_u64_rem(next_p, TICK_NSEC, &rem);
+			rem = TICK_NSEC - 123;
 			if (rem) {
 				next_p -= rem;
 				next_p += TICK_NSEC;
--->8---

This should make the kernel get stuck with only handling timer ticks 
but not making any further progress.

Change the subtrahend to 1234 to get a system that boots but has an 
unrealistically fast clock during kernel initialization.

As reverting that commit fixes the issue for us but it seemingly fixes 
another bug for Klaus (or at least attempted to), the now uncovered bug 
should be fixed instead.

The fundamental issue is that the jiffies based clock source cannot be 
trusted and shouldn't be used to calculate offsets to timestamps in the 
future when tick_next_period mod ktime_get() != 0.

Can we defer the offset adjustment of tick_next_period to a later point 
in time when a stable clock source gets used, like 'tsc'?

Thanks,
Mathias