RFC: NTP adjustments interfere with KVM emulation of TSC deadline timers

Maxim Levitsky <mlevitsk@xxxxxxxxxx> · Thu, 21 Dec 2023 18:51:50 +0200

Hi!

Recently I was tasked with triaging failures of the 'vmx_preemption_timer'
test that happen in our kernel CI pipeline.

The test usually fails because L2 observes a TSC value past the
preemption timer deadline before the VM exit happens.

This happens because KVM emulates the nested preemption timer with HR timers:
it converts the preemption timer value to nanoseconds, taking into account
TSC scaling and the host TSC frequency, and arms an HR timer.
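
To make the conversion concrete, here is a rough userspace sketch of the
arithmetic. It is not the KVM code. The function and parameter names are
made up for illustration, and guest_tsc_khz is assumed to already include
any TSC scaling.

```python
def preemption_timer_to_ns(timer_value: int, rate_shift: int,
                           guest_tsc_khz: int) -> int:
    """Convert a VMX preemption timer value to a nanosecond timeout.

    timer_value   : value programmed into the preemption timer
    rate_shift    : the timer ticks once every 2**rate_shift TSC cycles
    guest_tsc_khz : hypothetical parameter, the TSC frequency in kHz as
                    seen by the guest (assumed to include TSC scaling)
    """
    tsc_cycles = timer_value << rate_shift
    # cycles / (kHz * 1000) seconds == cycles * 1_000_000 / kHz nanoseconds
    return tsc_cycles * 1_000_000 // guest_tsc_khz
```

The resulting nanosecond value is only as good as the clock that later
counts it down, which is where the problem below comes from.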

However, as I found out the hard way, HR timers are bound to CLOCK_MONOTONIC,
whose rate can be adjusted by NTP. The timer can therefore run slower or
faster than KVM expects, so the interrupt can arrive early or late, and
late arrival is what is happening here.
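
As a rough illustration of the size of the effect, the sketch below estimates
how late (or early) a timer fires when CLOCK_MONOTONIC is slewed. It assumes
the slew is constant for the whole timer period, which is a simplification.
The function name and the sample numbers are arbitrary, not measurements.

```python
def timer_lateness_ns(timeout_ns: int, slew_ppm: int) -> int:
    """Return how many nanoseconds of unadjusted (raw) time the timer is off.

    timeout_ns : timeout programmed on CLOCK_MONOTONIC
    slew_ppm   : assumed constant rate adjustment of CLOCK_MONOTONIC in
                 parts per million (negative means the clock runs slower)

    A positive result means the timer fires late in raw time, a negative
    result means it fires early.
    """
    # The timer expires after timeout_ns of monotonic time, which takes
    # timeout_ns / (1 + slew_ppm / 1e6) nanoseconds of raw time.
    raw_ns = timeout_ns * 1_000_000 // (1_000_000 + slew_ppm)
    return raw_ns - timeout_ns


# Example with arbitrary numbers: a 10 ms timeout while the clock is
# slowed down by 500 ppm.
late_ns = timer_lateness_ns(10_000_000, -500)
```

Even a few microseconds of lateness is many TSC cycles, so a guest that
polls the TSC can easily see a value past the deadline.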

This is how you can reproduce it on an Intel machine:

1. Stop the NTP daemon:
      sudo systemctl stop chronyd.service

2. Introduce a small error in the system time:
      sudo date -s "$(date)"

3. Start the NTP daemon:
      sudo chronyd -d -n  (for debug), or start the systemd service again

4. Run the vmx_preemption_timer test a few times until it fails.

I did some research and it looks like I am not the first to encounter this:

On the ARM side there was an attempt to support CLOCK_MONOTONIC_RAW in the
timer subsystem, which was even merged but then reverted due to issues:

https://lore.kernel.org/all/1452879670-16133-3-git-send-email-marc.zyngier@xxxxxxx/T/#u

It looks like this issue was later worked around in the ARM code:

commit 1c5631c73fc2261a5df64a72c155cb53dcdc0c45
Author: Marc Zyngier <maz@xxxxxxxxxx>
Date:   Wed Apr 6 09:37:22 2016 +0100

    KVM: arm/arm64: Handle forward time correction gracefully

    On a host that runs NTP, corrections can have a direct impact on
    the background timer that we program on the behalf of a vcpu.

    In particular, NTP performing a forward correction will result in
    a timer expiring sooner than expected from a guest point of view.
    Not a big deal, we kick the vcpu anyway.

    But on wake-up, the vcpu thread is going to perform a check to
    find out whether or not it should block. And at that point, the
    timer check is going to say "timer has not expired yet, go back
    to sleep". This results in the timer event being lost forever.

    There are multiple ways to handle this. One would be record that
    the timer has expired and let kvm_cpu_has_pending_timer return
    true in that case, but that would be fairly invasive. Another is
    to check for the "short sleep" condition in the hrtimer callback,
    and restart the timer for the remaining time when the condition
    is detected.

    This patch implements the latter, with a bit of refactoring in
    order to avoid too much code duplication.

    Cc: <stable@xxxxxxxxxxxxxxx>
    Reported-by: Alexander Graf <agraf@xxxxxxx>
    Reviewed-by: Alexander Graf <agraf@xxxxxxx>
    Signed-off-by: Marc Zyngier <marc.zyngier@xxxxxxx>
    Signed-off-by: Christoffer Dall <christoffer.dall@xxxxxxxxxx>

So to solve this issue there are two options:

1. Have another go at implementing support for CLOCK_MONOTONIC_RAW timers.
   I don't know if that is feasible, and I would be very happy to hear your
   feedback.

2. Work around this in KVM as well. KVM does listen to changes in the
   timekeeping system (the kernel calls its update_pvclock_gtod), and it even
   records the rates of both the regular and the raw clocks.

   When starting an HR timer I can adjust its period for the difference in
   rates. In most cases this will produce a more correct result than what we
   have now, but it will still fail if the rate is changed at the same time
   the timer is started or before it expires.

   Alternatively I can restart the timer, although that might cause more harm
   than good to the accuracy.
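
The rate adjustment from option 2 could look roughly like the sketch below.
This is only the arithmetic, written as userspace code. mono_mult and raw_mult
are hypothetical names for the cycles-to-nanoseconds multipliers of the
regular and raw clocks, and the sketch assumes both use the same shift.

```python
def compensate_timeout_ns(raw_timeout_ns: int, mono_mult: int,
                          raw_mult: int) -> int:
    """Scale a timeout computed in raw time to CLOCK_MONOTONIC time.

    raw_timeout_ns : timeout derived from the TSC, in unadjusted nanoseconds
    mono_mult      : hypothetical multiplier of the NTP-adjusted clock
    raw_mult       : hypothetical multiplier of the raw clock

    Assumption: both clocks convert the same cycle count with
    (cycles * mult) >> shift and share the same shift, so the ratio of the
    multipliers is the ratio of the clock rates.
    """
    return raw_timeout_ns * mono_mult // raw_mult
```

The multipliers are sampled when the timer is armed, so the result is stale
if the rate changes before the timer expires, which is the remaining failure
case mentioned above.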

What do you think?

Best regards,
	Maxim Levitsky