Re: RFC: NTP adjustments interfere with KVM emulation of TSC deadline timers

Jim Mattson <jmattson@xxxxxxxxxx> · Thu, 21 Dec 2023 11:09:57 -0800



On Thu, Dec 21, 2023 at 8:52 AM Maxim Levitsky <mlevitsk@xxxxxxxxxx> wrote:
>
>
> Hi!
>
> Recently I was tasked with triage of the failures of 'vmx_preemption_timer'
> that happen in our kernel CI pipeline.
>
>
> The test usually fails because L2 observes TSC after the
> preemption timer deadline, before the VM exit happens.
>
> This happens because KVM emulates nested preemption timer with HR timers,
> so it converts the preemption timer value to nanoseconds, taking in account
> tsc scaling and host tsc frequency, and sets HR timer.
>
> HR timer however as I found out the hard way is bound to CLOCK_MONOTONIC,
> and thus its rate can be adjusted by NTP, which means that it can run slower or
> faster than KVM expects, which can result in the interrupt arriving earlier,
> or late, which is what is happening.
>
> This is how you can reproduce it on an Intel machine:
>
>
> 1. stop the NTP daemon:
>       sudo systemctl stop chronyd.service
> 2. introduce a small error in the system time:
>       sudo date -s "$(date)"
>
> 3. start NTP daemon:
>       sudo chronyd -d -n  (for debug) or start the systemd service again
>
> 4. run the vmx_preemption_timer test a few times until it fails:
>
>
> I did some research and it looks like I am not the first to encounter this:
>
> From the ARM side there was an attempt to support CLOCK_MONOTONIC_RAW with
> timer subsystem which was even merged but then reverted due to issues:
>
> https://lore.kernel.org/all/1452879670-16133-3-git-send-email-marc.zyngier@xxxxxxx/T/#u
>
> It looks like this issue was later worked around in the ARM code:
>
>
> commit 1c5631c73fc2261a5df64a72c155cb53dcdc0c45
> Author: Marc Zyngier <maz@xxxxxxxxxx>
> Date:   Wed Apr 6 09:37:22 2016 +0100
>
>     KVM: arm/arm64: Handle forward time correction gracefully
>
>     On a host that runs NTP, corrections can have a direct impact on
>     the background timer that we program on the behalf of a vcpu.
>
>     In particular, NTP performing a forward correction will result in
>     a timer expiring sooner than expected from a guest point of view.
>     Not a big deal, we kick the vcpu anyway.
>
>     But on wake-up, the vcpu thread is going to perform a check to
>     find out whether or not it should block. And at that point, the
>     timer check is going to say "timer has not expired yet, go back
>     to sleep". This results in the timer event being lost forever.
>
>     There are multiple ways to handle this. One would be record that
>     the timer has expired and let kvm_cpu_has_pending_timer return
>     true in that case, but that would be fairly invasive. Another is
>     to check for the "short sleep" condition in the hrtimer callback,
>     and restart the timer for the remaining time when the condition
>     is detected.
>
>     This patch implements the latter, with a bit of refactoring in
>     order to avoid too much code duplication.
>
>     Cc: <stable@xxxxxxxxxxxxxxx>
>     Reported-by: Alexander Graf <agraf@xxxxxxx>
>     Reviewed-by: Alexander Graf <agraf@xxxxxxx>
>     Signed-off-by: Marc Zyngier <marc.zyngier@xxxxxxx>
>     Signed-off-by: Christoffer Dall <christoffer.dall@xxxxxxxxxx>
>
>
> So to solve this issue there are two options:
>
>
> 1. Have another go at implementing support for CLOCK_MONOTONIC_RAW timers.
>    I don't know if that is feasible and I would be very happy to hear a feedback from you.
>
> 2. Also work this around in KVM. KVM does listen to changes in the timekeeping system
>   (kernel calls its update_pvclock_gtod), and it even notes rates of both regular and raw clocks.
>
>   When starting a HR timer I can adjust its period for the difference in rates, which will in most
>   cases produce more correct result that what we have now, but will still fail if the rate
>   is changed at the same time the timer is started or before it expires.
>
>   Or I can also restart the timer, although that might cause more harm than
>   good to the accuracy.
>
>
> What do you think?

Is this what the "adaptive tuning" in the local APIC TSC_DEADLINE
timer is all about (lapic_timer_advance_ns = -1)? If so, can we
leverage that for the VMX-preemption timer as well?
>
> Best regards,
>         Maxim Levitsky
>
>
>