Re: RFC: NTP adjustments interfere with KVM emulation of TSC deadline timers

Jim Mattson <jmattson@xxxxxxxxxx> · Thu, 16 May 2024 09:53:24 -0700

On Wed, May 15, 2024 at 2:03 PM Maxim Levitsky <mlevitsk@xxxxxxxxxx> wrote:
>
> On Tue, 2024-01-02 at 15:49 -0800, Jim Mattson wrote:
> > On Tue, Jan 2, 2024 at 2:21 PM Maxim Levitsky <mlevitsk@xxxxxxxxxx> wrote:
> > > On Thu, 2023-12-21 at 11:09 -0800, Jim Mattson wrote:
> > > > On Thu, Dec 21, 2023 at 8:52 AM Maxim Levitsky <mlevitsk@xxxxxxxxxx> wrote:
> > > > > Hi!
> > > > >
> > > > > Recently I was tasked with triage of the failures of 'vmx_preemption_timer'
> > > > > that happen in our kernel CI pipeline.
> > > > >
> > > > >
> > > > > The test usually fails because L2 observes TSC after the
> > > > > preemption timer deadline, before the VM exit happens.
> > > > >
> > > > > This happens because KVM emulates nested preemption timer with HR timers,
> > > > > so it converts the preemption timer value to nanoseconds, taking in account
> > > > > tsc scaling and host tsc frequency, and sets HR timer.
> > > > >
> > > > > HR timer however as I found out the hard way is bound to CLOCK_MONOTONIC,
> > > > > and thus its rate can be adjusted by NTP, which means that it can run slower or
> > > > > faster than KVM expects, which can result in the interrupt arriving earlier,
> > > > > or late, which is what is happening.
> > > > >
> > > > > This is how you can reproduce it on an Intel machine:
> > > > >
> > > > >
> > > > > 1. stop the NTP daemon:
> > > > >       sudo systemctl stop chronyd.service
> > > > > 2. introduce a small error in the system time:
> > > > >       sudo date -s "$(date)"
> > > > >
> > > > > 3. start NTP daemon:
> > > > >       sudo chronyd -d -n  (for debug) or start the systemd service again
> > > > >
> > > > > 4. run the vmx_preemption_timer test a few times until it fails:
> > > > >
> > > > >
> > > > > I did some research and it looks like I am not the first to encounter this:
> > > > >
> > > > > From the ARM side there was an attempt to support CLOCK_MONOTONIC_RAW with
> > > > > timer subsystem which was even merged but then reverted due to issues:
> > > > >
> > > > > https://lore.kernel.org/all/1452879670-16133-3-git-send-email-marc.zyngier@xxxxxxx/T/#u
> > > > >
> > > > > It looks like this issue was later worked around in the ARM code:
> > > > >
> > > > >
> > > > > commit 1c5631c73fc2261a5df64a72c155cb53dcdc0c45
> > > > > Author: Marc Zyngier <maz@xxxxxxxxxx>
> > > > > Date:   Wed Apr 6 09:37:22 2016 +0100
> > > > >
> > > > >     KVM: arm/arm64: Handle forward time correction gracefully
> > > > >
> > > > >     On a host that runs NTP, corrections can have a direct impact on
> > > > >     the background timer that we program on the behalf of a vcpu.
> > > > >
> > > > >     In particular, NTP performing a forward correction will result in
> > > > >     a timer expiring sooner than expected from a guest point of view.
> > > > >     Not a big deal, we kick the vcpu anyway.
> > > > >
> > > > >     But on wake-up, the vcpu thread is going to perform a check to
> > > > >     find out whether or not it should block. And at that point, the
> > > > >     timer check is going to say "timer has not expired yet, go back
> > > > >     to sleep". This results in the timer event being lost forever.
> > > > >
> > > > >     There are multiple ways to handle this. One would be record that
> > > > >     the timer has expired and let kvm_cpu_has_pending_timer return
> > > > >     true in that case, but that would be fairly invasive. Another is
> > > > >     to check for the "short sleep" condition in the hrtimer callback,
> > > > >     and restart the timer for the remaining time when the condition
> > > > >     is detected.
> > > > >
> > > > >     This patch implements the latter, with a bit of refactoring in
> > > > >     order to avoid too much code duplication.
> > > > >
> > > > >     Cc: <stable@xxxxxxxxxxxxxxx>
> > > > >     Reported-by: Alexander Graf <agraf@xxxxxxx>
> > > > >     Reviewed-by: Alexander Graf <agraf@xxxxxxx>
> > > > >     Signed-off-by: Marc Zyngier <marc.zyngier@xxxxxxx>
> > > > >     Signed-off-by: Christoffer Dall <christoffer.dall@xxxxxxxxxx>
> > > > >
> > > > >
> > > > > So to solve this issue there are two options:
> > > > >
> > > > >
> > > > > 1. Have another go at implementing support for CLOCK_MONOTONIC_RAW timers.
> > > > >    I don't know if that is feasible and I would be very happy to hear a feedback from you.
> > > > >
> > > > > 2. Also work this around in KVM. KVM does listen to changes in the timekeeping system
> > > > >   (kernel calls its update_pvclock_gtod), and it even notes rates of both regular and raw clocks.
> > > > >
> > > > >   When starting a HR timer I can adjust its period for the difference in rates, which will in most
> > > > >   cases produce more correct result that what we have now, but will still fail if the rate
> > > > >   is changed at the same time the timer is started or before it expires.
> > > > >
> > > > >   Or I can also restart the timer, although that might cause more harm than
> > > > >   good to the accuracy.
> > > > >
> > > > >
> > > > > What do you think?
> > > >
> > > > Is this what the "adaptive tuning" in the local APIC TSC_DEADLINE
> > > > timer is all about (lapic_timer_advance_ns = -1)?
> > >
> > > Hi,
> > >
> > > I don't think that 'lapic_timer_advance' is designed for that but it does
> > > mask this problem somewhat.
> > >
> > > The goal of 'lapic_timer_advance' is to decrease time between deadline passing and start
> > > of guest timer irq routine by making the deadline happen a bit earlier (by timer_advance_ns), and then busy-waiting
> > > (hopefully only a bit) until the deadline passes, and then immediately do the VM entry.
> > >
> > > This way instead of overhead of VM exit and VM entry that both happen after the deadline,
> > > only the VM entry happens after the deadline.
> > >
> > >
> > > In relation to NTP interference: If the deadline happens earlier than expected, then
> > > KVM will busy wait and decrease the 'timer_advance_ns', and next time the deadline
> > > will happen a bit later thus adopting for the NTP adjustment somewhat.
> > >
> > > Note though that 'timer_advance_ns' variable is unsigned and adjust_lapic_timer_advance can underflow
> > > it, which can be fixed.
> > >
> > > Now if the deadline happens later than expected, then the guest will see this happen,
> > > but at least adjust_lapic_timer_advance should increase the 'timer_advance_ns' so next
> > > time the deadline will happen earlier which will also eventually hide the problem.
> > >
> > > So overall I do think that implementing the 'lapic_timer_advance' for nested VMX preemption timer
> > > is a good idea, especially since this feature is not really nested in some sense - the timer is
> > > just delivered as a VM exit but it is always delivered to L1, so VMX preemption timer can
> > > be seen as just an extra L1's deadline timer.
> > >
> > > I do think that nested VMX preemption timer should use its own value of timer_advance_ns, thus
> > > we need to extract the common code and make both timers use it. Does this make sense?
> >
> > Alternatively, why not just use the hardware VMX-preemption timer to
> > deliver the virtual VMX-preemption timer?
> >
> > Today, I believe that we only use the hardware VMX-preemption timer to
> > deliver the virtual local APIC timer. However, it shouldn't be that
> > hard to pick the first deadline of {VMX-preemption timer, local APIC
> > timer} at each emulated VM-entry to L2.
>
> I assume that this is possible but it might add some complexity.
>
> AFAIK the design choice here was that L1 uses the hardware VMX preemption timer always,
> while L2 uses the software preemption timer which is relatively simple.
>
> I do agree that this might work and if it does work it might be even worthwhile
> change on its own.
>
> If you agree that this is a good idea, I can prepare a patch series for that.

I do think it would be worthwhile to provide the infrastructure for
multiple clients of the VMX-preemption timer. (Better yet would be to
provide a CLOCK_MONOTONIC_RAW hrtimer, but that's outwith our domain.)

> Note though that the same problem (although somewhat masked by lapic_timer_advance)
> does exit on AMD as well because while AMD lacks both VMX preemption timer and even the TSC deadline timer,
> KVM exposes TSC deadline to the guest, and HR timers are always used for its emulation,
> and are prone to NTP interference as I discovered.

It's not a problem if userspace ignores KVM's claim to support TSC deadline. :)

> Best regards,
>         Maxim Levitsky
>
> >
> > > Best regards,
> > >         Maxim Levitsky
> > >
> > >
> > > >  If so, can we
> > > > leverage that for the VMX-preemption timer as well?
> > > > > Best regards,
> > > > >         Maxim Levitsky
> > > > >
> > > > >
> > > > >
> > >
> > >
> > >
>
>
>
>