On Thu, Dec 21, 2023 at 8:52 AM Maxim Levitsky <mlevitsk@xxxxxxxxxx> wrote:
>
> Hi!
>
> Recently I was tasked with triage of the failures of 'vmx_preemption_timer'
> that happen in our kernel CI pipeline.
>
> The test usually fails because L2 observes TSC after the
> preemption timer deadline, before the VM exit happens.
>
> This happens because KVM emulates the nested preemption timer with HR timers,
> so it converts the preemption timer value to nanoseconds, taking into account
> tsc scaling and host tsc frequency, and sets an HR timer.
>
> The HR timer, however, as I found out the hard way, is bound to CLOCK_MONOTONIC,
> and thus its rate can be adjusted by NTP, which means that it can run slower or
> faster than KVM expects, which can result in the interrupt arriving early
> or late; the latter is what is happening.
>
> This is how you can reproduce it on an Intel machine:
>
> 1. stop the NTP daemon:
>       sudo systemctl stop chronyd.service
> 2. introduce a small error in the system time:
>       sudo date -s "$(date)"
>
> 3. start NTP daemon:
>       sudo chronyd -d -n (for debug) or start the systemd service again
>
> 4. run the vmx_preemption_timer test a few times until it fails:
>
> I did some research and it looks like I am not the first to encounter this:
>
> From the ARM side there was an attempt to support CLOCK_MONOTONIC_RAW with
> the timer subsystem, which was even merged but then reverted due to issues:
>
> https://lore.kernel.org/all/1452879670-16133-3-git-send-email-marc.zyngier@xxxxxxx/T/#u
>
> It looks like this issue was later worked around in the ARM code:
>
> commit 1c5631c73fc2261a5df64a72c155cb53dcdc0c45
> Author: Marc Zyngier <maz@xxxxxxxxxx>
> Date: Wed Apr 6 09:37:22 2016 +0100
>
>     KVM: arm/arm64: Handle forward time correction gracefully
>
>     On a host that runs NTP, corrections can have a direct impact on
>     the background timer that we program on the behalf of a vcpu.
>
>     In particular, NTP performing a forward correction will result in
>     a timer expiring sooner than expected from a guest point of view.
>     Not a big deal, we kick the vcpu anyway.
>
>     But on wake-up, the vcpu thread is going to perform a check to
>     find out whether or not it should block. And at that point, the
>     timer check is going to say "timer has not expired yet, go back
>     to sleep". This results in the timer event being lost forever.
>
>     There are multiple ways to handle this. One would be record that
>     the timer has expired and let kvm_cpu_has_pending_timer return
>     true in that case, but that would be fairly invasive. Another is
>     to check for the "short sleep" condition in the hrtimer callback,
>     and restart the timer for the remaining time when the condition
>     is detected.
>
>     This patch implements the latter, with a bit of refactoring in
>     order to avoid too much code duplication.
>
>     Cc: <stable@xxxxxxxxxxxxxxx>
>     Reported-by: Alexander Graf <agraf@xxxxxxx>
>     Reviewed-by: Alexander Graf <agraf@xxxxxxx>
>     Signed-off-by: Marc Zyngier <marc.zyngier@xxxxxxx>
>     Signed-off-by: Christoffer Dall <christoffer.dall@xxxxxxxxxx>
>
> So to solve this issue there are two options:
>
> 1. Have another go at implementing support for CLOCK_MONOTONIC_RAW timers.
>    I don't know if that is feasible and I would be very happy to hear feedback from you.
>
> 2. Also work around this in KVM. KVM does listen to changes in the timekeeping system
>    (the kernel calls its update_pvclock_gtod), and it even notes the rates of both the regular and raw clocks.
>
>    When starting an HR timer I can adjust its period for the difference in rates, which will in most
>    cases produce a more correct result than what we have now, but will still fail if the rate
>    is changed at the same time the timer is started or before it expires.
>
>    Or I can also restart the timer, although that might cause more harm than
>    good to the accuracy.
>
> What do you think?

Is this what the "adaptive tuning" in the local APIC TSC_DEADLINE timer
is all about (lapic_timer_advance_ns = -1)? If so, can we leverage that
for the VMX-preemption timer as well?

>
> Best regards,
> Maxim Levitsky
>
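
To make sure I am reading option 2 and the ARM-style "short sleep" check
the same way you are, here is a rough userspace sketch of the arithmetic
only. It is not KVM code. Every name in it (tsc_ticks_to_raw_ns,
raw_ns_to_mono_ns, remaining_after_early_expiry, mono_mult, raw_mult) is
made up for illustration, and it assumes the regular and raw clocks are
described by multipliers that share the same shift, so that their ratio
is the rate difference.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: convert a TSC tick count to nanoseconds of
 * unadjusted ("raw") time, given the TSC frequency in kHz.
 * ns = ticks * 1e6 / tsc_khz, split into quotient and remainder so the
 * intermediate product does not overflow for large tick counts.
 */
static uint64_t tsc_ticks_to_raw_ns(uint64_t ticks, uint64_t tsc_khz)
{
	assert(tsc_khz != 0);
	return (ticks / tsc_khz) * 1000000ULL +
	       (ticks % tsc_khz) * 1000000ULL / tsc_khz;
}

/*
 * Hypothetical helper for option 2: rescale a raw-time interval into the
 * interval an NTP-adjusted CLOCK_MONOTONIC timer has to be armed with.
 * Assumption: mono_mult and raw_mult use the same shift, so
 * mono_ns = raw_ns * mono_mult / raw_mult.
 */
static uint64_t raw_ns_to_mono_ns(uint64_t raw_ns, uint32_t mono_mult,
				  uint32_t raw_mult)
{
	assert(raw_mult != 0);
	return (raw_ns / raw_mult) * mono_mult +
	       (raw_ns % raw_mult) * mono_mult / raw_mult;
}

/*
 * Hypothetical "short sleep" check, in the spirit of the ARM commit
 * quoted above: when the timer callback runs, compare the current TSC
 * with the deadline. Returns 0 if the deadline has been reached,
 * otherwise the raw nanoseconds still left, which the caller would use
 * to re-arm the timer.
 */
static uint64_t remaining_after_early_expiry(uint64_t now_tsc,
					     uint64_t deadline_tsc,
					     uint64_t tsc_khz)
{
	if (now_tsc >= deadline_tsc)
		return 0;
	return tsc_ticks_to_raw_ns(deadline_tsc - now_tsc, tsc_khz);
}
```

If I have this right, the re-arm check only covers the case where the
timer fires early. The failure you describe is the timer firing late, so
on its own it would not help there, and the rate scaling still misses
any rate change that lands after the timer has been armed.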