Hi!

Recently I was tasked with triaging the failures of the
'vmx_preemption_timer' test that happen in our kernel CI pipeline.

The test usually fails because L2 observes a TSC value past the preemption
timer deadline before the VM exit happens.

This happens because KVM emulates the nested preemption timer with HR timers:
it converts the preemption timer value to nanoseconds, taking into account
TSC scaling and the host TSC frequency, and arms an HR timer.

The HR timer, however, as I found out the hard way, is bound to
CLOCK_MONOTONIC, and thus its rate can be adjusted by NTP. This means it can
run slower or faster than KVM expects, so the interrupt can arrive early or
late. The late case is what is happening here.

This is how you can reproduce it on an Intel machine:

1. Stop the NTP daemon:
      sudo systemctl stop chronyd.service
2. Introduce a small error in the system time:
      sudo date -s "$(date)"
3. Start the NTP daemon:
      sudo chronyd -d -n
   (for debug), or start the systemd service again.
4. Run the vmx_preemption_timer test a few times until it fails.

I did some research and it looks like I am not the first to encounter this.

From the ARM side there was an attempt to support CLOCK_MONOTONIC_RAW in the
timer subsystem, which was even merged but then reverted due to issues:

https://lore.kernel.org/all/1452879670-16133-3-git-send-email-marc.zyngier@xxxxxxx/T/#u

It looks like this issue was later worked around in the ARM code:

commit 1c5631c73fc2261a5df64a72c155cb53dcdc0c45
Author: Marc Zyngier <maz@xxxxxxxxxx>
Date:   Wed Apr 6 09:37:22 2016 +0100

    KVM: arm/arm64: Handle forward time correction gracefully

    On a host that runs NTP, corrections can have a direct impact on
    the background timer that we program on the behalf of a vcpu.

    In particular, NTP performing a forward correction will result in
    a timer expiring sooner than expected from a guest point of view.
    Not a big deal, we kick the vcpu anyway.

    But on wake-up, the vcpu thread is going to perform a check to
    find out whether or not it should block. And at that point, the
    timer check is going to say "timer has not expired yet, go back
    to sleep". This results in the timer event being lost forever.

    There are multiple ways to handle this. One would be record that
    the timer has expired and let kvm_cpu_has_pending_timer return
    true in that case, but that would be fairly invasive. Another is
    to check for the "short sleep" condition in the hrtimer callback,
    and restart the timer for the remaining time when the condition
    is detected.

    This patch implements the latter, with a bit of refactoring in
    order to avoid too much code duplication.

    Cc: <stable@xxxxxxxxxxxxxxx>
    Reported-by: Alexander Graf <agraf@xxxxxxx>
    Reviewed-by: Alexander Graf <agraf@xxxxxxx>
    Signed-off-by: Marc Zyngier <marc.zyngier@xxxxxxx>
    Signed-off-by: Christoffer Dall <christoffer.dall@xxxxxxxxxx>

So to solve this issue there are two options:

1. Have another go at implementing support for CLOCK_MONOTONIC_RAW timers.
   I don't know if that is feasible, and I would be very happy to hear your
   feedback.

2. Work around this in KVM as well. KVM does listen to changes in the
   timekeeping system (the kernel calls its update_pvclock_gtod), and it
   even notes the rates of both the regular and the raw clocks.

   When starting an HR timer I can adjust its period for the difference in
   rates, which will in most cases produce a more correct result than what
   we have now, but will still fail if the rate changes at the same time the
   timer is started or before it expires.

   Or I can also restart the timer, although that might cause more harm than
   good to the accuracy.

What do you think?

Best regards,
	Maxim Levitsky
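
P.S. To make the period adjustment in option 2 concrete, here is a rough
user-space model of the arithmetic in Python. It is only a sketch, not kernel
code: the function names are made up, the shift of 5 and the kHz-based
conversion are assumptions about how the emulated timer is converted to
nanoseconds, and the mult values are invented numbers standing in for the
rates of the regular and raw clocks.

```python
# Rough user-space model of option 2; none of these names exist in the kernel.

EMULATED_TIMER_SHIFT = 5  # assumption: the timer ticks once per 2**5 TSC cycles


def preemption_timer_to_raw_ns(timer_value, guest_tsc_khz):
    """Convert a preemption timer value to nanoseconds of unadjusted time."""
    tsc_cycles = timer_value << EMULATED_TIMER_SHIFT
    # cycles / kHz yields milliseconds, so scale by 1e6 first to get ns.
    return tsc_cycles * 1_000_000 // guest_tsc_khz


def raw_ns_to_monotonic_ns(raw_ns, mono_mult, raw_mult):
    """Scale a raw-clock interval to what CLOCK_MONOTONIC will count for it.

    Assumes both clocks convert cycles to ns as (cycles * mult) >> shift with
    the same shift, so the rate ratio is simply mono_mult / raw_mult.
    """
    return raw_ns * mono_mult // raw_mult


# Hypothetical numbers: 2 GHz guest TSC, CLOCK_MONOTONIC slewed 500 ppm fast.
raw_ns = preemption_timer_to_raw_ns(timer_value=62_500, guest_tsc_khz=2_000_000)
adjusted_ns = raw_ns_to_monotonic_ns(raw_ns, mono_mult=1_000_500, raw_mult=1_000_000)
print(raw_ns, adjusted_ns)  # prints "1000000 1000500"
```

With the regular clock running fast, the unadjusted 1 ms timer would fire about
500 ns early; with it running slow, the same scaling shortens the period so
the timer does not fire late. The model does not cover the rate changing
while the timer is already armed.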