Hi David,

On 11/8/23 07:25, David Woodhouse wrote:
> On Tue, 2023-11-07 at 17:43 -0800, Dongli Zhang wrote:
>> Hi David,
>>
>> On 11/7/23 15:24, David Woodhouse wrote:
>>> On Tue, 2023-11-07 at 15:07 -0800, Dongli Zhang wrote:
>>>> Thank you very much for the detailed explanation.
>>>>
>>>> I agree it is important to resolve the "now" problem. I guess the KVM lapic
>>>> deadline timer has the "now" problem as well.
>>>
>>> I think so. And quite gratuitously so, since it just does:
>>>
>>>     now = ktime_get();
>>>     guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
>>>
>>> Couldn't that trivially be changed to kvm_get_monotonic_and_clockread()?
>>
>> The core idea is to always capture the pair of (tsc, ns) at exactly the same
>> time point.
>>
>> I have no idea how much accuracy it can improve, considering the extra cost
>> of injecting the timer interrupt into the vCPU.
>
> Right. It's probably in the noise most of the time, unless you're
> unlucky enough to get preempted between the two TSC reads which are
> supposed to be happening "at the same time".
>
>>> I conveniently ducked this question in my patch by only supporting the
>>> CONSTANT_TSC case, and not the case where we happen to know the
>>> (potentially different) TSC frequencies on all the different pCPUs and
>>> vCPUs.
>>
>> This is also my question: why support only the CONSTANT_TSC case?
>>
>> For the lapic timer case:
>>
>> The timer is always calculated based on the *current* vCPU's TSC
>> virtualization, regardless of CONSTANT_TSC or not.
>>
>> For the xen timer case:
>>
>> Why not always calculate the expiry based on the *current* vCPU's time
>> virtualization? That is, why not always use the current vCPU's hv_clock,
>> regardless of CONSTANT_TSC/masterclock?
>
> The simple answer is because I wasn't sure it would work correctly in
> all cases, and didn't *care* enough about the non-CONSTANT_TSC case to
> prove it to myself.
>
> Let's think about it...
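To make the "(tsc, ns) at exactly the same time point" idea concrete for myself, here is a minimal standalone sketch. The names and the snapshot structure are my own made-up illustration, not the actual kvm_get_monotonic_and_clockread() code: the point is only that both values are derived from a *single* TSC read, so preemption cannot split them apart the way two independent reads can be split.

```c
#include <stdint.h>

/*
 * Hypothetical sketch only: derive the (ns, tsc) pair from one TSC read.
 * A snapshot taken at some earlier instant records that, at base_tsc,
 * the monotonic clock read base_ns. (Overflow is ignored for brevity.)
 */
struct clock_snapshot {
	uint64_t base_ns;
	uint64_t base_tsc;
	uint64_t tsc_khz;	/* TSC frequency in kHz */
};

struct clock_pair {
	uint64_t ns;		/* monotonic nanoseconds */
	uint64_t tsc;		/* the TSC value the ns was derived from */
};

/* Both members come from the same tsc_now, so they match by construction. */
static struct clock_pair read_pair(const struct clock_snapshot *snap,
				   uint64_t tsc_now)
{
	struct clock_pair p;

	p.tsc = tsc_now;
	p.ns = snap->base_ns +
	       (tsc_now - snap->base_tsc) * 1000000ULL / snap->tsc_khz;
	return p;
}
```

With a 1 GHz TSC (tsc_khz = 1000000), one tick is one nanosecond, so the pair stays consistent no matter where the caller is preempted afterwards.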
>
> In the non-CONSTANT_TSC case, each physical CPU can have a different
> TSC frequency, yes? And KVM has a cpufreq notifier which triggers when
> the TSC frequency changes, and makes a KVM_REQ_CLOCK_UPDATE request to
> any vCPU running on the affected pCPU, with an associated IPI to ensure
> the vCPU exits guest mode and will process the update before executing
> any further guest code.
>
> If a vCPU had *previously* been running on the affected pCPU but wasn't
> running when the notifier happened, then kvm_arch_vcpu_load() will make
> a KVM_REQ_GLOBAL_CLOCK_UPDATE request, which will immediately update
> the vCPU in question, and then trigger a deferred KVM_REQ_CLOCK_UPDATE
> for the others.
>
> So the vCPU itself, in guest mode, is always going to see *correct*
> pvclock information corresponding to the pCPU it is running on at the
> time.
>
> (I *believe* the way this works is that when a vCPU runs on a pCPU
> which has a TSC frequency lower than the vCPU should have, it runs in
> 'always catchup' mode, where the TSC offset is bumped *every* time the
> vCPU enters guest mode. So the TSC is about right on every entry, might
> seem to run a little slow if the vCPU does a tight loop of rdtsc, but
> will catch up again on the next vmexit/entry?)
>
> But we aren't talking about the vCPU running in guest mode. The code in
> kvm_xen_start_timer() and in start_sw_tscdeadline() is running in the
> host kernel. How can we be sure that it's running on the *same*
> physical CPU that the vCPU was previously running on, and thus how can
> we be sure that vcpu->arch.hv_clock is valid with respect to a
> rdtsc on the current pCPU? I don't know that we can know that.
>
> As far as I can tell, the code in start_sw_tscdeadline() makes no
> attempt to do the 'catchup' thing, and just converts the pCPU's TSC to
> guest TSC using kvm_read_l1_tsc() — which uses a multiplier that's set
> once and *never* recalculated when the host TSC frequency changes.
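If I understand the 'always catchup' speculation above correctly, the arithmetic on each guest entry would be something like the sketch below. This is only my hypothetical illustration of the concept, with made-up names, not the real KVM catchup code:

```c
#include <stdint.h>

/*
 * Hypothetical sketch of TSC catchup: given the wall time elapsed since
 * the vCPU started and its advertised frequency, compute the offset that
 * must be added to the (slower) host TSC so that the guest reads a TSC
 * consistent with elapsed time. Recomputing this on every guest entry is
 * what makes the guest TSC "about right on every entry".
 */
static uint64_t catchup_offset(uint64_t elapsed_ns,
			       uint64_t guest_tsc_khz,
			       uint64_t host_tsc_now)
{
	/* TSC value the guest *should* see after elapsed_ns. */
	uint64_t target_guest_tsc = elapsed_ns * guest_tsc_khz / 1000000ULL;

	/* Offset added to the host TSC so the guest reads the target. */
	return target_guest_tsc - host_tsc_now;
}
```

For example, after 1 ms on a vCPU advertised at 2 GHz, the guest should see 2,000,000 ticks; if the host TSC has only advanced 1,500,000 ticks, the offset becomes 500,000.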
>
> On the whole, now I *have* thought about it, I'm even more convinced I
> was right in the first place that I didn't want to know :)
>
> I think I stand by my original decision that the Xen timer code in the
> non-CONSTANT_TSC case can just use get_kvmclock_ns(). The "now" problem
> is going to be in the noise if the TSC isn't constant anyway, and we
> need to fix the drift and jumps of get_kvmclock_ns() *anyway* rather
> than adding a temporary special case for the Xen timers.
>
>> That is: the kvm lapic method with kvm_get_monotonic_and_clockread().
>>
>>>> E.g., according to the KVM lapic deadline timer, all values are based
>>>> on (1) the tsc value, (2) the current vCPU.
>>>>
>>>> static void start_sw_tscdeadline(struct kvm_lapic *apic)
>>>> {
>>>>         struct kvm_timer *ktimer = &apic->lapic_timer;
>>>>         u64 guest_tsc, tscdeadline = ktimer->tscdeadline;
>>>>         u64 ns = 0;
>>>>         ktime_t expire;
>>>>         struct kvm_vcpu *vcpu = apic->vcpu;
>>>>         unsigned long this_tsc_khz = vcpu->arch.virtual_tsc_khz;
>>>>         unsigned long flags;
>>>>         ktime_t now;
>>>>
>>>>         if (unlikely(!tscdeadline || !this_tsc_khz))
>>>>                 return;
>>>>
>>>>         local_irq_save(flags);
>>>>
>>>>         now = ktime_get();
>>>>         guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
>>>>
>>>>         ns = (tscdeadline - guest_tsc) * 1000000ULL;
>>>>         do_div(ns, this_tsc_khz);
>>>>
>>>> Sorry if I make the question very confusing. The core question is:
>>>> where and from which clocksource does the abs nanosecond value come?
>>>> What will happen if the Xen VM uses HPET as the clocksource, while
>>>> using the xen timer as the clock event?
>>>
>>> If the guest uses HPET as clocksource and the Xen timer as clockevents,
>>> then keeping itself in sync is the *guest's* problem. The Xen timer is
>>> defined in terms of nanoseconds since guest start, as provided in the
>>> pvclock information described above. Hope that helps!
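For reference, the conversion that start_sw_tscdeadline() performs can be read as the standalone sketch below. It is just the same arithmetic in plain C for illustration (the kernel uses do_div() for the 64-bit division):

```c
#include <stdint.h>

/*
 * Sketch of the start_sw_tscdeadline() conversion: the distance to the
 * TSC deadline, in guest TSC ticks, scaled to nanoseconds by the vCPU's
 * virtual TSC frequency in kHz (ns = ticks * 1000000 / khz).
 */
static uint64_t tsc_deadline_to_ns(uint64_t tscdeadline,
				   uint64_t guest_tsc,
				   uint64_t this_tsc_khz)
{
	/* Deadline already passed, or no known frequency: nothing to wait. */
	if (tscdeadline <= guest_tsc || this_tsc_khz == 0)
		return 0;

	return (tscdeadline - guest_tsc) * 1000000ULL / this_tsc_khz;
}
```

At a 1 GHz virtual TSC (this_tsc_khz = 1000000), a deadline 2,000,000 ticks in the future comes out as 2,000,000 ns.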
>>>
>>
>> The "in terms of nanoseconds since guest start" refers to *one* global
>> value. Should we use the wallclock when we are referring to a global
>> value shared by all vCPUs?
>>
>> Based on the following piece of code, I do not think we may assume all
>> vCPUs have the same pvclock at the same time point: lines 104-108, when
>> PVCLOCK_TSC_STABLE_BIT is not set.
>
> The *result* of calculating the pvclock should be the same on all vCPUs
> at any given moment in time.
>
> The precise *calculation* may differ, depending on the frequency of the
> TSC for that particular vCPU and the last time the pvclock information
> was created for that vCPU.
>
>>  67 static __always_inline
>>  68 u64 __pvclock_clocksource_read(struct pvclock_vcpu_time_info *src, bool dowd)
>>  69 {
>>  70         unsigned version;
>>  71         u64 ret;
>>  72         u64 last;
>>  73         u8 flags;
>>  74
>>  75         do {
>>  76                 version = pvclock_read_begin(src);
>>  77                 ret = __pvclock_read_cycles(src, rdtsc_ordered());
>>  78                 flags = src->flags;
>>  79         } while (pvclock_read_retry(src, version));
>> ... ...
>> 104         last = raw_atomic64_read(&last_value);
>> 105         do {
>> 106                 if (ret <= last)
>> 107                         return last;
>> 108         } while (!raw_atomic64_try_cmpxchg(&last_value, &last, ret));
>> 109
>> 110         return ret;
>> 111 }
>>
>> That's why I would appreciate a definition of the abs nanoseconds used
>> by the xen timer (e.g., derived from pvclock). If it is per-vCPU, we may
>> not use it for a global "in terms of nanoseconds since guest start" when
>> PVCLOCK_TSC_STABLE_BIT is not set.
>
> It is only per-vCPU if the vCPUs have *different* TSC frequencies.
> That's because of the scaling; the guest calculates the nanoseconds
> from the *guest* TSC of course, scaled according to the pvclock
> information given to the guest by KVM.
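The lines 104-108 pattern above, the global monotonicity clamp used when PVCLOCK_TSC_STABLE_BIT is not set, can be sketched standalone with C11 atomics. This is my simplified illustration of the idea in __pvclock_clocksource_read(), not the kernel code itself:

```c
#include <stdatomic.h>
#include <stdint.h>

/*
 * Sketch of the global monotonicity clamp: per-vCPU pvclock readings may
 * disagree when the TSC is not stable, so a single shared last_value
 * ensures the returned clock never goes backwards across vCPUs.
 */
static _Atomic uint64_t last_value;

static uint64_t clamp_monotonic(uint64_t ret)
{
	uint64_t last = atomic_load(&last_value);

	do {
		/* Another vCPU already returned a later time: report that. */
		if (ret <= last)
			return last;
	} while (!atomic_compare_exchange_weak(&last_value, &last, ret));

	return ret;
}
```

The compare-exchange loop means a stale reading is clamped to the newest value any vCPU has published, which is exactly why two vCPUs may compute *different* raw values yet never observe time moving backwards.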
>
> As discussed and demonstrated by http://david.woodhou.se/tsdrift.c , if
> KVM scales directly to nanoseconds from the *host* TSC at its known
> frequency, that introduces a systemic drift between what the guest
> calculates and what KVM calculates — even in the CONSTANT_TSC case.
>
> How do we reconcile the two? Well, it makes no sense for the definition
> of the pvclock to be something that the guest *cannot* calculate, so
> obviously KVM must do the same calculations the guest does: scale to
> the guest TSC (kvm_read_l1_tsc()) and then apply the same pvclock
> information from vcpu->arch.hv_clock to get the nanoseconds.
>
> In the sane world where the guest vCPUs all have the *same* TSC
> frequency, that's fine. The kvmclock isn't *really* per-vCPU because
> they're all the same.
>
> If the VMM sets up different vCPUs to have different TSC frequencies
> then yes, their kvmclocks will drift slightly apart over time. That
> might be the *one* case where I will accept that the guest pvclock
> might ever change, even in the CONSTANT_TSC environment (without host
> suspend or any other nonsense).
>
> In that patch I started typing on Monday, and *still* haven't got round
> to finishing because other things keep catching fire, I'm using the
> *KVM-wide* guest TSC frequency as the definition for the kvmclock.

Thank you very much for the explanation.

I understand you may use different methods to obtain the 'expire' value in
different cases. Maybe add some comments to the KVM code of the xen timer
emulation? E.g.:

- When the TSC is reliable, follow the standard/protocol that the xen timer
  is per-vCPU pvclock based: that is, always scale host_tsc with
  kvm_read_l1_tsc().

- However, sometimes the TSC is not reliable. In that case, use the legacy
  method get_kvmclock_ns().

This may help developers understand the standard/protocol used by the xen
timer.
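(As a side note, to convince myself about the systemic drift above, I tried a small sketch in the spirit of tsdrift.c. It is not the original program, and the frequency is made up: the point is that a pvclock-style 32-bit fixed-point multiplier truncates, so the fixed-point result falls progressively behind exact full-width scaling as the TSC advances, even with a perfectly constant TSC.)

```c
#include <stdint.h>

/* pvclock-style conversion: ns = (ticks * mul) >> 32 */
static uint64_t pvclock_scale(uint64_t ticks, uint32_t mul)
{
	return (uint64_t)(((unsigned __int128)ticks * mul) >> 32);
}

/* Fixed-point multiplier for a TSC frequency in kHz, truncated to 32 bits. */
static uint32_t pvclock_mul(uint64_t tsc_khz)
{
	return (uint32_t)(((unsigned __int128)1000000ULL << 32) / tsc_khz);
}

/* Exact conversion with full-width arithmetic, for comparison. */
static uint64_t exact_ns(uint64_t ticks, uint64_t tsc_khz)
{
	return (uint64_t)((unsigned __int128)ticks * 1000000ULL / tsc_khz);
}
```

With an awkward frequency such as 2999999 kHz, the truncated multiplier loses a fraction of a nanosecond per tick, and the gap between the two conversions grows without bound as ticks accumulate.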
The core idea will be: the implementation tries to follow the xen timer
nanoseconds definition (per-vCPU pvclock), and it may use other legacy
solutions in special cases, in order to improve the accuracy.

TBH, I never thought about what the definition of a nanosecond is in the xen
timer (even though I used to work on, and am still working on, some xen
issues).

Thank you very much!

Dongli Zhang