Hi David, On 11/7/23 15:24, David Woodhouse wrote: > On Tue, 2023-11-07 at 15:07 -0800, Dongli Zhang wrote: >> Thank you very much for the detailed explanation. >> >> I agree it is important to resolve the "now" problem. I guess the KVM lapic >> deadline timer has the "now" problem as well. > > I think so. And quite gratuitously so, since it just does: > > now = ktime_get(); > guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc()); > > > Couldn't that trivially be changed to kvm_get_monotonic_and_clockread()? The core idea is to always capture the pair of (tsc, ns) at exactly the same time point. I have no idea how much accuracy it can improve, considering the extra costs to inject the timer interrupt into the vCPU. > > Thankfully, it's defined in the time domain of the guest TSC, not the > kvmclock, so it doesn't suffer the same drift issue as the Xen timer. > >> I just notice my question missed a key prerequisite: >> >> Would you mind helping explain the time domain of the "oneshot.timeout_abs_ns"? >> >> While it is the absolute nanosecond value at the VM side, on which time domain >> it is based? > > It's the kvmclock. Xen offers as Xen PV clock to its guests using > *precisely* the same pvclock structure as KVM does. > > >> 1. Is oneshot.timeout_abs_ns based on the xen pvclock (freq=NSEC_PER_SEC)? >> >> 2. Is oneshot.timeout_abs_ns based on tsc from VM side? >> >> 3. Is oneshot.timeout_abs_ns based on monotonic/raw clock at VM side? >> >> 4. Or it is based on wallclock? >> >> I think the OS does not have a concept of nanoseconds. It is derived from a >> clocksource. > > It's the kvmclock. Thank you very much! Both xen clock and kvm clock are pvclock based on the same equations. > > The guest derives it from the guest TSC using the pvclock information > (mul/shift/offset) that KVM provides to the guest. > > The kvm_setup_guest_pvclock() function is potentially called *three* > times from kvm_guest_time_update(). Once for the KVM pv time MSR, once > for the pvclock structure in the Xen vcpu_info, and finally for the > pvclock structure which Xen makes available to userspace for vDSO > timekeeping. > >> If it is based on pvclock, is it based on the pvclock from a specific vCPU, as >> both pvclock and timer are per-vCPU. > > Yes, it is per-vCPU. Although in the sane case the TSCs on all vCPUs > will match and the mul/shift/offset provided by KVM won't actually > differ. Even in the insane case where guest TSCs are out of sync, > surely the pvclock information will differ only in order to ensure that > the *result* in nanoseconds does not? > > I conveniently ducked this question in my patch by only supporting the > CONSTANT_TSC case, and not the case where we happen to know the > (potentially different) TSC frequencies on all the different pCPUs and > vCPUs. This is also my question that why to support only the CONSTANT_TSC case. For the lapic timer case: The timer is always calculated based on the *current* vCPU's tsc virtualization, regardless CONSTANT_TSC or not. For the xen timer case: Why not always calculate the expire based on the *current* vCPU's time virtualization? That is, why not always use the current vCPU's hv_clock, regardless CONSTANT_TSC/masteclock? That is: kvm lapic method with kvm_get_monotonic_and_clockread(). > > >> >> E.g., according to the KVM lapic deadline timer, all values are based on (1) the >> tsc value, (2)on the current vCPU. >> >> >> 1949 static void start_sw_tscdeadline(struct kvm_lapic *apic) >> 1950 { >> 1951 struct kvm_timer *ktimer = &apic->lapic_timer; >> 1952 u64 guest_tsc, tscdeadline = ktimer->tscdeadline; >> 1953 u64 ns = 0; >> 1954 ktime_t expire; >> 1955 struct kvm_vcpu *vcpu = apic->vcpu; >> 1956 unsigned long this_tsc_khz = vcpu->arch.virtual_tsc_khz; >> 1957 unsigned long flags; >> 1958 ktime_t now; >> 1959 >> 1960 if (unlikely(!tscdeadline || !this_tsc_khz)) >> 1961 return; >> 1962 >> 1963 local_irq_save(flags); >> 1964 >> 1965 now = ktime_get(); >> 1966 guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc()); >> 1967 >> 1968 ns = (tscdeadline - guest_tsc) * 1000000ULL; >> 1969 do_div(ns, this_tsc_khz); >> >> >> Sorry if I make the question very confusing. The core question is: where and >> from which clocksource the abs nanosecond value is from? What will happen if the >> Xen VM uses HPET as clocksource, while xen timer as clock event? > > If the guest uses HPET as clocksource and Xen timer as clockevents, > then keeping itself in sync is the *guest's* problem. The Xen timer is > defined in terms of nanoseconds since guest start, as provided in the > pvclock information described above. Hope that helps! > The "in terms of nanoseconds since guest start" refers to *one* global value. Should we use wallclock when we are referring to a global value shared by all vCPUs? Based on the following piece of code, I do not think we may assume all vCPUs have the same pvclock at the same time point: line 104-108, when PVCLOCK_TSC_STABLE_BIT is not set. 67 static __always_inline 68 u64 __pvclock_clocksource_read(struct pvclock_vcpu_time_info *src, bool dowd) 69 { 70 unsigned version; 71 u64 ret; 72 u64 last; 73 u8 flags; 74 75 do { 76 version = pvclock_read_begin(src); 77 ret = __pvclock_read_cycles(src, rdtsc_ordered()); 78 flags = src->flags; 79 } while (pvclock_read_retry(src, version)); ... ... 104 last = raw_atomic64_read(&last_value); 105 do { 106 if (ret <= last) 107 return last; 108 } while (!raw_atomic64_try_cmpxchg(&last_value, &last, ret)); 109 110 return ret; 111 } That's why I appreciate a definition of the abs nanoseconds used by the xen timer (e.g., derived from pvclock). If it is per-vCPU, we may not use it for a global "in terms of nanoseconds since guest start", when PVCLOCK_TSC_STABLE_BIT is not set. Thank you very much! Dongli Zhang