Re: [patch 3/3] PTP: add kvm PTP driver

Radim Krcmar <rkrcmar@xxxxxxxxxx> · Mon, 16 Jan 2017 18:46:49 +0100

2017-01-16 15:04-0200, Marcelo Tosatti:
> On Mon, Jan 16, 2017 at 05:26:53PM +0100, Radim Krcmar wrote:
>> 2017-01-13 15:40-0200, Marcelo Tosatti:
>> > On Fri, Jan 13, 2017 at 04:56:58PM +0100, Radim Krcmar wrote:
>> > > 2017-01-13 10:01-0200, Marcelo Tosatti:
>> >> > +		version = pvclock_read_begin(src);
>> >> > +
>> >> > +		ret = kvm_hypercall2(KVM_HC_CLOCK_OFFSET,
>> >> > +				     clock_off_gpa,
>> >> > +				     KVM_CLOCK_OFFSET_WALLCLOCK);
>> >> > +		if (ret != 0) {
>> >> > +			pr_err("clock offset hypercall ret %lu\n", ret);
>> >> > +			spin_unlock(&kvm_ptp_lock);
>> >> > +			preempt_enable_notrace();
>> >> > +			return -EOPNOTSUPP;
>> >> > +		}
>> >> > +
>> >> > +		tspec.tv_sec = clock_off.sec;
>> >> > +		tspec.tv_nsec = clock_off.nsec;
>> >> > +
>> >> > +		delta = rdtsc_ordered() - clock_off.tsc;
>> >> > +
>> >> > +		offset = pvclock_scale_delta(delta, src->tsc_to_system_mul,
>> >> > +					     src->tsc_shift);
>> >> > +
>> >> > +	} while (pvclock_read_retry(src, version));
>> >> > +
>> >> > +	preempt_enable_notrace();
>> >> > +
>> >> > +	tspec.tv_nsec = tspec.tv_nsec + offset;
>> >> > +
>> >> > +	spin_unlock(&kvm_ptp_lock);
>> >> > +
>> >> > +	if (tspec.tv_nsec >= NSEC_PER_SEC) {
>> >> > +		u64 secs = tspec.tv_nsec;
>> >> > +
>> >> > +		tspec.tv_nsec = do_div(secs, NSEC_PER_SEC);
>> >> > +		tspec.tv_sec += secs;
>> >> > +	}
>> >> > +
>> >> > +	memcpy(ts, &tspec, sizeof(struct timespec64));
>> >> 
>> >> But the whole idea is of improving the time by reading tsc a bit later
>> >> is just weird ... why is it better to provide
>> >> 
>> >>   tsc + x, time + tsc_delta_to_time(x)
>> >> 
>> >> than just
>> >> 
>> >>  tsc, time
>> >> 
>> >> ?
>> > 
>> > Because you want to calculate the value of the host realtime clock 
>> > at the moment of ptp_kvm_gettime.
>> > 
>> > We do:
>> > 
>> > 	1. kvm_hypercall.
>> > 	2. get {sec, nsec, guest_tsc}.
>> > 	3. kvm_hypercall returns.
>> > 	4. delay = rdtsc() - guest_tsc.
>> > 
>> > Where delay is the delta (measured with the TSC) between points 2 and 4.
>> 
>> I see now ... the PTP interface is just not good for our purposes.
>> We don't return {sec, nsec, guest_tsc}, we just return {sec, nsec} at
>> some random time in the past.  And to make it a bit more accurate, you
>> add a best-effort delta before returning, which makes sense.
> 
> Not random time in the past. We return {sec, nsec} from the host
> realtime at the moment the user ran the hypercall. 

That is what we want, but {sec, nsec} is not anchored to any running
time, so it is unavoidably late when we return it.

> Since PTP is very accurate, that "a bit more" counts, yes.
> 
>> When we have to depend on pvclock, what are the advantages of not using
>> the existing pvclock API for wall clock?
>> (You mentioned some extensions.)
>> 
>>   struct pvclock_wall_clock {
>>   	u32   version;
>>   	u32   sec;
>>   	u32   nsec;
>>   } __attribute__((__packed__));
>
>> It gives the wall clock when pvclock was 0, so you just add current
>> kvmclock and get the host wall clock.  
> 
> Well, no. For one, the TSC part of kvmclock: 
> 
> 	kvmclock-read = system_timestamp + convert-to-1GHz(rdtsc() - tsc_timestamp)
> 				           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> Drifts relative to UTC. This part can be large.
> The guests NTP is responsible for fixing
> that drift of the guests realtime clock (talking about current setup, 
> without KVM PTP driver).
> 
> Now, we want very high precision (less than 1us) for this
> driver. Very small TSC drifts on a large delta defeat the purpose.

True.

>> Without a VM exit.
> 
> Huge performance is not an issue. Accuracy (how different from the host
> realtime clock our "approximation" of the host realtime clock) is.

Ah, ok.

>> And how often is ptp_kvm_gettime() usually called?
> 
> The PTP_SYS_OFFSET ioctl calls the following code in a loop:
> 
> struct ptp_sys_offset {
>         unsigned int n_samples; /* Desired number of measurements. */
>         unsigned int rsv[3];    /* Reserved for future use. */
>         /*
>          * Array of interleaved system/phc time stamps. The kernel
>          * will provide 2*n_samples + 1 time stamps, with the last
>          * one as a system time stamp.
>          */
>         struct ptp_clock_time ts[2 * PTP_MAX_SAMPLES + 1];
> };
> 
> #define PTP_MAX_SAMPLES 25 /* Maximum allowed offset measurement
> samples. */
> 
>         case PTP_SYS_OFFSET:
>                 sysoff = memdup_user((void __user *)arg,
> sizeof(*sysoff));
>                 if (IS_ERR(sysoff)) {
>                         err = PTR_ERR(sysoff);
>                         sysoff = NULL;
>                         break;
>                 }
>                 if (sysoff->n_samples > PTP_MAX_SAMPLES) {
>                         err = -EINVAL;
>                         break;
>                 }
>                 pct = &sysoff->ts[0];
>                 for (i = 0; i < sysoff->n_samples; i++) {
>                         getnstimeofday64(&ts);
>                         pct->sec = ts.tv_sec;
>                         pct->nsec = ts.tv_nsec;
>                         pct++;
>                         ptp->info->gettime64(ptp->info, &ts);
>                         pct->sec = ts.tv_sec;
>                         pct->nsec = ts.tv_nsec;
>                         pct++;

Hm, this loop alternates between system time and ptp time and I'd guess
that userspace then manipulates deltas of these two readings and applies
the result to system time.
I'm not even sure that pvclock delta improves the situation anymore --
it adds an offset to all gettime64() reads, but that can be computed
from multiple datapoints.  (And the delta also adds uncertainty to
results.)

>                 }
>                 getnstimeofday64(&ts);
>                 pct->sec = ts.tv_sec;
>                 pct->nsec = ts.tv_nsec;
> 
> How often that ioctl is called depends on the parameters of the Chrony
> PHC code. Initially (to determine the clock difference Chrony should call it 
> more frequently, later on it should call it less frequency).
> 
> Perhaps once every second initially (the ioctl). I'll confirm with the
> exact value for my setup and reply to this email.

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html