Re: [patch 3/3] PTP: add kvm PTP driver

Marcelo Tosatti <mtosatti@xxxxxxxxxx> · Mon, 16 Jan 2017 15:04:18 -0200

On Mon, Jan 16, 2017 at 05:26:53PM +0100, Radim Krcmar wrote:
> 2017-01-13 15:40-0200, Marcelo Tosatti:
> > On Fri, Jan 13, 2017 at 04:56:58PM +0100, Radim Krcmar wrote:
> > > 2017-01-13 10:01-0200, Marcelo Tosatti:
> >> > +		version = pvclock_read_begin(src);
> >> > +
> >> > +		ret = kvm_hypercall2(KVM_HC_CLOCK_OFFSET,
> >> > +				     clock_off_gpa,
> >> > +				     KVM_CLOCK_OFFSET_WALLCLOCK);
> >> > +		if (ret != 0) {
> >> > +			pr_err("clock offset hypercall ret %lu\n", ret);
> >> > +			spin_unlock(&kvm_ptp_lock);
> >> > +			preempt_enable_notrace();
> >> > +			return -EOPNOTSUPP;
> >> > +		}
> >> > +
> >> > +		tspec.tv_sec = clock_off.sec;
> >> > +		tspec.tv_nsec = clock_off.nsec;
> >> > +
> >> > +		delta = rdtsc_ordered() - clock_off.tsc;
> >> > +
> >> > +		offset = pvclock_scale_delta(delta, src->tsc_to_system_mul,
> >> > +					     src->tsc_shift);
> >> > +
> >> > +	} while (pvclock_read_retry(src, version));
> >> > +
> >> > +	preempt_enable_notrace();
> >> > +
> >> > +	tspec.tv_nsec = tspec.tv_nsec + offset;
> >> > +
> >> > +	spin_unlock(&kvm_ptp_lock);
> >> > +
> >> > +	if (tspec.tv_nsec >= NSEC_PER_SEC) {
> >> > +		u64 secs = tspec.tv_nsec;
> >> > +
> >> > +		tspec.tv_nsec = do_div(secs, NSEC_PER_SEC);
> >> > +		tspec.tv_sec += secs;
> >> > +	}
> >> > +
> >> > +	memcpy(ts, &tspec, sizeof(struct timespec64));
> >> 
> >> But the whole idea is of improving the time by reading tsc a bit later
> >> is just weird ... why is it better to provide
> >> 
> >>   tsc + x, time + tsc_delta_to_time(x)
> >> 
> >> than just
> >> 
> >>  tsc, time
> >> 
> >> ?
> > 
> > Because you want to calculate the value of the host realtime clock 
> > at the moment of ptp_kvm_gettime.
> > 
> > We do:
> > 
> > 	1. kvm_hypercall.
> > 	2. get {sec, nsec, guest_tsc}.
> > 	3. kvm_hypercall returns.
> > 	4. delay = rdtsc() - guest_tsc.
> > 
> > Where delay is the delta (measured with the TSC) between points 2 and 4.
> 
> I see now ... the PTP interface is just not good for our purposes.
> We don't return {sec, nsec, guest_tsc}, we just return {sec, nsec} at
> some random time in the past.  And to make it a bit more accurate, you
> add a best-effort delta before returning, which makes sense.

Not a random time in the past. We return {sec, nsec} from the host
realtime clock at the moment the hypercall was made.

Since PTP is very accurate, that "a bit more" counts, yes.
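
The compensation in steps 2-4 above can be sketched outside the kernel
roughly as follows. This is a minimal illustration, not the driver code:
scale_delta() mirrors the arithmetic of pvclock_scale_delta(), while
struct host_sample and compensate() are hypothetical names, and the
current TSC value is passed in as a parameter instead of being read
with rdtsc_ordered().

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Hypothetical stand-in for what the hypercall fills in (step 2). */
struct host_sample {
	uint64_t sec;	/* host realtime, seconds */
	uint64_t nsec;	/* host realtime, nanoseconds */
	uint64_t tsc;	/* guest TSC when sec/nsec were sampled */
};

/* (delta << shift) * mul_frac / 2^32, the pvclock_scale_delta() formula. */
static uint64_t scale_delta(uint64_t delta, uint32_t mul_frac, int shift)
{
	assert(shift > -64 && shift < 64);
	if (shift < 0)
		delta >>= -shift;
	else
		delta <<= shift;
	/* Split the 64x32 multiply so no 128-bit type is needed. */
	return (delta >> 32) * mul_frac +
	       (((delta & 0xffffffffULL) * mul_frac) >> 32);
}

/* Advance the sample by the time elapsed between s->tsc and tsc_now. */
static void compensate(struct host_sample *s, uint64_t tsc_now,
		       uint32_t mul_frac, int shift)
{
	uint64_t nsec = s->nsec + scale_delta(tsc_now - s->tsc, mul_frac, shift);

	/* Carry any overflow of the nanosecond field into seconds. */
	s->sec += nsec / NSEC_PER_SEC;
	s->nsec = nsec % NSEC_PER_SEC;
	s->tsc = tsc_now;
}
```

With mul_frac = 0x80000000 and shift = 1 (one nanosecond per TSC cycle,
an assumed 1 GHz TSC), a delay of 3000 cycles moves the returned time
forward by 3000 ns.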

> When we have to depend on pvclock, what are the advantages of not using
> the existing pvclock API for wall clock?
> (You mentioned some extensions.)
> 
>   struct pvclock_wall_clock {
>   	u32   version;
>   	u32   sec;
>   	u32   nsec;
>   } __attribute__((__packed__));

> It gives the wall clock when pvclock was 0, so you just add current
> kvmclock and get the host wall clock.  

Well, no. For one, the TSC part of kvmclock: 

	kvmclock-read = system_timestamp + convert-to-1GHz(rdtsc() - tsc_timestamp)
				           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
drifts relative to UTC. This drift can be large.
The guest's NTP is responsible for correcting the resulting drift
of the guest's realtime clock (talking about the current setup,
without the KVM PTP driver).

Now, we want very high precision (less than 1us) for this
driver. Very small TSC drifts on a large delta defeat the purpose.
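
To put rough numbers on that: the error from a frequency error grows
linearly with the size of the delta being converted. The helper below
is only an illustrative sketch (drift_error_ns() is a hypothetical
name, and the 1 ppm figure used afterwards is an assumption, not a
measurement).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Error accumulated when converting a TSC delta that spans delta_ns
 * nanoseconds, given a frequency error of drift_ppb parts per billion.
 */
static uint64_t drift_error_ns(uint64_t delta_ns, uint64_t drift_ppb)
{
	/* Guard against overflow of the 64-bit product. */
	assert(drift_ppb == 0 || delta_ns <= UINT64_MAX / drift_ppb);
	return delta_ns * drift_ppb / 1000000000ULL;
}
```

Assuming 1 ppm (1000 ppb) of drift, a delta of one second already
accumulates 1000 ns of error, which uses up the whole 1us budget, while
a delta of a few microseconds accumulates less than 1 ns.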

> Without a VM exit.

High performance is not the issue here. Accuracy (how far our
"approximation" of the host realtime clock is from the actual host
realtime clock) is.

> And how often is ptp_kvm_gettime() usually called?

The PTP_SYS_OFFSET ioctl calls the following code in a loop:

struct ptp_sys_offset {
        unsigned int n_samples; /* Desired number of measurements. */
        unsigned int rsv[3];    /* Reserved for future use. */
        /*
         * Array of interleaved system/phc time stamps. The kernel
         * will provide 2*n_samples + 1 time stamps, with the last
         * one as a system time stamp.
         */
        struct ptp_clock_time ts[2 * PTP_MAX_SAMPLES + 1];
};

#define PTP_MAX_SAMPLES 25 /* Maximum allowed offset measurement samples. */

        case PTP_SYS_OFFSET:
                sysoff = memdup_user((void __user *)arg, sizeof(*sysoff));
                if (IS_ERR(sysoff)) {
                        err = PTR_ERR(sysoff);
                        sysoff = NULL;
                        break;
                }
                if (sysoff->n_samples > PTP_MAX_SAMPLES) {
                        err = -EINVAL;
                        break;
                }
                pct = &sysoff->ts[0];
                for (i = 0; i < sysoff->n_samples; i++) {
                        getnstimeofday64(&ts);
                        pct->sec = ts.tv_sec;
                        pct->nsec = ts.tv_nsec;
                        pct++;
                        ptp->info->gettime64(ptp->info, &ts);
                        pct->sec = ts.tv_sec;
                        pct->nsec = ts.tv_nsec;
                        pct++;
                }
                getnstimeofday64(&ts);
                pct->sec = ts.tv_sec;
                pct->nsec = ts.tv_nsec;
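
One way a userspace consumer could reduce the interleaved stamps that
this loop produces is to keep the reading whose surrounding pair of
system stamps is closest together. This is a sketch under assumptions,
not necessarily what Chrony does: struct sample_time is a simplified
stand-in for struct ptp_clock_time, and best_offset_ns() is a
hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for struct ptp_clock_time. */
struct sample_time {
	int64_t sec;
	uint32_t nsec;
};

static int64_t to_ns(const struct sample_time *t)
{
	return t->sec * 1000000000LL + t->nsec;
}

/*
 * ts[] holds 2*n_samples + 1 interleaved stamps: sys, phc, sys, ..., sys.
 * Returns the PHC - system offset (ns) of the reading with the shortest
 * system-to-system window and stores that window in *delay_ns.
 */
static int64_t best_offset_ns(const struct sample_time *ts,
			      unsigned int n_samples, int64_t *delay_ns)
{
	int64_t best_offset = 0, best_delay = INT64_MAX;
	unsigned int i;

	assert(n_samples > 0);
	for (i = 0; i < n_samples; i++) {
		int64_t sys_before = to_ns(&ts[2 * i]);
		int64_t phc = to_ns(&ts[2 * i + 1]);
		int64_t sys_after = to_ns(&ts[2 * i + 2]);
		int64_t delay = sys_after - sys_before;

		if (delay < best_delay) {
			best_delay = delay;
			/* Assume the PHC was read midway through the window. */
			best_offset = phc - (sys_before + delay / 2);
		}
	}
	*delay_ns = best_delay;
	return best_offset;
}
```

The shorter the window between the two system stamps, the less the
gettime64() call (here, the hypercall plus the TSC compensation) can
have been delayed, so that reading is the most trustworthy one.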

How often that ioctl is called depends on the parameters of the Chrony
PHC code. Initially (to determine the clock difference) Chrony should
call it more frequently; later on it should call it less frequently.

Perhaps once every second initially (the ioctl). I'll confirm the
exact value for my setup and reply to this email.

> 
> Thanks.
> 
> >> Because we'll always be quering the time at tsc + y, where y >> x, and
> >> we'd likely have other problems if shifting the time base by few
> >> thousand cycles made a difference.
> > 
> > Radim, i didnt get your "tsc + x", "time + tsc_delta_to_time(x)"
> > formulas above. Can you be more verbose please?
> 
> x is the delta, tsc_delta_to_time() is what pvclock_scale_delta() does.
> 
> I assumed that we set precise time with TSC, so the delta wouldn't
> matter, because PTP would either get {sec, nsec, guest_tsc}, or the
> same, but just shifted by delta, hence
> {sec  + tsc_delta_to_time(x) / NSEC_PER_SEC,
>  nsec + tsc_delta_to_time(x) % NSEC_PER_SEC,
>  guest_tsc + x}.

Ah, OK. I take it you now understand the meaning of the "tsc"
part of the {sec, nsec, guest_tsc} triple.
