Re: VDSO pvclock may increase host cpu consumption, is this a problem?

Andy Lutomirski <luto@xxxxxxxxxxxxxx> · Wed, 2 Apr 2014 15:31:56 -0700

On Wed, Apr 2, 2014 at 3:05 PM, Marcelo Tosatti <mtosatti@xxxxxxxxxx> wrote:
> On Tue, Apr 01, 2014 at 05:46:34PM -0700, Andy Lutomirski wrote:
>> On Tue, Apr 1, 2014 at 5:29 PM, Marcelo Tosatti <mtosatti@xxxxxxxxxx> wrote:
>> > On Tue, Apr 01, 2014 at 12:17:16PM -0700, Andy Lutomirski wrote:
>> >> On Tue, Apr 1, 2014 at 11:01 AM, Marcelo Tosatti <mtosatti@xxxxxxxxxx> wrote:
>> >> > On Mon, Mar 31, 2014 at 10:33:41PM -0700, Andy Lutomirski wrote:
>> >> >> On Mar 31, 2014 8:45 PM, "Marcelo Tosatti" <mtosatti@xxxxxxxxxx> wrote:
>> >> >> >
>> >> >> > On Mon, Mar 31, 2014 at 10:52:25AM -0700, Andy Lutomirski wrote:
>> >> >> > > On 03/29/2014 01:47 AM, Zhanghailiang wrote:
>> >> >> > > > Hi,
>> >> >> > > > I found that when the guest is idle, VDSO pvclock may increase host CPU consumption.
>> >> >> > > > We can calculate it as follows; correct me if I am wrong.
>> >> >> > > >       (Host)250 * update_pvclock_gtod = 1500 * gettimeofday(Guest)
>> >> >> > > > On the host, VDSO pvclock introduces a notifier chain, pvclock_gtod_chain, in timekeeping.c. It consumes nearly 900 cycles per call. So at 250 Hz, it may consume 225,000 cycles per second, even when no VM has been created.
>> >> >> > > > In the guest, gettimeofday consumes 220 cycles per call with VDSO pvclock. If no-kvmclock-vsyscall is configured, gettimeofday consumes 370 cycles per call. The feature saves 150 cycles per call.
>> >> >> > > > After 1500 gettimeofday calls, it has saved 225,000 cycles, equal to the host consumption.
>> >> >> > > > Both host and guest run linux-3.13.6.
>> >> >> > > > So, is the host CPU consumption a problem?
>> >> >> > >
>> >> >> > > Does pvclock serve any real purpose on systems with fully-functional
>> >> >> > > TSCs?  The x86 guest implementation is awful, so it's about 2x slower
>> >> >> > > than TSC.  It could be improved a lot, but I'm not sure I understand why
>> >> >> > > it exists in the first place.
>> >> >> >
>> >> >> > VM migration.
>> >> >>
>> >> >> Why does that need percpu stuff?  Wouldn't it be sufficient to
>> >> >> interrupt all CPUs (or at least all cpus running in userspace) on
>> >> >> migration and update the normal timing data structures?
>> >> >
>> >> > Are you suggesting to allow interruption of the timekeeping code
>> >> > at any time to update frequency information ?
>> >>
>> >> I'm not sure what you mean by "interruption of the timekeeping code".
>> >> I'm suggesting sending an interrupt to the guest (via a virtio device,
>> >> presumably) to tell it that it has been paused and resumed.
>> >>
>> >> This is probably worth getting John's input if you actually want to do
>> >> this.  I'm not about to :)
>> >
>> > Honestly, neither am I at the moment. But I'll think about it.
>> >
>> >> Is there any case in which the TSC is stable and the kvmclock data for
>> >> different cpus is actually different?
>> >
>> > No. However, kvmclock_data.flags field is an interface for watchdog
>> > unpause.
>> >
>> >> > Do you want to do that as a special tsc clocksource driver ?
>> >> >
>> >> >> Even better: have the VM offer to invalidate the physical page
>> >> >> containing the kernel's clock data on migration and interrupt one CPU.
>> >> >>  If another CPU races, it'll fault and wait for the guest kernel to
>> >> >> update its timing.
>> >> >
>> >> > Perhaps that is a good idea.
>> >> >
>> >> >> Does the current kvmclock stuff track CLOCK_MONOTONIC and
>> >> >> CLOCK_REALTIME separately?
>> >> >
>> >> > No. kvmclock counting is interrupted on vm pause (the "hw" clock does not
>> >> > count during vm pause).
>> >>
>> >> Makes sense.
>> >>
>> >> >
>> >> >> > Can you explain why you consider it so bad ? How you think it could be
>> >> >> > improved ?
>> >> >>
>> >> >> The second rdtsc_barrier looks unnecessary.  Even better, if rdtscp is
>> >> >> available, then rdtscp can replace rdtsc_barrier, rdtsc, and the
>> >> >> getcpu call.
>> >> >>
>> >> >> It would also be nice to avoid having two sets of rescalings of the timing data.
>> >> >
>> >> > Yep, probably good improvements, patches are welcome :-)
>> >> >
>> >>
>> >> I may get to it at some point.  No guarantees.  I did just rewrite all
>> >> the mapping-related code for every other x86 vdso timesource, so maybe
>> >> I should try to add this to the pile.  The fact that the data is a
>> >> variable number of pages makes it messy, though, and since I don't
>> >> understand why there's a separate structure for each CPU, I'm hesitant
>> >> to change it too much.
>> >>
>> >> --Andy
>> >
>> > kvmclock.data? Because each VCPU can have different .flags fields for
>> > example.
>>
>> It looks like the vdso kvmclock code only runs if
>> PVCLOCK_TSC_STABLE_BIT is set, which in turn is only the case if the
>> TSC is guaranteed to be monotonic across all CPUs.  If we can rely on
>> the fact that that bit will only be set if tsc_to_system_mul and
>> tsc_shift are the same on all CPUs and that (system_time -
>> (tsc_timestamp * mul) >> shift) is the same on all CPUs, then there
>> should be no reason for the vdso to read the pvclock data for anything
>> but CPU 0.  That will make it a lot faster and simpler.
>>
>> Can we rely on that?
>
> In theory yes, but you would have to handle
>
> PVCLOCK_TSC_STABLE_BIT set -> PVCLOCK_TSC_STABLE_BIT not set
>
> Transition (and the other way around as well).

Since !STABLE already results in a real syscall for clock_gettime and
gettimeofday, I don't think this is a real hardship for the vdso.
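
To make that concrete, here is a rough sketch of what a single-record
fast path with a !STABLE fallback could look like.  This is not the
actual vdso code: struct pvclock_sketch, scale_delta() and
fast_read_ns() are made-up names, the struct layout is simplified, and
the version/retry loop and barriers that a real reader needs are left
out.  The TSC value is passed in so the sketch does not depend on rdtsc.

```c
#include <stdbool.h>
#include <stdint.h>

#define PVCLOCK_TSC_STABLE_BIT (1 << 0)

/* Simplified, hypothetical stand-in for one pvclock record. */
struct pvclock_sketch {
	uint32_t version;           /* unused here; a real reader retries on change */
	uint64_t tsc_timestamp;     /* TSC value when system_time was sampled */
	uint64_t system_time;       /* nanoseconds at tsc_timestamp */
	uint32_t tsc_to_system_mul; /* 32.32 fixed-point multiplier */
	int8_t   tsc_shift;         /* pre-multiply shift, may be negative */
	uint8_t  flags;
};

/* Scale a TSC delta to ns: shift first, then compute (delta * mul) >> 32
 * without needing a 128-bit intermediate. */
static uint64_t scale_delta(uint64_t delta, uint32_t mul, int8_t shift)
{
	if (shift < 0)
		delta >>= -shift;
	else
		delta <<= shift;

	return (delta >> 32) * mul + (((delta & 0xffffffffu) * mul) >> 32);
}

/* Fast path: returns true and fills *ns when the STABLE bit is set.
 * Returns false when it is clear, in which case the caller is expected
 * to fall back to the real syscall. */
static bool fast_read_ns(const struct pvclock_sketch *p, uint64_t tsc,
			 uint64_t *ns)
{
	if (!(p->flags & PVCLOCK_TSC_STABLE_BIT))
		return false;

	*ns = p->system_time +
	      scale_delta(tsc - p->tsc_timestamp, p->tsc_to_system_mul,
			  p->tsc_shift);
	return true;
}
```

The point is only that a STABLE -> !STABLE transition shows up as the
false return, which is the same slow path that !STABLE already takes.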

>
>> I wonder what happens if the guest runs ntpd or otherwise uses
>> adjtimex.  Presumably it starts drifting relative to the host.
>
> It should use ntpd and adjtimex.  KVMCLOCK is the "hw" clock,
> the values returned by CLOCK_REALTIME and CLOCK_MONOTONIC are built
> by the Linux guest timekeeping subsystem on top of the "hw" clock.
>

If the kernel can guarantee that, then the timing code gets faster,
since the cyc2ns scale will be unity.  Maybe this is worth a branch.

Anyway, I'll try to find some time to improve this if/when hpa picks
up my current series of vdso cleanups.  I suspect that the overall
effect will be a 30-40% speedup in clock_gettime along with a decent
reduction of code complexity.
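
For reference, the break-even arithmetic from the original report (250
Hz, roughly 900 cycles per notifier call, 370 vs. 220 cycles per
gettimeofday) can be reproduced with the small sketch below.  The
helper names are made up for illustration, and the inputs are the
reporter's measurements, not mine.

```c
#include <stdint.h>

/* Host cycles per second spent in the notifier chain. */
static uint64_t host_cost_per_sec(uint64_t hz, uint64_t cycles_per_update)
{
	return hz * cycles_per_update;
}

/* Guest calls per second needed for the vdso savings to match the host
 * cost, rounded up.  Assumes slow_cycles > fast_cycles. */
static uint64_t break_even_calls(uint64_t host_cost, uint64_t slow_cycles,
				 uint64_t fast_cycles)
{
	uint64_t saved = slow_cycles - fast_cycles;

	return (host_cost + saved - 1) / saved;
}
```

With the reported numbers, host_cost_per_sec(250, 900) is 225,000
cycles and break_even_calls(225000, 370, 220) is 1500 calls per second.
Any reduction in the guest's per-call cost moves that break-even point
lower.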

--Andy