Re: VDSO pvclock may increase host cpu consumption, is this a problem?

Andy Lutomirski <luto@xxxxxxxxxxxxxx> · Tue, 1 Apr 2014 17:46:34 -0700

On Tue, Apr 1, 2014 at 5:29 PM, Marcelo Tosatti <mtosatti@xxxxxxxxxx> wrote:
> On Tue, Apr 01, 2014 at 12:17:16PM -0700, Andy Lutomirski wrote:
>> On Tue, Apr 1, 2014 at 11:01 AM, Marcelo Tosatti <mtosatti@xxxxxxxxxx> wrote:
>> > On Mon, Mar 31, 2014 at 10:33:41PM -0700, Andy Lutomirski wrote:
>> >> On Mar 31, 2014 8:45 PM, "Marcelo Tosatti" <mtosatti@xxxxxxxxxx> wrote:
>> >> >
>> >> > On Mon, Mar 31, 2014 at 10:52:25AM -0700, Andy Lutomirski wrote:
>> >> > > On 03/29/2014 01:47 AM, Zhanghailiang wrote:
>> >> > > > Hi,
>> >> > > > I found that when the guest is idle, VDSO pvclock may increase host CPU consumption.
>> >> > > > We can calculate it as follows; correct me if I am wrong.
>> >> > > >       (Host)250 * update_pvclock_gtod = 1500 * gettimeofday(Guest)
>> >> > > > On the host, VDSO pvclock introduces a notifier chain, pvclock_gtod_chain, in timekeeping.c. It consumes nearly 900 cycles per call. So at 250 Hz, it may consume 225,000 cycles per second, even when no VM has been created.
>> >> > > > In the guest, gettimeofday consumes 220 cycles per call with VDSO pvclock. If no-kvmclock-vsyscall is configured, gettimeofday consumes 370 cycles per call. The feature saves 150 cycles per call.
>> >> > > > When gettimeofday is called 1500 times, it saves 225,000 cycles, equal to the host consumption.
>> >> > > > Both host and guest run linux-3.13.6.
>> >> > > > So, is the host CPU consumption a problem?
>> >> > >
>> >> > > Does pvclock serve any real purpose on systems with fully-functional
>> >> > > TSCs?  The x86 guest implementation is awful, so it's about 2x slower
>> >> > > than TSC.  It could be improved a lot, but I'm not sure I understand why
>> >> > > it exists in the first place.
>> >> >
>> >> > VM migration.
>> >>
>> >> Why does that need percpu stuff?  Wouldn't it be sufficient to
>> >> interrupt all CPUs (or at least all cpus running in userspace) on
>> >> migration and update the normal timing data structures?
>> >
>> > Are you suggesting to allow interruption of the timekeeping code
>> > at any time to update frequency information ?
>>
>> I'm not sure what you mean by "interruption of the timekeeping code".
>> I'm suggesting sending an interrupt to the guest (via a virtio device,
>> presumably) to tell it that it has been paused and resumed.
>>
>> This is probably worth getting John's input if you actually want to do
>> this.  I'm not about to :)
>
> Honestly, neither am I at the moment. But I'll think about it.
>
>> Is there any case in which the TSC is stable and the kvmclock data for
>> different cpus is actually different?
>
> No. However, kvmclock_data.flags field is an interface for watchdog
> unpause.
>
>> > Do you want to do that as a special tsc clocksource driver?
>> >
>> >> Even better: have the VM offer to invalidate the physical page
>> >> containing the kernel's clock data on migration and interrupt one CPU.
>> >>  If another CPU races, it'll fault and wait for the guest kernel to
>> >> update its timing.
>> >
>> > Perhaps that is a good idea.
>> >
>> >> Does the current kvmclock stuff track CLOCK_MONOTONIC and
>> >> CLOCK_REALTIME separately?
>> >
>> > No. kvmclock counting is interrupted on vm pause (the "hw" clock does not
>> > count during vm pause).
>>
>> Makes sense.
>>
>> >
>> >> > Can you explain why you consider it so bad ? How you think it could be
>> >> > improved ?
>> >>
>> >> The second rdtsc_barrier looks unnecessary.  Even better, if rdtscp is
>> >> available, then rdtscp can replace rdtsc_barrier, rdtsc, and the
>> >> getcpu call.
>> >>
>> >> It would also be nice to avoid having two sets of rescalings of the timing data.
>> >
>> > Yep, probably good improvements, patches are welcome :-)
>> >
>>
>> I may get to it at some point.  No guarantees.  I did just rewrite all
>> the mapping-related code for every other x86 vdso timesource, so maybe
>> I should try to add this to the pile.  The fact that the data is a
>> variable number of pages makes it messy, though, and since I don't
>> understand why there's a separate structure for each CPU, I'm hesitant
>> to change it too much.
>>
>> --Andy
>
> kvmclock.data? Because each VCPU can have different .flags fields for
> example.

It looks like the vdso kvmclock code only runs if
PVCLOCK_TSC_STABLE_BIT is set, which in turn is only the case if the
TSC is guaranteed to be monotonic across all CPUs.  If we can rely on
the fact that that bit will only be set if tsc_to_system_mul and
tsc_shift are the same on all CPUs and that (system_time -
((tsc_timestamp * mul) >> shift)) is the same on all CPUs, then there
should be no reason for the vdso to read the pvclock data for anything
but CPU 0.  That will make it a lot faster and simpler.

Can we rely on that?
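
To make the question concrete, here is a minimal sketch of the
condition I have in mind.  It is not the kernel's code: the struct is
an illustrative copy whose field names follow pvclock_vcpu_time_info,
and scale_delta, pv_clock_ns and pv_same_clock are hypothetical
helpers.  It also ignores the version/seqlock retry loop and the
barriers around the TSC read.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative per-vCPU record (assumption: simplified layout,
 * not the real kernel header). */
struct pv_time_info {
    uint32_t version;
    uint64_t tsc_timestamp;     /* TSC value when the record was written */
    uint64_t system_time;       /* nanoseconds at tsc_timestamp */
    uint32_t tsc_to_system_mul; /* 32.32 fixed-point multiplier */
    int8_t   tsc_shift;         /* pre-shift applied to the TSC delta */
    uint8_t  flags;
};

/* Apply the pre-shift, then compute (delta * mul) >> 32 without a
 * 128-bit type by splitting delta into 32-bit halves. */
static uint64_t scale_delta(uint64_t delta, uint32_t mul, int8_t shift)
{
    assert(shift > -64 && shift < 64); /* wider shifts are undefined in C */
    if (shift < 0)
        delta >>= -shift;
    else
        delta <<= shift;

    uint64_t hi = delta >> 32;
    uint64_t lo = delta & 0xffffffffu;
    return hi * mul + ((lo * mul) >> 32);
}

/* Nanoseconds for a given TSC reading according to one record. */
static uint64_t pv_clock_ns(const struct pv_time_info *info, uint64_t tsc)
{
    return info->system_time +
           scale_delta(tsc - info->tsc_timestamp,
                       info->tsc_to_system_mul, info->tsc_shift);
}

/* True if two records describe the same clock: identical scaling and
 * the same result for a common TSC value.  probe_tsc must not be
 * earlier than either record's tsc_timestamp. */
static bool pv_same_clock(const struct pv_time_info *a,
                          const struct pv_time_info *b,
                          uint64_t probe_tsc)
{
    return a->tsc_to_system_mul == b->tsc_to_system_mul &&
           a->tsc_shift == b->tsc_shift &&
           pv_clock_ns(a, probe_tsc) == pv_clock_ns(b, probe_tsc);
}
```

If pv_same_clock held for every pair of vCPUs whenever the stable bit
is set, reading CPU 0's record alone would be enough.  Note that the
exact-equality test is strict: truncation in the scaling step can make
two otherwise equivalent records differ by a nanosecond.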

I wonder what happens if the guest runs ntpd or otherwise uses
adjtimex.  Presumably it starts drifting relative to the host.

--Andy