On Fri, Mar 05, 2010 at 06:50:47AM -1000, Zachary Amsden wrote:
> On 03/05/2010 04:27 AM, Daniel P. Berrange wrote:
> >
> >  * HPET
> >      Multiple timers with periodic interrupts
> >      Can replace PIT/RTC timers
> >
> >They all generally suck in real hardware, and this gets worse in virtual
> >machines.
> >Many different approaches to making them suck less in VMWare, Xen & KVM,
> >but there are some reasonably common concepts....
> >
>
> HPET doesn't suck.

The VMWare timekeeping docs mention that it has timeout race conditions,
a poorly defined spec for timer granularity, drift & speed of access, & bad
implementations in the real world, which I read as 'sucks' ;-)

> >  * Interrupt timers
> >
> >     - Ticks can not always be delivered on time
> >
> >       Policies to deal with "missed" ticks:
> >
> >         1. Deliver at normal rate without catchup
> >         2. Deliver at higher rate to catch up
> >         3. Merge into 1 tick & deliver asap
> >         4. Discard all missed ticks
> >
>
> The issue is actually more complex than just these policies.  A naive
> implementation of the policy leads to a guest DOS of the host.
>
> We actually have such a bug, and it demands a policy which merges ticks
> over a certain threshold and does not deliver ASAP.  It's tricky and
> complex to fix because it means our notion of timers for the guest is
> wrong, and we need to introduce a higher order scheduling behaviour.
>
> In general, there isn't much we can tune here, but what we can tune is
> whether the other counters (RTC / HPET / TSC / ACPI) stay in sync with
> ticks delivered.  It's not perfect or completely well defined because
> the tick can't actually be delivered until a fairly complex set of
> hardware rules is obeyed.  This may not be apparent now, because it gets
> worse as we implement more hardware support for NMIs and SMIs.  An ideal
> solution would sync the other counters when the tick is generated, not
> when it is injected.  However, this leads us back to the DOS attack.
>
> There are also problems with SMP timing here (which CPU gets timer
> interrupts can change, and are they broadcast?).  These problems are
> made worse because we don't gang schedule.

FYI, I wasn't trying to suggest good / bad policies here. I was just
attempting to document the policies that I see have been implemented so
far. For the libvirt XML the key issue is to identify a way to list possible
policies that can be extended as new ones appear in hypervisors.

> >  * TSC
> >     - rdtsc instruction can be exposed to guests in two ways
> >
> >         1. Trap + emulate (slow, but more reliable)
> >         2. Native (fast, but possibly unreliable)
> >
> >       Optionally also expose a 'rdtscp' instruction
> >
> >       Possibly set a fixed HZ independent of host.
> >
>
> There is also
>
> 3) a mixed approach; trap and emulate only when required, allow native
> access and offset appropriately at each exit; and
>
> 4) a SMP safe approach; trap and emulate always, and interlock SMP
> access to the clock so it is globally consistent
>
> 5) a secure approach; trap and emulate always and hide host time.  This
> precludes the possibility of SMP, as timing differences can be observed
> since we don't gang schedule.  This obviously has implications for the
> other timers.
>
> So this variable is not a simple boolean, but a multi-choice.

Yep, I captured this increased range of options later after seeing that
Xen has 4 possible choices now!
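To make the missed-tick policy list above a little more concrete, here is a
rough Python sketch of how the four policies differ in how many of the
missed ticks the guest eventually sees. The names 'delay' and 'discard', the
TickPolicy class and the backlog_ticks() function are all invented for
illustration (only 'catchup' and 'merge' appear in the proposed XML). It
ignores the delivery rate entirely, and with it the host DOS problem
described above.

```python
from enum import Enum


class TickPolicy(Enum):
    """Illustrative names for the four policies; not an agreed schema."""
    DELAY = "delay"      # 1. deliver at normal rate without catchup
    CATCHUP = "catchup"  # 2. deliver at higher rate to catch up
    MERGE = "merge"      # 3. merge into 1 tick & deliver asap
    DISCARD = "discard"  # 4. discard all missed ticks


def backlog_ticks(policy, missed):
    """Return how many of the `missed` ticks are still injected.

    Unknown policy names raise ValueError, so supporting a new policy
    means adding one enum member and one branch here.
    """
    policy = TickPolicy(policy)
    if policy in (TickPolicy.DELAY, TickPolicy.CATCHUP):
        # Both deliver every tick; they differ only in how fast.
        return missed
    if policy is TickPolicy.MERGE:
        # The whole backlog collapses into a single tick.
        return min(missed, 1)
    # DISCARD: the backlog is dropped.
    return 0
```

Keeping the policy as an enumeration of names rather than a boolean is what
leaves room for further values as hypervisors grow new behaviours.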
> >------------------
> >
> >  * All timers run in "apparent time" ie track guest wallclock
> >  * Missed tick policy is to deliver at higher rate to catch up
> >  * TSC can be switched between native/emulate (virtual_rdtsc=TRUE|FALSE)
> >  * TSC can have hardcoded HZ in emulate mode (apparentHZ=VALUE)
> >  * RTC time of day is synced to host at startup (rtc.diffFromUTC or
> >    rtc.startTime)
> >  * VMWare tools reset guest TOD if it gets out of sync
> >
>
> There is also lateness hiding; (timeTracker.hideLateness); adjust TSC to
> compensate for lateness of injected interrupts (it's the slightly buggy
> counter compensation at each tick I mention above).

Thanks, I'd not seen any reference to that one in the docs.

> >Xen timekeeping
> >---------------
> >
> >  * TSC.  Can run in 4 modes
> >
> >     - auto:    emulate if host TSC is unstable. native with invariant TSC
> >     - native:  always native regardless of host TSC stability
> >     - emulate: trap + emulate regardless of host TSC invariant
> >     - pvrdtsc: native, requiring invariant TSC. Also exposes rdtscp
> >                instruction
> >
>
> TSC is complex enough without RDTSCP.  Let's consider rdtscp as a host
> optimization for vendors of hardware with buggy clocks who want fast
> gettimeofday system calls.  We already are compensating to try to keep
> virtual TSC in sync on KVM and probably don't need this mode.

I included rdtscp because it is one of the things that the latest Xen 4.0
tree now implements, so we need to be able to represent it in the libvirt
XML.

> >Meaning of 'mode':
> >
> >   Control how the clock is exposed to guest.
> >
> >      auto:     native if safe, otherwise emulate
> >      native:   always native
> >      emulate:  always emulate
> >      paravirt: native + paravirtualize
> >
> >   NB: Only relevant for TSC. All other timers are always emulated.
> >
>
> auto, native, emulate can map nicely for us, but it would be good to
> have an smp safe mode.  (A secure mode is more of a global setting for
> all timers).

For any of the enumerations I fully expect that we would add further allowed
values to the libvirt XML over time. The goal is to get the baseline on
current implementations & try to keep it easily extensible for future ideas.

> >Mapping to VMWare
> >-----------------
> >
> >eg with guest config showing
> >
> >    diffFromUTC='123456'
> >    apparentHZ='123456'
> >    virtual_rdtsc=False
> >
> >libvirt XML gets:
> >
> >    <clock mode='variable' adjustment='123456'>
> >       <timer name='tsc' frequency='123456' mode='native'/>
> >    </clock>
> >
> >
> >Mapping to Xen
> >--------------
> >
> >eg with guest config showing
> >
> >    timer_mode=3
> >    hpet=1
> >    tsc_mode=2
> >    localtime=1
> >
> >    <clock mode='localtime'>
> >       <timer name='platform' tickpolicy='merge' wallclock='host'/>
> >       <timer name='hpet'/>
> >       <timer name='tsc' mode='native'/>
> >    </clock>
> >
> >
> >Mapping to KVM
> >--------------
> >
> >eg with guest ARGV showing
> >
> >    -no-kvm-pit-reinjection
> >    -clock base=localtime,clock=guest,driftfix=slew
> >    -no-hpet
> >
> >
> >    <clock mode='localtime'>
> >       <timer name='rtc' tickpolicy='catchup' wallclock='guest'/>
> >       <timer name='pit' tickpolicy='none'/>
> >       <timer name='hpet' present='no'/>
> >    </clock>
> >
> >
> >
> >Further reading
> >---------------
> >
> >VMWare has the best doc:
> >
> >   http://www.vmware.com/pdf/vmware_timekeeping.pdf
> >
> >Xen:
> >
> >   Docs on 'tsc_mode' at
> >
> >     $SOURCETREE/docs/misc/tscmode.txt
> >
> >   Docs for 'timer_mode' in the source code only:
> >
> >     xen/include/public/hvm/params.h
> >
> >KVM:
> >
> >   No docs at all. Guess from -help descriptions, reading source code &
> >   asking clever people about it :-)
> >
>
> Let me propose an XML mapping a bit later today.  I haven't had coffee
> yet, and we know what that can do.

Ok, thanks for the feedback so far.
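
For reference, the KVM mapping quoted above can be sketched in a few lines
of Python. This only reproduces the single example from this mail: the
function name kvm_argv_to_clock(), the default mode of 'utc' when no base
is given, and the rule driftfix=slew => tickpolicy='catchup' are
assumptions for illustration, not anything libvirt or QEMU defines.

```python
import xml.etree.ElementTree as ET


def kvm_argv_to_clock(argv):
    """Build a <clock> element from the three KVM flags shown above.

    Hypothetical helper: it knows only the flags from the example.
    """
    mode = "utc"  # assumption: default when no base= option is present
    timers = {}
    args = iter(argv)
    for arg in args:
        if arg == "-clock":
            # The option value looks like "base=localtime,clock=guest,..."
            opts = dict(kv.split("=", 1)
                        for kv in next(args, "").split(",") if "=" in kv)
            mode = opts.get("base", mode)
            rtc = {}
            if opts.get("driftfix") == "slew":
                rtc["tickpolicy"] = "catchup"
            if "clock" in opts:
                rtc["wallclock"] = opts["clock"]
            timers["rtc"] = rtc
        elif arg == "-no-kvm-pit-reinjection":
            timers["pit"] = {"tickpolicy": "none"}
        elif arg == "-no-hpet":
            timers["hpet"] = {"present": "no"}

    clock = ET.Element("clock", mode=mode)
    # Emit timers in the same order as the example XML.
    for name in ("rtc", "pit", "hpet"):
        if name in timers:
            ET.SubElement(clock, "timer", name=name, **timers[name])
    return ET.tostring(clock, encoding="unicode")


print(kvm_argv_to_clock([
    "-no-kvm-pit-reinjection",
    "-clock", "base=localtime,clock=guest,driftfix=slew",
    "-no-hpet",
]))
```

Collecting the timers first and emitting them afterwards keeps the output
independent of the order in which the flags appear in ARGV.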

Regards,
Daniel
-- 
|: Red Hat, Engineering, London    -o-   http://people.redhat.com/berrange/ :|
|: http://libvirt.org -o- http://virt-manager.org -o- http://deltacloud.org :|
|: http://autobuild.org        -o-         http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505  -o-   F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|