On Wed, Oct 19, 2016 at 05:42:16PM +0200, Radim Krčmář wrote:
> 2016-10-19 11:55-0200, Eduardo Habkost:
> > On Wed, Oct 19, 2016 at 03:27:52PM +0200, Radim Krčmář wrote:
> >> 2016-10-18 19:05-0200, Eduardo Habkost:
> >> > On Tue, Oct 18, 2016 at 10:52:14PM +0200, Radim Krčmář wrote:
> >> > [...]
> >> >> The main problem is that QEMU changes virtual_tsc_khz when
> >> >> migrating without hardware scaling, so KVM is forced to get
> >> >> nanoseconds wrong ...
> >> >>
> >> >> If QEMU doesn't want to keep the TSC frequency constant, then it
> >> >> would be better if it didn't expose TSC in CPUID -- the guest
> >> >> would just use kvmclock without being tempted by direct TSC
> >> >> accesses.
> >> >
> >> > Isn't it enough to simply not expose invtsc? Aren't guests
> >> > expected to assume the TSC frequency can change if invtsc isn't
> >> > set in CPUID?
> >>
> >> There are exceptions. An OS can assume a constant TSC on some
> >> models that QEMU emulates: coreduo, core2duo, Conroe, Penryn, n270,
> >> kvm32 and kvm64. The list from the SDM (17.15 TIME-STAMP COUNTER):
> >>
> >>   Pentium 4 processors, Intel Xeon processors (family [0FH], models
> >>   [03H and higher]); Intel Core Solo and Intel Core Duo processors
> >>   (family [06H], model [0EH]); the Intel Xeon processor 5100 series
> >>   and Intel Core 2 Duo processors (family [06H], model [0FH]);
> >>   Intel Core 2 and Intel Xeon processors (family [06H],
> >>   DisplayModel [17H]); Intel Atom processors (family [06H],
> >>   DisplayModel [1CH])
> >>
> >> Another sad part is that Linux uses the following condition to
> >> assume a constant TSC frequency:
> >>
> >> 	if ((c->x86 == 0xf && c->x86_model >= 0x03) ||
> >> 	    (c->x86 == 0x6 && c->x86_model >= 0x0e))
> >> 		set_cpu_cap(c, X86_FEATURE_CONSTANT_TSC);
> >>
> >> which sets constant TSC for all modern processors. It's not a
> >> problem on real hardware, because all modern processors likely have
> >> an invariant TSC.
> >>
> >> Fun fact: Linux shows the constant_tsc flag in /proc/cpuinfo even
> >> if a modern CPU doesn't expose TSC in CPUID.
> >>
> >> Considering that Linux is fixed on Nehalem and newer processors, we
> >> have a few options for the rest:
> >>  1) treat TSC like invariant TSC on those models (the guest cannot
> >>     use ACPI state, so its OS might assume that they are equivalent)
> >>  2) hide TSC on those models
> >>  3) ignore the problem
> >>  4) remove those models
> >>
> >> I don't know enough about QEMU design goals to guess which one is
> >> the most appropriate. (4) is the clear winner for me, followed by
> >> (3). :)
> >
> > (4) can't be implemented because it breaks existing
> > configurations. (3) is the current solution.
>
> Existing machine types must remain compatible, but isn't it possible
> to cull options in new machine types?

We specifically promised libvirt developers that a CPU model that can
be started with a machine-type should still be runnable with other
versions of the same machine-type family. In other words, a running
config should keep working if only the machine-type version changed.

> > Option (2) sounds attractive to me, but seems risky.
>
> Definitely.
> If users have a setup that works, then any change can break it.
>
> It would have been the best option a few years back when we wrote the
> code, but now the change will happen *in* the guest, so we can't
> control it as in the case of (4), where broken guests won't start, or
> (1), where broken guests won't migrate.
>
> > I would like
> > to understand the consequences for guests. What could stop
> > working if we remove TSC? What about kvmclock?
>
> Hiding TSC in CPUID doesn't disable the RDTSC instruction in the
> guest.
>
> kvmclock is a paravirtual device on top of TSC, so if kvmclock is
> present, then it should be safe to assume that the guest can use TSC
> for operations with kvmclock.
> Linux does that, but I don't think this behavior was ever written
> down, so other kvmclock users could break.
>
> Maybe the Hyper-V TSC page would stop working, because Windows and
> other users could check for CPUID.1:EDX.TSC separately.
> Linux's implementation would work, because it just checks for the
> paravirtual feature, as in the case of kvmclock.
>
> And the minor cases are: an OS that has no option other than TSC for
> a clock; userspace that checks TSC before using it; an OS that stops
> setting CR4.TSD so its userspace starts to use TSC; and probably many
> others.
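Just to make the difference between the two detection paths concrete,
here is roughly what they look like from the guest side. This is a
minimal sketch for GCC/Clang on x86 using <cpuid.h>; the function
names are mine, and a real guest would also check the hypervisor bit
(CPUID.1:ECX[31]) before touching the 0x4000000x leaves:

    #include <cpuid.h>
    #include <stdbool.h>
    #include <string.h>

    /* Architectural path: CPUID.1:EDX[4] advertises TSC. */
    static bool guest_sees_cpuid_tsc(void)
    {
        unsigned int eax, ebx, ecx, edx;

        __cpuid(1, eax, ebx, ecx, edx);
        return edx & (1u << 4);
    }

    /* Paravirtual path: KVM signature leaf, then the feature leaf.
     * Bit 0 is KVM_FEATURE_CLOCKSOURCE (the old MSR pair), bit 3 is
     * KVM_FEATURE_CLOCKSOURCE2 (MSRs 0x4b564d00/0x4b564d01). */
    static bool guest_sees_kvmclock(void)
    {
        unsigned int eax, ebx, ecx, edx;
        char sig[13];

        __cpuid(0x40000000, eax, ebx, ecx, edx);
        memcpy(sig + 0, &ebx, 4);
        memcpy(sig + 4, &ecx, 4);
        memcpy(sig + 8, &edx, 4);
        sig[12] = '\0';
        if (strcmp(sig, "KVMKVMKVM") != 0)
            return false;

        __cpuid(0x40000001, eax, ebx, ecx, edx);
        return eax & ((1u << 0) | (1u << 3));
    }

If I read you correctly, hiding TSC flips the first check but leaves
the second one (and therefore Linux's kvmclock and Hyper-V TSC page
code) untouched, and the breakage comes from guests that insist on
the first.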
OK, that sounds very risky. This means it is probably better to let
management software explicitly choose the new, stricter behavior.

...and we already have a mechanism to request stricter behavior:
explicitly disabling TSC, or setting tsc-frequency on the
command-line.

> > If we implement (2), we could even add an extra check that blocks
> > migration (or at least prints a warning) in case:
> > 1) TSC is forcibly enabled in the configuration;
> > 2) TSC scaling is not available on the destination; and
> > 3) the family/model values match the ones on the list above.
> >
> > And we could even keep TSC enabled by default for users who don't
> > want migration (using migratable=false).
>
> That would be nice.

We already print a warning if there's a TSC frequency mismatch without
TSC scaling. I wonder if we should reduce false positives by printing
it only when the family/model is on the list above (or if invtsc is
enabled).
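Concretely, I'm thinking of something along these lines. A sketch
only -- none of these names exist in QEMU, and the family/model test
is the same one Linux uses, which subsumes the SDM list above:

    #include <stdbool.h>
    #include <stdio.h>

    /* Hypothetical config snapshot; not an actual QEMU structure. */
    struct vcpu_cfg {
        unsigned family, model;
        bool tsc_enabled_by_user;  /* TSC explicitly enabled, e.g. "+tsc" */
        bool invtsc;               /* invtsc exposed to the guest */
    };

    /* Models on which a guest OS may assume a constant-rate TSC. */
    static bool model_implies_constant_tsc(unsigned family, unsigned model)
    {
        return (family == 0xf && model >= 0x03) ||
               (family == 0x6 && model >= 0x0e);
    }

    /* Warn (or block migration) only when the guest could legitimately
     * assume a constant TSC rate and the host can't preserve it. */
    static void check_tsc_migratability(const struct vcpu_cfg *cpu,
                                        bool host_has_tsc_scaling)
    {
        if (host_has_tsc_scaling)
            return;     /* the frequency can be preserved anyway */

        if (cpu->invtsc ||
            (cpu->tsc_enabled_by_user &&
             model_implies_constant_tsc(cpu->family, cpu->model)))
            fprintf(stderr, "warning: guest may assume a constant TSC, "
                            "but the TSC frequency can change across "
                            "migration without TSC scaling\n");
    }

--
Eduardo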