Re: [Qemu-devel] [PATCHv4 3/4] cpuid: disable pv eoi for 1.1 and older compat types

"Michael S. Tsirkin" <mst@xxxxxxxxxx> · Wed, 29 Aug 2012 16:18:12 +0300

On Wed, Aug 29, 2012 at 09:56:12AM -0300, Eduardo Habkost wrote:
> On Wed, Aug 29, 2012 at 01:06:32PM +0300, Michael S. Tsirkin wrote:
> > On Tue, Aug 28, 2012 at 08:50:09PM -0300, Eduardo Habkost wrote:
> > > On Wed, Aug 29, 2012 at 01:25:35AM +0300, Michael S. Tsirkin wrote:
> > > > On Wed, Aug 29, 2012 at 01:21:13AM +0300, Michael S. Tsirkin wrote:
> > > > > On Tue, Aug 28, 2012 at 07:02:42PM -0300, Eduardo Habkost wrote:
> > > > > > On Wed, Aug 29, 2012 at 12:35:28AM +0300, Michael S. Tsirkin wrote:
> > > > > > > On Tue, Aug 28, 2012 at 04:13:38PM -0300, Eduardo Habkost wrote:
> > > > > > > > On Tue, Aug 28, 2012 at 08:43:52PM +0300, Michael S. Tsirkin wrote:
> > > > > > > > > In preparation for adding PV EOI support, disable PV EOI by default for
> > > > > > > > > 1.1 and older machine types, to avoid CPUID changing during migration.
> > > > > > > > > 
> > > > > > > > > PV EOI can still be enabled/disabled by specifying it explicitly.
> > > > > > > > >     Enable for 1.1
> > > > > > > > >     -M pc-1.1 -cpu kvm64,+kvm_pv_eoi
> > > > > > > > >     Disable for 1.2
> > > > > > > > >     -M pc-1.2 -cpu kvm64,-kvm_pv_eoi
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > What about users that are already running "qemu-1.1 -M pc-1.1" on a host
> > > > > > > > kernel that supports PV EOI already? They would get PV EOI disabled when
> > > > > > > > migrating to a destination running "qemu-1.2 -M pc-1.1".
> > > > > > > > 
> > > > > > > > (On the other hand, people running "qemu-1.1 -M pc-1.1" on a host kernel
> > > > > > > > supporting PV EOI already have migration broken, so there's not much we
> > > > > > > > can do for them)
> > > > > > > 
> > > > > > > Exactly.
> > > > > > > 
> > > > > > > Talked to Gleb, long term I think we should rework code to make
> > > > > > > it forward-compatible wrt adding new MSRs:
> > > > > > > - source gets list of MSRs to be migrated from KVM and simply sends them all
> > > > > > > - send all MSRS in key/value format
> > > > > > > - destination gets list of MSRs to be migrated from KVM and
> > > > > > >   only restores the supported ones
> > > > > > 
> > > > > > As far as I understand the migration code requirements/expectations, if
> > > > > > the origin is sending some data, it is because it is part of the
> > > > > > guest-visible machine state that must be kept while migrating. Because
> > > > > > of that, the destination is not allowed to drop anything it doesn't know
> > > > > > about.
> > > > > 
> > > > > 
> > > > > We have a ton of code that reads in values then just
> > > > > ignores them, for compat with old qemu.
> > > 
> > > The difference is that you ignore only the values you _know_ to be safe
> > > to be ignored. You can't ignore a value simply because you don't know
> > > what it means, and if it's important or not.
> > > > > This will be exactly such a case:
> > > > > we don't drop anything - protocol does not
> > > > > support this. We read and simply do not tell kvm
> > > > > about it.
> > > 
> > > This fits the definition of "dropping", to me.
> > > 
> > > > We also have tons of code that sends
> > > > > useless values again for compatibility.
> > > 
> > > But these values should be ignored only if the receiver knows exactly
> > > what it means, and knows exactly why/when it can be ignored.
> > 
> > Sorry I just do not understand these meta arguments.  Do you mean
> > the example below? If yes let us focus on it please.
> 
> We have to have these meta arguments, because the problem is about
> features/MSRs that will be introduced in the future. But okay, let's
> follow with the specific argument:
> 
> > 
> > > > > 
> > > > > > At the same time, if it's not part of guest-visible machine
> > > > > > state, it doesn't have to be sent by the migration origin.
> > > 
> > > It may be not directly visible, but if it's part of the machine state,
> > > it affects the guest in some way. If it didn't affect the guest in any
> > > way, we wouldn't even have to send it while migrating, in the first
> > > place.
> > > 
> > > > > > 
> > > > > 
> > > > > False, we often send internal device state which is not
> > > > > directly guest visible.
> > > > 
> > > > Besides, all MSRs in KVM's MSR list are actually guest visible, even if
> > > > guests normally do not invoke these MSRs - they can.
> > > 
> > > The difference is that the machine doesn't guarantee a MSR to be
> > > migrated unless the corresponding feature bit reports the MSR as
> > > supported.
> > > 
> > > e.g. using the PV EOI MSR without having the PV EOI CPUID bit set is
> > > undefined behavior. The guest shouldn't do it.
> > 
> > So the problematic scenario involves a guest that sets PV EOI when CPUID
> > bit is off on source. Why do we even care what happens on migration?
> 
> This is the case where there is _no_ problem. But the migration
> destination can't know that, how can it decide if it's safe to drop the
> MSR value or not?

I looked and it seems always safe to drop MSRs that we have.
Any counter examples.

> > 
> > > On the other hand, if the PV EOI CPUID bit is set, the MSR should be
> > > always kept, no matter what.
> > > 
> > > This means that we need to migrate the MSR only if the corresponding
> > > CPUID bit is enabled.
> > > 
> > 
> > I do not follow. Why does it mean that?
> > It seems completely safe to migrate this MSR no matter what.
> 
> It is completely safe to migrate the MSR, but it is completely unsafe to
> _ignore_ the incoming MSR value. That's the problem.

Normally CPUID will tell you if such important MSR is available.
So we can check that at destination.

> > 
> > > > 
> > > > > > On the other hand, a mode of operation that doesn't require updating
> > > > > > QEMU every time there's a new bit of guest-visible state to be migrated
> > > > > > would be nice (just like the "-cpu host" mode, that doesn't require
> > > > > > updating QEMU for every new CPU feature, is nice for some use cases). I
> > > > > > just don't know how to make work with the current migration protocol.
> > > > > > 
> > > > > 
> > > > > I don't understand. What is the problem with the proposal?
> > > > > What will not work with our protocol?
> > > > > Can you give an example please?
> > > 
> > > Yes:
> > > 
> > > Suppose kernel 3.7 introduces KVM_FOO_MSR and CPUID_KVM_FOO.
> > > 
> > > Also, suppose QEMU 1.2 doesn't know anything about KVM_FOO, because it
> > > was release before this feature was introduced.
> > > 
> > > 
> > > Then you run "qemu-1.2 -M pc-1.2" on a 3.7 host kernel. qemu-1.2 can do
> > > two things here:
> > > 
> > > 1) Not enable CPUID_KVM_FOO
> > > 
> > > In this case, the guest should not use KVM_FOO_MSR and the MSR does not
> > > need to be migrated (the guest may try to use it, but the behavior when
> > > trying to use it is undefined). Sending the MSR value when migrating
> > > would be useless.
> > > 
> > > 
> > > 2) Enable CPUID_KVM_FOO.
> > > 
> > > In this case, the guest may try to use the feature and write something
> > > into KVM_FOO_MSR. Sending the MSR value when migrating is absolutely
> > > necessary
> > > 
> > > ---
> > > 
> > > Now assume you run "qemu-1.2 -M pc-1.2" in the destination host, running
> > > the 3.6 kernel (without KVM_FOO).
> > > 
> > > Then qemu-1.2 receives the KVM_FOO_MSR data in the migration stream. In
> > > this case, qemu-1.2 simply can't decide if it's safe to drop the data
> > > (and not tell KVM about it), or not.
> > > If we simply send every MSR reported by the kernel, both the origin and
> > > destination qemu-1.2 processes can't ever know if the MSR value is
> > > important or not, because it doesn't know if it's part of the machine
> > > state that has to be kept consistent.
> > > We could introduce a mode of operation where _every_ MSR reported by KVM
> > > is important and sent by the origin (and also where every MSR must be
> > > handled by the destination, otherwise migration would fail). But this
> > > new mode would break migration compatibility between two hosts running
> > > the same machine-type, only because they are running different kernel
> > > versions. But it may be useful for some use cases, so maybe it's
> > > appropriate for a future "pc-next" machine-type (and for "-cpu host"),
> > > but not for "pc-<version>".
> > > 
> > 
> > In this example, we should migrate CPUID (don't we?).
> > Destination should validate that CPUID supplied by source
> > matches what it can support (doesn't it?).
> > 
> > If we do and it does not, it's an unrelated bug:
> > CPUID changing under guest's feet.
> 
> CPUID changing under guest's feet is another problem, that we also have
> to solve.
> But we also have the problem of migration compatibility
> between different host kernels.

So here is the solution for both: on destination pass CPUID to kvm and
it should come back unchanged.  If it changed you fail migration.

> 
> Note that I am not saying that migrating all MSRs is wrong. It _is_
> good, as long as:
> - The destination never ignores any of the incoming MSR values.

What I am saying for MSRs added in last 2 years it is OK to ignore
because CPUID check will tell you if it is supported
and fail migration.

> - We don't do that by default on "pc-<version>", or we defeat the
>   purpose of machine-types.
> 
> I'll try to enumerate the problems I am trying to address (that are
> related, but are separate problems):
> 
> - MSR not being migrated when it should:
>   - Possible solution: migrate all MSRs even if qemu doesn't know what
>     they are.
>     - Constraint: migration destination must _never_ ignore any incoming
>       MSR value, because it can't decide if it is important to the guest
>       or not (with the current KVM interfaces).
>     - Problem with this solution: if we do that by default on
>       "pc-<version>", we break migration compatibility between hosts
>       with different kernel versions.

Solution: add vcpu ioctl that tells you which MSRs to migrate
(on source), depending on CPUID.

> - Changing CPUID bits under guest's feet.
>   - Proposed solution: migrating CPUID bits, refuse migration if
>     destination doesn't support the same bits.
>     - It solves the compatibility problem for migration to a newer
>       kernel, but not to an older kernel. It helps to solve part of
>       the problem, but not all.

How does not it save all of the problem? If destination kernel
can support cpuid, then we are fine - it is new enough.

>   - Alternative solution: simply make the resulting CPUID bits not be a
>     function of the host kernel capabilities, but only of the qemu
>     configuration (except on "-cpu host" and "-M pc-next").

This perpetuates existing duplication of code between
kvm and qemu. We are better off with logic in 1 place.

> - Migration compatibility/predictability:
>   - See my example above: feature introduced in a newer kernel,
>     migration to an older kernel.

If it is enabled then migration fails.

>   - The only way I see to guarantee this is to never enable unknown
>     CPUID bits or MSRs. People who don't care about predictable
>     migration compatibility can use "-M pc-next", then.
> 

Guarantee what?
Just check dst can support all msrs and cpuid bits of src.
Way to check is to ask kvm :) Not to add logic in qemu.

> > 
> > > > > 
> > > > > 
> > > > > > > Too late for 1.2?
> > > > > > 
> > > > > > Absolutely (in my opinion).
> > > > > > 
> > > > > > > 
> > > > > > > > While we don't make the KVM feature-bit handling sane (with defaults
> > > > > > > > that are not blindly derived from the host kernel capabilities), maybe
> > > > > > > > the safest bet is to expect users to not migrate between hosts running
> > > > > > > > kernels with different KVM capabilities? (I am not sure which option is
> > > > > > > > better)
> > > > > > > 
> > > > > > > Sorry not sure what you talk about here. What has KVM feature-bit
> > > > > > > handling to do with this patchset?
> > > > > > 
> > > > > > Everything? The whole point of this patch is to filter out the PV_EOI
> > > > > > KVM feature bit.
> > > > > > 
> > > > > > 
> > > > > > This part of the current code, specifically, is wrong:
> > > > > > 
> > > > > > > > >      plus_kvm_features = ~0; /* not supported bits will be filtered out later */
> > > > > > 
> > > > > > The QEMU-side list of KVM features should be whitelist-based, not
> > > > > > blacklist-based (unless the user doesn't need migration, in that case he
> > > > > > can use "-cpu host" and get every feature blindly enabled), because QEMU
> > > > > > can't know if a new feature involves guest-visible state that has to be
> > > > > > migrated.
> > > > > > 
> > > > > > 
> > > > > > > 
> > > > > > > > 
> > > > > > > > > Signed-off-by: Michael S. Tsirkin <mst@xxxxxxxxxx>
> > > > > > > > > ---
> > > > > > > > >  hw/Makefile.objs  |  2 +-
> > > > > > > > >  hw/cpu_flags.c    | 32 ++++++++++++++++++++++++++++++++
> > > > > > > > >  hw/cpu_flags.h    |  9 +++++++++
> > > > > > > > >  hw/pc_piix.c      |  2 ++
> > > > > > > > >  target-i386/cpu.c |  8 ++++++++
> > > > > > > > >  5 files changed, 52 insertions(+), 1 deletion(-)
> > > > > > > > >  create mode 100644 hw/cpu_flags.c
> > > > > > > > >  create mode 100644 hw/cpu_flags.h
> > > > > > > > > 
> > > > > > > > > diff --git a/hw/Makefile.objs b/hw/Makefile.objs
> > > > > > > > > index 850b87b..3f2532a 100644
> > > > > > > > > --- a/hw/Makefile.objs
> > > > > > > > > +++ b/hw/Makefile.objs
> > > > > > > > > @@ -1,5 +1,5 @@
> > > > > > > > >  hw-obj-y = usb/ ide/
> > > > > > > > > -hw-obj-y += loader.o
> > > > > > > > > +hw-obj-y += loader.o cpu_flags.o
> > > > > > > > >  hw-obj-$(CONFIG_VIRTIO) += virtio-console.o
> > > > > > > > >  hw-obj-$(CONFIG_VIRTIO_PCI) += virtio-pci.o
> > > > > > > > >  hw-obj-y += fw_cfg.o
> > > > > > > > > diff --git a/hw/cpu_flags.c b/hw/cpu_flags.c
> > > > > > > > > new file mode 100644
> > > > > > > > > index 0000000..7a633c0
> > > > > > > > > --- /dev/null
> > > > > > > > > +++ b/hw/cpu_flags.c
> > > > > > > > > @@ -0,0 +1,32 @@
> > > > > > > > > +/*
> > > > > > > > > + * CPU compatibility flags.
> > > > > > > > > + *
> > > > > > > > > + * Copyright (c) 2012 Red Hat Inc.
> > > > > > > > > + * Author: Michael S. Tsirkin.
> > > > > > > > > + *
> > > > > > > > > + * This program is free software; you can redistribute it and/or modify
> > > > > > > > > + * it under the terms of the GNU General Public License as published by
> > > > > > > > > + * the Free Software Foundation; either version 2 of the License, or
> > > > > > > > > + * (at your option) any later version.
> > > > > > > > > + *
> > > > > > > > > + * This program is distributed in the hope that it will be useful,
> > > > > > > > > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > > > > > > > > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > > > > > > > > + * GNU General Public License for more details.
> > > > > > > > > + *
> > > > > > > > > + * You should have received a copy of the GNU General Public License along
> > > > > > > > > + * with this program; if not, see <http://www.gnu.org/licenses/>.
> > > > > > > > > + */
> > > > > > > > > +#include "hw/cpu_flags.h"
> > > > > > > > > +
> > > > > > > > > +static bool kvm_pv_eoi_disabled_state;
> > > > > > > > > +
> > > > > > > > > +void disable_kvm_pv_eoi(void)
> > > > > > > > > +{
> > > > > > > > > +       kvm_pv_eoi_disabled_state = true;
> > > > > > > > > +}
> > > > > > > > > +
> > > > > > > > > +bool kvm_pv_eoi_disabled(void)
> > > > > > > > > +{
> > > > > > > > > +       return kvm_pv_eoi_disabled_state;
> > > > > > > > > +}
> > > > > > > > > diff --git a/hw/cpu_flags.h b/hw/cpu_flags.h
> > > > > > > > > new file mode 100644
> > > > > > > > > index 0000000..05777b6
> > > > > > > > > --- /dev/null
> > > > > > > > > +++ b/hw/cpu_flags.h
> > > > > > > > > @@ -0,0 +1,9 @@
> > > > > > > > > +#ifndef HW_CPU_FLAGS_H
> > > > > > > > > +#define HW_CPU_FLAGS_H
> > > > > > > > > +
> > > > > > > > > +#include <stdbool.h>
> > > > > > > > > +
> > > > > > > > > +void disable_kvm_pv_eoi(void);
> > > > > > > > > +bool kvm_pv_eoi_disabled(void);
> > > > > > > > > +
> > > > > > > > > +#endif
> > > > > > > > > diff --git a/hw/pc_piix.c b/hw/pc_piix.c
> > > > > > > > > index 008d42f..bdbceda 100644
> > > > > > > > > --- a/hw/pc_piix.c
> > > > > > > > > +++ b/hw/pc_piix.c
> > > > > > > > > @@ -46,6 +46,7 @@
> > > > > > > > >  #ifdef CONFIG_XEN
> > > > > > > > >  #  include <xen/hvm/hvm_info_table.h>
> > > > > > > > >  #endif
> > > > > > > > > +#include "cpu_flags.h"
> > > > > > > > >  
> > > > > > > > >  #define MAX_IDE_BUS 2
> > > > > > > > >  
> > > > > > > > > @@ -371,6 +372,7 @@ static QEMUMachine pc_machine_v1_2 = {
> > > > > > > > >  
> > > > > > > > >  static void pc_machine_v1_1_compat(void)
> > > > > > > > >  {
> > > > > > > > > +    disable_kvm_pv_eoi();
> > > > > > > > >  }
> > > > > > > > >  
> > > > > > > > >  static void pc_init_pci_v1_1(ram_addr_t ram_size,
> > > > > > > > > diff --git a/target-i386/cpu.c b/target-i386/cpu.c
> > > > > > > > > index 120a2e3..0d02fd1 100644
> > > > > > > > > --- a/target-i386/cpu.c
> > > > > > > > > +++ b/target-i386/cpu.c
> > > > > > > > > @@ -23,6 +23,7 @@
> > > > > > > > >  
> > > > > > > > >  #include "cpu.h"
> > > > > > > > >  #include "kvm.h"
> > > > > > > > > +#include "asm/kvm_para.h"
> > > > > > > > >  
> > > > > > > > >  #include "qemu-option.h"
> > > > > > > > >  #include "qemu-config.h"
> > > > > > > > > @@ -33,6 +34,7 @@
> > > > > > > > >  #include "hyperv.h"
> > > > > > > > >  
> > > > > > > > >  #include "hw/hw.h"
> > > > > > > > > +#include "hw/cpu_flags.h"
> > > > > > > > >  
> > > > > > > > >  /* feature flags taken from "Intel Processor Identification and the CPUID
> > > > > > > > >   * Instruction" and AMD's "CPUID Specification".  In cases of disagreement
> > > > > > > > > @@ -889,6 +891,12 @@ static int cpu_x86_find_by_name(x86_def_t *x86_cpu_def, const char *cpu_model)
> > > > > > > > >  
> > > > > > > > >      plus_kvm_features = ~0; /* not supported bits will be filtered out later */
> > > > > > > > >  
> > > > > > > > > +    /* Disable PV EOI for old machine types.
> > > > > > > > > +     * Feature flags can still override. */
> > > > > > > > > +    if (kvm_pv_eoi_disabled()) {
> > > > > > > > > +        plus_kvm_features &= ~(0x1 << KVM_FEATURE_PV_EOI);
> > > > > > > > > +    }
> > > > > > > > > +
> > > > > > > > >      add_flagname_to_bitmaps("hypervisor", &plus_features,
> > > > > > > > >          &plus_ext_features, &plus_ext2_features, &plus_ext3_features,
> > > > > > > > >          &plus_kvm_features, &plus_svm_features);
> > > > > > > > > -- 
> > > > > > > > > MST
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > -- 
> > > > > > > > Eduardo
> > > > > > 
> > > > > > -- 
> > > > > > Eduardo
> > > 
> > > -- 
> > > Eduardo
> 
> -- 
> Eduardo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html