Re: [PATCH] x86: Reset MTRR on vCPU reset

Laszlo Ersek <lersek@xxxxxxxxxx> · Wed, 13 Aug 2014 22:33:32 +0200

A number of comments follow; feel free to address or ignore each as you see fit:

On 08/13/14 21:09, Alex Williamson wrote:
> The SDM specifies (June 2014 Vol3 11.11.5):
> 
>     On a hardware reset, the P6 and more recent processors clear the
>     valid flags in variable-range MTRRs and clear the E flag in the
>     IA32_MTRR_DEF_TYPE MSR to disable all MTRRs. All other bits in the
>     MTRRs are undefined.
> 
> We currently do none of that, so whatever MTRR settings you had prior
> to reset is what you have after reset.  Usually this doesn't matter
> because KVM often ignores the guest mappings and uses write-back
> anyway.  However, if you have an assigned device and an IOMMU that
> allows NoSnoop for that device, KVM defers to the guest memory
> mappings which are now stale after reset.  The result is that OVMF
> rebooting on such a configuration takes a full minute to LZMA
> decompress the EFI volume, a process that is nearly instant on the

For pedantry, instead of "EFI volume" we could say "LZMA-compressed
Firmware File System file in the FVMAIN_COMPACT firmware volume".

> initial boot.
> 
> Add support for resetting the SDM defined bits on vCPU reset.
> 
> Also, by my count we're already in danger of overflowing the entries
> array that we pass to KVM, so I've topped it up for a bit of headroom.
> 
> Signed-off-by: Alex Williamson <alex.williamson@xxxxxxxxxx>
> Cc: qemu-stable@xxxxxxxxxx
> ---
> 
>  target-i386/cpu.c |    6 ++++++
>  target-i386/cpu.h |    4 ++++
>  target-i386/kvm.c |   14 +++++++++++++-
>  3 files changed, 23 insertions(+), 1 deletion(-)
> 
> diff --git a/target-i386/cpu.c b/target-i386/cpu.c
> index 6d008ab..b5ae654 100644
> --- a/target-i386/cpu.c
> +++ b/target-i386/cpu.c
> @@ -2588,6 +2588,12 @@ static void x86_cpu_reset(CPUState *s)
>  
>      env->xcr0 = 1;
>  
> +    /* MTRR init - Clear global enable bit and valid bit in each variable reg */
> +    env->mtrr_deftype &= ~MSR_MTRRdefType_Enable;
> +    for (i = 0; i < MSR_MTRRcap_VCNT; i++) {
> +        env->mtrr_var[i].mask &= ~MSR_MTRRphysMask_Valid;
> +    }
> +

I can see that the limit, MSR_MTRRcap_VCNT, is #defined as 8. Would you
be willing to update the definition of the "CPUX86State.mtrr_var" array
in "target-i386/cpu.h" too, so that it is sized by the same macro?
Currently it says:

    MTRRVar mtrr_var[8];

>  #if !defined(CONFIG_USER_ONLY)
>      /* We hard-wire the BSP to the first CPU. */
>      if (s->cpu_index == 0) {
> diff --git a/target-i386/cpu.h b/target-i386/cpu.h
> index e634d83..139890f 100644
> --- a/target-i386/cpu.h
> +++ b/target-i386/cpu.h
> @@ -337,6 +337,8 @@
>  #define MSR_MTRRphysBase(reg)           (0x200 + 2 * (reg))
>  #define MSR_MTRRphysMask(reg)           (0x200 + 2 * (reg) + 1)
>  
> +#define MSR_MTRRphysMask_Valid (1 << 11)
> +

Note: a signed integer (int32_t).

>  #define MSR_MTRRfix64K_00000            0x250
>  #define MSR_MTRRfix16K_80000            0x258
>  #define MSR_MTRRfix16K_A0000            0x259
> @@ -353,6 +355,8 @@
>  
>  #define MSR_MTRRdefType                 0x2ff
>  
> +#define MSR_MTRRdefType_Enable (1 << 11)
> +

Note: a signed integer (int32_t).

Now, if you scroll back to the bit-clearing in x86_cpu_reset(), you see

  ~MSR_MTRRdefType_Enable

and

 ~MSR_MTRRphysMask_Valid

These expressions evaluate to negative int (int32_t) values (because the
bitwise complement sets their sign bits).

Due to two's complement (which we are allowed to assume in qemu, see
HACKING), the negative int32_t values come out just right in the next
step, when they are converted to uint64_t for the bit-ands, as part of
the usual arithmetic conversions. ("env->mtrr_deftype" and
"env->mtrr_var[i].mask" are uint64_t.) Mathematically the conversion
adds UINT64_MAX+1 to the negative value, which has the same effect as
sign extension.

In general, even though they are correct due to two's complement, I
dislike such detours into negative-valued signed integers by way of
bit-neg, because people are mostly unaware of them and assume they "just
work". My preferred solution would be

#define MSR_MTRRphysMask_Valid (1ull << 11)
#define MSR_MTRRdefType_Enable (1ull << 11)

Feel free to ignore this of course.
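
To make the conversion argument above concrete, here is a small
standalone sketch. It is not QEMU code: the VALID_* macros and the
clear_with_*() helpers are names made up for this illustration, and it
assumes a 32-bit int. The unsigned int variant is not proposed anywhere
in the patch; it is included only to show the zero-extension pitfall
that the int and ull spellings avoid.

```c
#include <assert.h>
#include <stdint.h>

/* Three hypothetical spellings of the same bit-11 flag. */
#define VALID_INT (1 << 11)     /* int, as in the patch */
#define VALID_U32 (1u << 11)    /* unsigned int, shown only as a pitfall */
#define VALID_ULL (1ull << 11)  /* unsigned long long, as suggested above */

/* ~VALID_INT is the int value -2049. Converting it to uint64_t adds 2^64,
 * giving 0xfffffffffffff7ff, so only bit 11 is cleared. */
static uint64_t clear_with_int(uint64_t msr)
{
    return msr & ~VALID_INT;
}

/* ~VALID_U32 is the unsigned int 0xfffff7ff. It is zero-extended to
 * uint64_t, so the upper 32 bits of msr are cleared as well. */
static uint64_t clear_with_u32(uint64_t msr)
{
    return msr & ~VALID_U32;
}

/* ~VALID_ULL is already a 64-bit unsigned mask; no conversion happens. */
static uint64_t clear_with_ull(uint64_t msr)
{
    return msr & ~VALID_ULL;
}

/* Checks the three behaviours described in the comments above. */
static void check_mask_conversions(void)
{
    assert(clear_with_int(UINT64_MAX) == 0xfffffffffffff7ffULL);
    assert(clear_with_ull(UINT64_MAX) == 0xfffffffffffff7ffULL);
    assert(clear_with_u32(UINT64_MAX) == 0x00000000fffff7ffULL);
}
```

The int and ull variants produce the same mask; the ull one simply gets
there without passing through a negative value.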

>  #define MSR_CORE_PERF_FIXED_CTR0        0x309
>  #define MSR_CORE_PERF_FIXED_CTR1        0x30a
>  #define MSR_CORE_PERF_FIXED_CTR2        0x30b
> diff --git a/target-i386/kvm.c b/target-i386/kvm.c
> index 097fe11..cb31338 100644
> --- a/target-i386/kvm.c
> +++ b/target-i386/kvm.c
> @@ -79,6 +79,7 @@ static int lm_capable_kernel;
>  static bool has_msr_hv_hypercall;
>  static bool has_msr_hv_vapic;
>  static bool has_msr_hv_tsc;
> +static bool has_msr_mtrr;
>  
>  static bool has_msr_architectural_pmu;
>  static uint32_t num_architectural_pmu_counters;
> @@ -739,6 +740,10 @@ int kvm_arch_init_vcpu(CPUState *cs)
>          env->kvm_xsave_buf = qemu_memalign(4096, sizeof(struct kvm_xsave));
>      }
>  
> +    if (env->features[FEAT_1_EDX] & CPUID_MTRR) {
> +        has_msr_mtrr = true;
> +    }
> +

Seems to match "MTRR Feature Identification" in my (older) copy of the SDM.

>      return 0;
>  }
>  
> @@ -1183,7 +1188,7 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
>      CPUX86State *env = &cpu->env;
>      struct {
>          struct kvm_msrs info;
> -        struct kvm_msr_entry entries[100];
> +        struct kvm_msr_entry entries[128];
>      } msr_data;
>      struct kvm_msr_entry *msrs = msr_data.entries;
>      int n = 0, i;
> @@ -1278,6 +1283,13 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
>              kvm_msr_entry_set(&msrs[n++], HV_X64_MSR_REFERENCE_TSC,
>                                env->msr_hv_tsc);
>          }
> +        if (has_msr_mtrr) {
> +            kvm_msr_entry_set(&msrs[n++], MSR_MTRRdefType, env->mtrr_deftype);
> +            for (i = 0; i < MSR_MTRRcap_VCNT; i++) {
> +                kvm_msr_entry_set(&msrs[n++],
> +                                  MSR_MTRRphysMask(i), env->mtrr_var[i].mask);
> +            }
> +        }
>  
>          /* Note: MSR_IA32_FEATURE_CONTROL is written separately, see
>           *       kvm_put_msr_feature_control. */
> 

I think that this code is correct (and sufficient for the reset
problem), but I'm uncertain if it's complete:

(a) Shouldn't you put the matching PhysBase registers as well (for the
variable range ones)?

Plus, shouldn't you put mtrr_fixed[11] too (MSR_MTRRfix64K_00000, ...)?

(b) You only modify kvm_put_msrs(). What about kvm_get_msrs()? I can see
that you make the msr putting dependent on:

    /*
     * The following MSRs have side effects on the guest or are too
     * heavy for normal writeback. Limit them to reset or full state
     * updates.
     */
    if (level >= KVM_PUT_RESET_STATE) {

But that's probably not your reason for omitting matching new code from
kvm_get_msrs(): "HV_X64_MSR_REFERENCE_TSC" is also heavy-weight (visible
in your patch's context), but that one is nevertheless handled in
kvm_get_msrs().

My only reason for (b) is simply symmetry. For example, commit 48a5f3bc
added HV_X64_MSR_REFERENCE_TSC at once to both put() and get().

According to "target-i386/machine.c", mtrr_deftype and co. are even
migrated (part of vmstate), so this asymmetry could become a problem in
migration. E.g. the source host doesn't fetch MTRR state from KVM, hence
the wire format carries garbage, but on the target you put (part of)
that garbage (right now, just the mask) back into KVM:

do_savevm()
  qemu_savevm_state()
    qemu_savevm_state_complete()
      cpu_synchronize_all_states()
        cpu_synchronize_state()
          kvm_cpu_synchronize_state()
            do_kvm_cpu_synchronize_state()
              kvm_arch_get_registers()
                kvm_get_msrs()

do_loadvm()
  load_vmstate()
    qemu_loadvm_state()
      cpu_synchronize_all_post_init()
        cpu_synchronize_post_init()
          kvm_cpu_synchronize_post_init()
            kvm_arch_put_registers(..., KVM_PUT_FULL_STATE)
              kvm_put_msrs(..., KVM_PUT_FULL_STATE)

/* state subset modified during VCPU reset */
#define KVM_PUT_RESET_STATE     2

/* full state set, modified during initialization or on vmload */
#define KVM_PUT_FULL_STATE      3

Hence I suspect (a) and (b) should be handled.
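
As a rough illustration of what "complete" could mean for (a), the
following standalone sketch collects every MTRR MSR into one list. It is
not the QEMU implementation: struct msr_entry, struct mtrr_state,
entry_set() and mtrr_collect() are hypothetical stand-ins for
struct kvm_msr_entry, CPUX86State and kvm_msr_entry_set(). The mapping of
the fixed[] slots to MSRs is an assumption, and MSR_MTRRfix4K_C0000
(0x268) follows the SDM numbering rather than anything visible in the
quoted context.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MSR_MTRRcap_VCNT       8
#define MSR_MTRRphysBase(reg)  (0x200 + 2 * (reg))
#define MSR_MTRRphysMask(reg)  (0x200 + 2 * (reg) + 1)
#define MSR_MTRRfix64K_00000   0x250
#define MSR_MTRRfix16K_80000   0x258
#define MSR_MTRRfix16K_A0000   0x259
#define MSR_MTRRfix4K_C0000    0x268   /* first of eight 4K registers */
#define MSR_MTRRdefType        0x2ff

/* defType + 11 fixed-range + (base, mask) per variable range = 28 */
#define MTRR_MSR_COUNT (1 + 11 + 2 * MSR_MTRRcap_VCNT)

/* Hypothetical stand-ins for struct kvm_msr_entry and the CPU state. */
struct msr_entry { uint32_t index; uint64_t data; };
struct mtrr_state {
    uint64_t deftype;
    uint64_t fixed[11];
    struct { uint64_t base, mask; } var[MSR_MTRRcap_VCNT];
};

static void entry_set(struct msr_entry *e, uint32_t index, uint64_t data)
{
    e->index = index;
    e->data = data;
}

/* Fills 'out' with every MTRR MSR and returns the number of entries. */
static size_t mtrr_collect(const struct mtrr_state *s,
                           struct msr_entry *out, size_t cap)
{
    size_t n = 0;
    int i;

    assert(cap >= MTRR_MSR_COUNT);   /* guard against overflowing 'out' */

    entry_set(&out[n++], MSR_MTRRdefType, s->deftype);
    entry_set(&out[n++], MSR_MTRRfix64K_00000, s->fixed[0]);
    entry_set(&out[n++], MSR_MTRRfix16K_80000, s->fixed[1]);
    entry_set(&out[n++], MSR_MTRRfix16K_A0000, s->fixed[2]);
    for (i = 0; i < 8; i++) {
        entry_set(&out[n++], MSR_MTRRfix4K_C0000 + i, s->fixed[3 + i]);
    }
    for (i = 0; i < MSR_MTRRcap_VCNT; i++) {
        entry_set(&out[n++], MSR_MTRRphysBase(i), s->var[i].base);
        entry_set(&out[n++], MSR_MTRRphysMask(i), s->var[i].mask);
    }
    return n;
}
```

Under these assumptions the full set comes to 28 entries, which is worth
keeping in mind next to the headroom added to the entries[] array. A
matching get path would request the same 28 indices, which is the
symmetry asked for in (b).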

... And then we arrive at cross-version migration, where both source and
target hosts support MTRR, and the source qemu sends unsynchronized MTRR
data (i.e. garbage) in the migration stream, which the target then passes
to KVM. I don't know if this is possible, and if so, what to do about it. :(

(BTW,

        VMSTATE_MTRR_VARS(env.mtrr_var, X86CPU, 8, 8),

should be rebased to MSR_MTRRcap_VCNT too, probably.)

Apologies about the verbiage, I just wrote down whatever crossed my
mind. I don't think I said anything overly important, but I feel unsafe
about giving my R-b until someone disproves my migration worries.
(Basically, before the patch, whatever MTRR data was in the migration
stream never reached KVM. This changes now.)

... Is the following argument valid in your opinion?

  KVM cares about guest-specified MTRR values *only* when
  kvm_arch_has_noncoherent_dma() returns true to vmx_get_mt_mask().
  Since "kvm_arch_has_noncoherent_dma() returning true" (i.e. device
  assignment) excludes migration anyway, we don't have to care about
  migration of MTRRs.

I'm confused, but that shouldn't block this patch!

Thanks,
Laszlo