Re: [PATCH] KVM: nVMX: Fix setting of CR0 and CR4 in guest mode

Gleb Natapov <gleb@xxxxxxxxxx> · Mon, 4 Mar 2013 22:24:13 +0200



On Mon, Mar 04, 2013 at 09:12:25PM +0100, Jan Kiszka wrote:
> On 2013-03-04 21:00, Gleb Natapov wrote:
> > On Mon, Mar 04, 2013 at 08:37:38PM +0100, Jan Kiszka wrote:
> >> On 2013-03-04 20:33, Gleb Natapov wrote:
> >>> On Mon, Mar 04, 2013 at 08:23:52PM +0100, Jan Kiszka wrote:
> >>>> On 2013-03-04 19:39, Gleb Natapov wrote:
> >>>>> On Mon, Mar 04, 2013 at 07:08:08PM +0100, Jan Kiszka wrote:
> >>>>>> On 2013-03-04 18:56, Gleb Natapov wrote:
> >>>>>>> On Mon, Mar 04, 2013 at 03:25:47PM +0100, Jan Kiszka wrote:
> >>>>>>>> On 2013-03-04 15:15, Gleb Natapov wrote:
> >>>>>>>>> On Mon, Mar 04, 2013 at 03:09:51PM +0100, Jan Kiszka wrote:
> >>>>>>>>>> On 2013-03-04 14:22, Gleb Natapov wrote:
> >>>>>>>>>>> On Thu, Feb 28, 2013 at 10:44:47AM +0100, Jan Kiszka wrote:
> >>>>>>>>>>>> The logic for calculating the value with which we call kvm_set_cr0/4 was
> >>>>>>>>>>>> broken (will definitely be visible with nested unrestricted guest mode
> >>>>>>>>>>>> support). Also, we performed the check regarding CR0_ALWAYSON too early
> >>>>>>>>>>>> when in guest mode.
> >>>>>>>>>>>>
> >>>>>>>>>>>> What really needs to be done on both CR0 and CR4 is to mask out L1-owned
> >>>>>>>>>>>> bits and merge them in from GUEST_CR0/4. In contrast, arch.cr0/4 and
> >>>>>>>>>>>> arch.cr0/4_guest_owned_bits contain the mangled L0+L1 state and, thus,
> >>>>>>>>>>>> are not suited as input.
> >>>>>>>>>>>>
> >>>>>>>>>>>> For both CRs, we can then apply the check against VMXON_CRx_ALWAYSON and
> >>>>>>>>>>>> refuse the update if it fails. To be fully consistent, we implement this
> >>>>>>>>>>>> check now also for CR4.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Finally, we have to set the shadow to the value L2 wanted to write
> >>>>>>>>>>>> originally.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Signed-off-by: Jan Kiszka <jan.kiszka@xxxxxxxxxxx>
> >>>>>>>>>>>> ---
> >>>>>>>>>>>>
> >>>>>>>>>>>> Found while making unrestricted guest mode working. Not sure what impact
> >>>>>>>>>>>> the bugs had on current feature level, if any.
> >>>>>>>>>>>>
> >>>>>>>>>>>> For interested folks, I've pushed my nEPT environment here:
> >>>>>>>>>>>>
> >>>>>>>>>>>>     git://git.kiszka.org/linux-kvm.git nept-hacking
> >>>>>>>>>>>>
> >>>>>>>>>>>>  arch/x86/kvm/vmx.c |   49 ++++++++++++++++++++++++++++++-------------------
> >>>>>>>>>>>>  1 files changed, 30 insertions(+), 19 deletions(-)
> >>>>>>>>>>>>
> >>>>>>>>>>>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> >>>>>>>>>>>> index 7cc566b..d1dac08 100644
> >>>>>>>>>>>> --- a/arch/x86/kvm/vmx.c
> >>>>>>>>>>>> +++ b/arch/x86/kvm/vmx.c
> >>>>>>>>>>>> @@ -4605,37 +4605,48 @@ vmx_patch_hypercall(struct kvm_vcpu *vcpu, unsigned char *hypercall)
> >>>>>>>>>>>>  /* called to set cr0 as appropriate for a mov-to-cr0 exit. */
> >>>>>>>>>>>>  static int handle_set_cr0(struct kvm_vcpu *vcpu, unsigned long val)
> >>>>>>>>>>>>  {
> >>>>>>>>>>>> -	if (to_vmx(vcpu)->nested.vmxon &&
> >>>>>>>>>>>> -	    ((val & VMXON_CR0_ALWAYSON) != VMXON_CR0_ALWAYSON))
> >>>>>>>>>>>> -		return 1;
> >>>>>>>>>>>> -
> >>>>>>>>>>>>  	if (is_guest_mode(vcpu)) {
> >>>>>>>>>>>> -		/*
> >>>>>>>>>>>> -		 * We get here when L2 changed cr0 in a way that did not change
> >>>>>>>>>>>> -		 * any of L1's shadowed bits (see nested_vmx_exit_handled_cr),
> >>>>>>>>>>>> -		 * but did change L0 shadowed bits. This can currently happen
> >>>>>>>>>>>> -		 * with the TS bit: L0 may want to leave TS on (for lazy fpu
> >>>>>>>>>>>> -		 * loading) while pretending to allow the guest to change it.
> >>>>>>>>>>>> -		 */
> >>>>>>>>>>> Can't say I understand this patch yet, but it looks like the comment is
> >>>>>>>>>>> still valid. Why have you removed it?
> >>>>>>>>>>
> >>>>>>>>>> L0 allows L1 or L2 at most to own TS, the rest is host-owned. I think
> >>>>>>>>>> the comment was always misleading.
> >>>>>>>>>>
> >>>>>>>>> I do not see how it is misleading. For everything but TS we will not get
> >>>>>>>>> here (if L1 is kvm). For TS we will get here if L1 allows L2 to change
> >>>>>>>>> it, but L0 does not.
> >>>>>>>>
> >>>>>>>> For everything *but guest-owned* we will get here, thus for most CR0
> >>>>>>>> accesses (bit-wise, not regarding frequency).
> >>>>>>>>
> >>>>>>> I do not see how. If bit is trapped by L1 we will not get here. We will
> >>>>>>> do vmexit to L1 instead. nested_vmx_exit_handled_cr() check this condition.
> >>>>>>> I am not arguing about you code (didn't grok it yet), but the comment
> >>>>>>> still make sense to me.
> >>>>>>
> >>>>>> "We get here when L2 changed cr0 in a way that did not change any of
> >>>>>> L1's shadowed bits (see nested_vmx_exit_handled_cr), but did change L0
> >>>>>> shadowed bits." That I can sign. But the rest about TS is just
> >>>>>> misleading as we trap _every_ change in L0 - except for TS under certain
> >>>>>> conditions. The old code was tested against TS only, that's what the
> >>>>>> comment witness.
> >>>>>>
> >>>>> TS is just an example of how we can get here with KVM on KVM. Obviously
> >>>>> other hypervisors may have different configuration. L2 may allow full
> >>>>> guest access to CR0 and then each CR0 write by L2 will be handled here.
> >>>>> Under what other condition "we trap _every_ change in L0 - except for
> >>>>> TS" here?
> >>>>
> >>>> On FPU activation:
> >>>>
> >>>>     cr0_guest_owned_bits = X86_CR0_TS;
> >>>>
> >>>> And on FPU deactivation:
> >>>>
> >>>>     cr0_guest_owned_bits = 0;
> >>>>
> >>> That's exactly TS case that comment explains. Note that
> >>> CR0_GUEST_HOST_MASK = ~cr0_guest_owned_bits.
> >>
> >> Again, it's the inverse of what the comment suggest: we enter
> >> handle_set_cr0 for every change on CR0 that doesn't match the shadow -
> >> except TS was given to the guest by both L1 and L0 (or TS isn't changed
> >> as well).
> > That doesn't make sense to me. I do not even sure what you are saying
> > since you do not specify what shadow is matched. From the code I see
> > that on CR0 exit to L0 from L2 we check if L2 tries to change CR0 bits
> > that L1 claims to belong to it and do #vmexit to L1 if it is:
> > 
> >    if (vmcs12->cr0_guest_host_mask & (val ^ vmcs12->cr0_read_shadow))
> >             return 1;
> > 
> > We never reach handle_set_cr0() in that case.
> > 
> > Can you provide an example with actual values for L2/L1/L0 of what you
> > are trying to say?
> 
> I already provided a concrete one: L1 clears PE/PG from its
> guest_host_mask (assuming we support unrestricted guest mode for L1), L2
> switches from real to protected mode, thus sets PE=1 while the shadow
> (set by L0) holds 0 => we end up in handle_set_cr0.
> 
So how is this "inverse of what the comment suggest"? I do not
understand your grudge against the comment. Just clarify that TS is the
one example of how we can get here if you think that it is not clear enough.
The TS part was useful to me.

--
			Gleb.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html