On 4/27/2023 9:19 PM, Huang, Kai wrote:
On Sat, 2023-04-22 at 12:43 +0800, Gao, Chao wrote:
On Sat, Apr 22, 2023 at 11:32:26AM +0800, Binbin Wu wrote:
Kai,
Thanks for your inputs.
I rephrased the changelog as follows:
LAM uses CR3.LAM_U48 (bit 62) and CR3.LAM_U57 (bit 61) to configure LAM
masking for user mode pointers.
To support LAM in KVM, CR3 validity checks and shadow paging handling need to be
modified accordingly.
== CR3 validity Check ==
When LAM is supported, CR3 LAM bits are allowed to be set and the check of CR3
needs to be modified.
It is better to describe the hardware change here:
On processors that enumerate support for LAM, CR3 register allows
LAM_U48/U57 to be set and VM entry allows LAM_U48/U57 to be set in both
GUEST_CR3 and HOST_CR3 fields.
To emulate LAM hardware behavior, KVM needs to
1. allow LAM_U48/U57 to be set in the CR3 register by guest or userspace
2. allow LAM_U48/U57 to be set in the GUEST_CR3/HOST_CR3 fields in vmcs12
Agreed. This is clearer.
Add a helper kvm_vcpu_is_legal_cr3() and use it instead of kvm_vcpu_is_legal_gpa()
to do the new CR3 checks in all existing CR3 checks as follows:
When userspace sets sregs, CR3 is checked in kvm_is_valid_sregs().
Non-nested case
- When EPT on, CR3 is fully under control of guest.
- When EPT off, CR3 is intercepted and CR3 is checked in kvm_set_cr3() during
CR3 VMExit handling.
Nested case, from L0's perspective, we care about:
- L1's CR3 register (VMCS01's GUEST_CR3), it's the same as non-nested case.
- L1's VMCS to run L2 guest (i.e. VMCS12's HOST_CR3 and VMCS12's GUEST_CR3)
Two paths related:
1. L0 emulates a VMExit from L2 to L1 using VMCS01 to reflect VMCS12
nested_vm_exit()
-> load_vmcs12_host_state()
-> nested_vmx_load_cr3() //check VMCS12's HOST_CR3
This is just a byproduct of using a unified function, i.e.,
nested_vmx_load_cr3() to load CR3 for both nested VM entry and VM exit.
LAM spec says:
VM entry checks the values of the CR3 and CR4 fields in the guest-area
and host-state area of the VMCS. In particular, the bits in these fields
that correspond to bits reserved in the corresponding register are
checked and must be 0.
It doesn't mention any check on VM exit. So, it looks to me that VM exit
doesn't do consistency checks. Then, I think there is no need to call
out this path.
But this isn't a true VMEXIT -- it is in fact a VMENTER from L0 to L1 using
VMCS01, but with an environment that allows L1 to run its VMEXIT handler just
as if it had received a VMEXIT from L2.
However I fully agree those code paths are details and shouldn't be changelog
material.
How about below changelog?
Add support to allow the guest to set two new CR3 non-address control bits so
that it can enable the new Intel CPU feature Linear Address Masking (LAM).
LAM modifies the checking that is applied to 64-bit linear addresses, allowing
software to use the untranslated address bits for metadata. For user mode
linear addresses, LAM uses two new CR3 non-address bits LAM_U48 (bit 62) and
LAM_U57 (bit 61) to configure the metadata bits for 4-level paging and 5-level
paging respectively. LAM also changes VMENTER to allow both bits to be set in
VMCS's HOST_CR3 and GUEST_CR3 to support virtualization.
When EPT is on, CR3 is not trapped by KVM and it's up to the guest to set either
of the two LAM control bits. However when EPT is off, the actual CR3 used by the
guest is generated from the shadow MMU root which is different from the CR3 that
is *set* by the guest, and KVM needs to manually apply any active control bits
to VMCS's GUEST_CR3 based on the cached CR3 *seen* by the guest.
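A minimal sketch of the shadow paging case described above, assuming the
allowed control bits are available as a mask. The function name
build_hw_cr3() and its parameters are hypothetical and are not taken from the
KVM source; only the bit positions come from this changelog.

```c
#include <stdint.h>

#define CR3_LAM_U57 (1ULL << 61) /* bit 61, per the changelog */
#define CR3_LAM_U48 (1ULL << 62) /* bit 62, per the changelog */

/*
 * Hypothetical helper: compose the value written to GUEST_CR3 when EPT is
 * off. The address part comes from the shadow MMU root; the control bits
 * come from the CR3 value the guest last wrote (the cached CR3), filtered
 * by the set of control bits the guest is allowed to use.
 */
static uint64_t build_hw_cr3(uint64_t shadow_root,
                             uint64_t cached_guest_cr3,
                             uint64_t allowed_ctrl_bits)
{
    return shadow_root | (cached_guest_cr3 & allowed_ctrl_bits);
}
```

The guest's own page table address in the cached CR3 is deliberately dropped;
only the control bits are carried over to the hardware value.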
KVM manually checks the guest's CR3 to make sure it points to a valid guest
physical address (i.e. to support smaller MAXPHYADDR in the guest). Extend this
check to allow the two LAM control bits to be set. And to make such a check
generic, introduce a new field 'cr3_ctrl_bits' to vcpu to record all feature
control bits that are allowed to be set by the guest.
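The extended check could look roughly like the sketch below. It is only an
illustration under stated assumptions: struct vcpu_sketch, rsvd_gpa_bits() and
is_legal_cr3() are hypothetical stand-ins, not the real struct kvm_vcpu or the
actual kvm_vcpu_is_legal_cr3() implementation. Only the 'cr3_ctrl_bits' field
name and the bit positions come from this changelog.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CR3_LAM_U57 (1ULL << 61) /* bit 61, per the changelog */
#define CR3_LAM_U48 (1ULL << 62) /* bit 62, per the changelog */

/* Hypothetical stand-in for the per-vCPU state used by the check. */
struct vcpu_sketch {
    unsigned int maxphyaddr; /* guest physical address width in bits */
    uint64_t cr3_ctrl_bits;  /* control bits the guest may set in CR3 */
};

/* Bits at or above the guest's physical address width are reserved. */
static uint64_t rsvd_gpa_bits(unsigned int maxphyaddr)
{
    assert(maxphyaddr > 0 && maxphyaddr < 64);
    return ~((1ULL << maxphyaddr) - 1);
}

/*
 * Strip the control bits the guest is allowed to set, then apply the
 * usual "is this a legal guest physical address" test to what is left.
 */
static bool is_legal_cr3(const struct vcpu_sketch *v, uint64_t cr3)
{
    uint64_t addr = cr3 & ~v->cr3_ctrl_bits;

    return (addr & rsvd_gpa_bits(v->maxphyaddr)) == 0;
}
```

With cr3_ctrl_bits left at 0 the sketch degenerates to the plain GPA check, so
a guest without LAM exposed still has the LAM bits rejected wherever KVM gets
to check CR3.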
In the nested case, for a guest which supports LAM, both VMCS12's HOST_CR3 and
GUEST_CR3 are allowed to have the new LAM control bits set, i.e. when L0 enters
L1 to emulate a VMEXIT from L2 to L1 or when L0 enters L2 directly. KVM also
manually checks that VMCS12's HOST_CR3 and GUEST_CR3 are valid physical
addresses. Extend this check to allow the new LAM control bits too.
Note, LAM doesn't have a global enable bit in any control register to turn
on/off LAM completely, but purely depends on hardware's CPUID to determine
whether to perform the LAM check or not. That means, when EPT is on, even when
KVM doesn't expose LAM to the guest, the guest can still set the LAM control bits
in CR3 without causing a problem. This is an unfortunate virtualization hole. KVM
could choose to intercept CR3 in this case and inject a fault, but this would hurt
performance when running a normal VM without LAM support, which is undesirable.
Just choose to let the guest do such an illegal thing, as the worst case is the
guest being killed when KVM eventually finds out about the illegal behaviour, and
the guest is the one to blame for that.
Thanks for the advice.