Kai,
Thanks for your inputs.
I rephrased the changelog as following:
LAM uses CR3.LAM_U48 (bit 62) and CR3.LAM_U57 (bit 61) to configure LAM
masking for user mode pointers.
To support LAM in KVM, CR3 validity checks and shadow paging handling
need to be
modified accordingly.
== CR3 validity Check ==
When LAM is supported, CR3 LAM bits are allowed to be set and the check
of CR3
needs to be modified.
Add a helper kvm_vcpu_is_legal_cr3() and use it instead of
kvm_vcpu_is_legal_gpa()
to do the new CR3 checks in all existing CR3 checks as following:
When userspace sets sregs, CR3 is checked in kvm_is_valid_sregs().
Non-nested case
- When EPT on, CR3 is fully under control of guest.
- When EPT off, CR3 is intercepted and CR3 is checked in kvm_set_cr3()
during
CR3 VMExit handling.
Nested case, from L0's perspective, we care about:
- L1's CR3 register (VMCS01's GUEST_CR3), it's the same as non-nested case.
- L1's VMCS to run L2 guest (i.e. VMCS12's HOST_CR3 and VMCS12's GUEST_CR3)
Two paths related:
1. L0 emulates a VMExit from L2 to L1 using VMCS01 to reflect VMCS12
nested_vm_exit()
-> load_vmcs12_host_state()
-> nested_vmx_load_cr3() //check VMCS12's HOST_CR3
2. L0 to VMENTER to L2 using VMCS02
nested_vmx_run()
-> nested_vmx_check_host_state() //check VMCS12's HOST_CR3
-> nested_vmx_enter_non_root_mode()
-> prepare_vmcs02()
-> nested_vmx_load_cr3() //check VMCS12's GUEST_CR3
Path 2 can fail to VMENTER to L2 of course, but later this should
result in
path 1.
== Shadow paging handling ==
When EPT is off, the following changes needed to handle shadow paging:
- LAM bits should be stripped to perform GFN calculation from guest PGD
when it
is CR3 (for nested EPT case, guest PGD is nested EPTP).
To be generic, extract the maximal base address mask from guest PGD
since the
validity of guest PGD has been checked already.
- Leave LAM bits in root.pgd to force a new root for a CR3+LAM combination.
It could potentially increase root cache misses and MMU reload, however,
it's very rare to turn off EPT when performance matters.
- Active CR3 LAM bits should be kept to form a shadow CR3.
To be generic, introduce a field 'cr3_ctrl_bits' in kvm_vcpu_arch to record
the bits used to control supported features related to CR3 (e.g. LAM).
- Supported control bits are set to the field after set cpuid.
- the field is used in
kvm_vcpu_is_legal_cr3() to check CR3.
kvm_get_active_cr3_ctrl_bits() to extract active control bits of CR3.
Also as a quick check for LAM feature support.
On 4/21/2023 7:43 PM, Huang, Kai wrote:
On Fri, 2023-04-21 at 14:35 +0800, Binbin Wu wrote:
On 4/13/2023 5:13 PM, Huang, Kai wrote:
On 4/13/2023 10:27 AM, Huang, Kai wrote:
On Thu, 2023-04-13 at 09:36 +0800, Binbin Wu wrote:
On 4/12/2023 7:58 PM, Huang, Kai wrote:
...
+ root_gfn = (root_pgd & __PT_BASE_ADDR_MASK) >> PAGE_SHIFT;
Or, should we explicitly mask vcpu->arch.cr3_ctrl_bits? In this
way, below
mmu_check_root() may potentially catch other invalid bits, but in
practice there should be no difference I guess.
In previous version, vcpu->arch.cr3_ctrl_bits was used as the mask.
However, Sean pointed out that the return value of
mmu->get_guest_pgd(vcpu) could be
EPTP for nested case, so it is not rational to mask to CR3 bit(s) from EPTP.
Yes, although EPTP's high bits don't contain any control bits.
But perhaps we want to make it future-proof in case some more control
bits are added to EPTP too.
Since the guest pgd has been check for valadity, for both CR3 and
EPTP, it is safe to mask off non-address bits to get GFN.
Maybe I should add this CR3 VS. EPTP part to the changelog to make it
more undertandable.
This isn't necessary, and can/should be done in comments if needed.
But IMHO you may want to add more material to explain how nested cases
are handled.
Do you mean about CR3 or others?
This patch is about CR3, so CR3.
For nested case, I plan to add the following in the changelog:
For nested guest:
- If CR3 is intercepted, after CR3 handled in L1, CR3 will be
checked in
nested_vmx_load_cr3() before returning back to L2.
- For the shadow paging case (SPT02), LAM bits are also be handled
to form
a new shadow CR3 in vmx_load_mmu_pgd().
I don't know a lot of code detail of KVM nested code, but in general, since your
code only changes nested_vmx_load_cr3() and nested_vmx_check_host_state(), the
changelog should focus on explaining why modifying these two functions are good
enough.
And to explain this, I think we should explain from hardware's perspective
rather than from shadow paging's perspective.
From L0's perspective, we care about:
1) L1's CR3 register (VMCS01's GUEST_CR3)
2) L1's VMCS to run L2 guest
2.1) VMCS12's HOST_CR3
2.2) VMCS12's GUEST_CR3
For 1) the current changelog has explained (that we need to catch _active_
control bits in guest's CR3 etc). For nested (case 2)), L1 can choose to
intercept CR3 or not. But this isn't the point because from hardware's point of
view we actually care about below two cases instead:
1) L0 to emulate a VMExit from L2 to L1 by VMENTER using VMCS01
to reflect VMCS12
2) L0 to VMENTER to L2 using VMCS02 directly
The case 2) can fail due to fail to VMENTER to L2 of course but this should
result in L0 to VMENTER to L1 with a emulated VMEXIT from L2 to L1 which is the
case 1).
For case 1) we need new code to check VMCS12's HOST_CR3 against guest's _active_
CR3 feature control bits. The key code path is:
vmx_handle_exit()
-> prepare_vmcs12()
-> load_vmcs12_host_state().
For case 2) I _think_ we need new code to check both VMCS12's HOST_CR3 and
GUEST_CR3 against active control bits. The key code path is
nested_vmx_run() ->
-> nested_vmx_check_host_state()
-> nested_vmx_enter_non_root_mode()
-> prepare_vmcs02_early()
-> prepare_vmcs02()
Since nested_vmx_load_cr3() is used in both VMENTER using VMCS12's HOST_CR3
(VMEXIT to L1) and GUEST_CR3 (VMENTER to L2), and it currently already checks
kvm_vcpu_is_illegal_gpa(vcpu, cr3), changing it to additionally check guest CR3
active control bits seems just enough.
Also, nested_vmx_check_host_state() (i.e. it is called in nested_vmx_run() to
return early if any HOST state is wrong) currently checks
kvm_vcpu_is_illegal_gpa(vcpu, cr3) too so we should also change it.
That being said, I do find it's not easy to come up with a "concise" changelog
to cover both non-nested and (especially) nested cases, but it seems we can
abstract some from my above typing?
My suggestion is to focus on the hardware behaviour's perspective as typed
above. But I believe Sean can easily make a "concise" changelog if he wants to
comment here :)