On Sat, Apr 22, 2023 at 11:32:26AM +0800, Binbin Wu wrote: >Kai, > >Thanks for your inputs. > >I rephrased the changelog as following: > > >LAM uses CR3.LAM_U48 (bit 62) and CR3.LAM_U57 (bit 61) to configure LAM >masking for user mode pointers. > >To support LAM in KVM, CR3 validity checks and shadow paging handling need to >be >modified accordingly. > >== CR3 validity Check == >When LAM is supported, CR3 LAM bits are allowed to be set and the check of >CR3 >needs to be modified. it is better to describe the hardware change here: On processors that enumerate support for LAM, CR3 register allows LAM_U48/U57 to be set and VM entry allows LAM_U48/U57 to be set in both GUEST_CR3 and HOST_CR3 fields. To emulate LAM hardware behavior, KVM needs to 1. allow LAM_U48/U57 to be set to the CR3 register by guest or userspace 2. allow LAM_U48/U57 to be set to the GUES_CR3/HOST_CR3 fields in vmcs12 >Add a helper kvm_vcpu_is_legal_cr3() and use it instead of >kvm_vcpu_is_legal_gpa() >to do the new CR3 checks in all existing CR3 checks as following: >When userspace sets sregs, CR3 is checked in kvm_is_valid_sregs(). >Non-nested case >- When EPT on, CR3 is fully under control of guest. >- When EPT off, CR3 is intercepted and CR3 is checked in kvm_set_cr3() during > CR3 VMExit handling. >Nested case, from L0's perspective, we care about: >- L1's CR3 register (VMCS01's GUEST_CR3), it's the same as non-nested case. >- L1's VMCS to run L2 guest (i.e. VMCS12's HOST_CR3 and VMCS12's GUEST_CR3) > Two paths related: > 1. L0 emulates a VMExit from L2 to L1 using VMCS01 to reflect VMCS12 > nested_vm_exit() > -> load_vmcs12_host_state() > -> nested_vmx_load_cr3() //check VMCS12's HOST_CR3 This is just a byproduct of using a unified function, i.e., nested_vmx_load_cr3() to load CR3 for both nested VM entry and VM exit. LAM spec says: VM entry checks the values of the CR3 and CR4 fields in the guest-area and host-state area of the VMCS. In particular, the bits in these fields that correspond to bits reserved in the corresponding register are checked and must be 0. It doesn't mention any check on VM exit. So, it looks to me that VM exit doesn't do consistency checks. Then, I think there is no need to call out this path. > 2. L0 to VMENTER to L2 using VMCS02 > nested_vmx_run() > -> nested_vmx_check_host_state() //check VMCS12's HOST_CR3 > -> nested_vmx_enter_non_root_mode() > -> prepare_vmcs02() > -> nested_vmx_load_cr3() //check VMCS12's GUEST_CR3 > Path 2 can fail to VMENTER to L2 of course, but later this should result in > path 1. > >== Shadow paging handling == >When EPT is off, the following changes needed to handle shadow paging: >- LAM bits should be stripped to perform GFN calculation from guest PGD when >it > is CR3 (for nested EPT case, guest PGD is nested EPTP). > To be generic, extract the maximal base address mask from guest PGD since >the > validity of guest PGD has been checked already. >- Leave LAM bits in root.pgd to force a new root for a CR3+LAM combination. > It could potentially increase root cache misses and MMU reload, however, > it's very rare to turn off EPT when performance matters. >- Active CR3 LAM bits should be kept to form a shadow CR3. > >To be generic, introduce a field 'cr3_ctrl_bits' in kvm_vcpu_arch to record >the bits used to control supported features related to CR3 (e.g. LAM). >- Supported control bits are set to the field after set cpuid. >- the field is used in > kvm_vcpu_is_legal_cr3() to check CR3. > kvm_get_active_cr3_ctrl_bits() to extract active control bits of CR3. > Also as a quick check for LAM feature support. > >On 4/21/2023 7:43 PM, Huang, Kai wrote: >> On Fri, 2023-04-21 at 14:35 +0800, Binbin Wu wrote: >> > On 4/13/2023 5:13 PM, Huang, Kai wrote: >> > > > On 4/13/2023 10:27 AM, Huang, Kai wrote: >> > > > > On Thu, 2023-04-13 at 09:36 +0800, Binbin Wu wrote: >> > > > > > On 4/12/2023 7:58 PM, Huang, Kai wrote: >> > > > > > >> > > > ... >> > > > > > > > + root_gfn = (root_pgd & __PT_BASE_ADDR_MASK) >> PAGE_SHIFT; >> > > > > > > Or, should we explicitly mask vcpu->arch.cr3_ctrl_bits? In this >> > > > > > > way, below >> > > > > > > mmu_check_root() may potentially catch other invalid bits, but in >> > > > > > > practice there should be no difference I guess. >> > > > > > In previous version, vcpu->arch.cr3_ctrl_bits was used as the mask. >> > > > > > >> > > > > > However, Sean pointed out that the return value of >> > > > > > mmu->get_guest_pgd(vcpu) could be >> > > > > > EPTP for nested case, so it is not rational to mask to CR3 bit(s) from EPTP. >> > > > > Yes, although EPTP's high bits don't contain any control bits. >> > > > > >> > > > > But perhaps we want to make it future-proof in case some more control >> > > > > bits are added to EPTP too. >> > > > > >> > > > > > Since the guest pgd has been check for valadity, for both CR3 and >> > > > > > EPTP, it is safe to mask off non-address bits to get GFN. >> > > > > > >> > > > > > Maybe I should add this CR3 VS. EPTP part to the changelog to make it >> > > > > > more undertandable. >> > > > > This isn't necessary, and can/should be done in comments if needed. >> > > > > >> > > > > But IMHO you may want to add more material to explain how nested cases >> > > > > are handled. >> > > > Do you mean about CR3 or others? >> > > > >> > > This patch is about CR3, so CR3. >> > For nested case, I plan to add the following in the changelog: >> > >> > For nested guest: >> > - If CR3 is intercepted, after CR3 handled in L1, CR3 will be >> > checked in >> > nested_vmx_load_cr3() before returning back to L2. >> > - For the shadow paging case (SPT02), LAM bits are also be handled >> > to form >> > a new shadow CR3 in vmx_load_mmu_pgd(). >> > >> > >> I don't know a lot of code detail of KVM nested code, but in general, since your >> code only changes nested_vmx_load_cr3() and nested_vmx_check_host_state(), the >> changelog should focus on explaining why modifying these two functions are good >> enough. >> >> And to explain this, I think we should explain from hardware's perspective >> rather than from shadow paging's perspective. >> >> From L0's perspective, we care about: >> >> 1) L1's CR3 register (VMCS01's GUEST_CR3) >> 2) L1's VMCS to run L2 guest >> 2.1) VMCS12's HOST_CR3 >> 2.2) VMCS12's GUEST_CR3 >> >> For 1) the current changelog has explained (that we need to catch _active_ >> control bits in guest's CR3 etc). For nested (case 2)), L1 can choose to >> intercept CR3 or not. But this isn't the point because from hardware's point of >> view we actually care about below two cases instead: >> >> 1) L0 to emulate a VMExit from L2 to L1 by VMENTER using VMCS01 >> to reflect VMCS12 >> 2) L0 to VMENTER to L2 using VMCS02 directly >> >> The case 2) can fail due to fail to VMENTER to L2 of course but this should >> result in L0 to VMENTER to L1 with a emulated VMEXIT from L2 to L1 which is the >> case 1). >> >> For case 1) we need new code to check VMCS12's HOST_CR3 against guest's _active_ >> CR3 feature control bits. The key code path is: >> >> vmx_handle_exit() >> -> prepare_vmcs12() >> -> load_vmcs12_host_state(). >> >> For case 2) I _think_ we need new code to check both VMCS12's HOST_CR3 and >> GUEST_CR3 against active control bits. The key code path is >> >> nested_vmx_run() -> >> -> nested_vmx_check_host_state() >> -> nested_vmx_enter_non_root_mode() >> -> prepare_vmcs02_early() >> -> prepare_vmcs02() >> >> Since nested_vmx_load_cr3() is used in both VMENTER using VMCS12's HOST_CR3 >> (VMEXIT to L1) and GUEST_CR3 (VMENTER to L2), and it currently already checks >> kvm_vcpu_is_illegal_gpa(vcpu, cr3), changing it to additionally check guest CR3 >> active control bits seems just enough. >> >> Also, nested_vmx_check_host_state() (i.e. it is called in nested_vmx_run() to >> return early if any HOST state is wrong) currently checks >> kvm_vcpu_is_illegal_gpa(vcpu, cr3) too so we should also change it. >> >> That being said, I do find it's not easy to come up with a "concise" changelog >> to cover both non-nested and (especially) nested cases, but it seems we can >> abstract some from my above typing? >> >> My suggestion is to focus on the hardware behaviour's perspective as typed >> above. But I believe Sean can easily make a "concise" changelog if he wants to >> comment here :)