On Fri, Apr 21, 2023 at 07:43:52PM +0800, Huang, Kai wrote: >On Fri, 2023-04-21 at 14:35 +0800, Binbin Wu wrote: >> On 4/13/2023 5:13 PM, Huang, Kai wrote: >> > > On 4/13/2023 10:27 AM, Huang, Kai wrote: >> > > > On Thu, 2023-04-13 at 09:36 +0800, Binbin Wu wrote: >> > > > > On 4/12/2023 7:58 PM, Huang, Kai wrote: >> > > > > >> > > ... >> > > > > > > + root_gfn = (root_pgd & __PT_BASE_ADDR_MASK) >> PAGE_SHIFT; >> > > > > > Or, should we explicitly mask vcpu->arch.cr3_ctrl_bits? In this >> > > > > > way, below >> > > > > > mmu_check_root() may potentially catch other invalid bits, but in >> > > > > > practice there should be no difference I guess. >> > > > > In previous version, vcpu->arch.cr3_ctrl_bits was used as the mask. >> > > > > >> > > > > However, Sean pointed out that the return value of >> > > > > mmu->get_guest_pgd(vcpu) could be >> > > > > EPTP for nested case, so it is not rational to mask to CR3 bit(s) from EPTP. >> > > > Yes, although EPTP's high bits don't contain any control bits. >> > > > >> > > > But perhaps we want to make it future-proof in case some more control >> > > > bits are added to EPTP too. >> > > > >> > > > > Since the guest pgd has been check for valadity, for both CR3 and >> > > > > EPTP, it is safe to mask off non-address bits to get GFN. >> > > > > >> > > > > Maybe I should add this CR3 VS. EPTP part to the changelog to make it >> > > > > more undertandable. >> > > > This isn't necessary, and can/should be done in comments if needed. >> > > > >> > > > But IMHO you may want to add more material to explain how nested cases >> > > > are handled. >> > > Do you mean about CR3 or others? >> > > >> > This patch is about CR3, so CR3. >> >> For nested case, I plan to add the following in the changelog: >> >> For nested guest: >> - If CR3 is intercepted, after CR3 handled in L1, CR3 will be >> checked in >> nested_vmx_load_cr3() before returning back to L2. >> - For the shadow paging case (SPT02), LAM bits are also be handled >> to form >> a new shadow CR3 in vmx_load_mmu_pgd(). >> >> > >I don't know a lot of code detail of KVM nested code, but in general, since your >code only changes nested_vmx_load_cr3() and nested_vmx_check_host_state(), the >changelog should focus on explaining why modifying these two functions are good >enough. the patch relaxes all existing checks on CR3, allowing previously reserved bits to be set. It should be enough otherwise some checks on CR3 are missing in the first place. > >And to explain this, I think we should explain from hardware's perspective >rather than from shadow paging's perspective. > >From L0's perspective, we care about: > > 1) L1's CR3 register (VMCS01's GUEST_CR3) > 2) L1's VMCS to run L2 guest > 2.1) VMCS12's HOST_CR3 > 2.2) VMCS12's GUEST_CR3 > >For 1) the current changelog has explained (that we need to catch _active_ >control bits in guest's CR3 etc). For nested (case 2)), L1 can choose to >intercept CR3 or not. But this isn't the point because from hardware's point of >view we actually care about below two cases instead: > > 1) L0 to emulate a VMExit from L2 to L1 by VMENTER using VMCS01 > to reflect VMCS12 > 2) L0 to VMENTER to L2 using VMCS02 directly > >The case 2) can fail due to fail to VMENTER to L2 of course but this should >result in L0 to VMENTER to L1 with a emulated VMEXIT from L2 to L1 which is the >case 1). > >For case 1) we need new code to check VMCS12's HOST_CR3 against guest's _active_ >CR3 feature control bits. The key code path is: > > vmx_handle_exit() > -> prepare_vmcs12() > -> load_vmcs12_host_state(). > >For case 2) I _think_ we need new code to check both VMCS12's HOST_CR3 and >GUEST_CR3 against active control bits. The key code path is ... > > nested_vmx_run() -> > -> nested_vmx_check_host_state() > -> nested_vmx_enter_non_root_mode() > -> prepare_vmcs02_early() > -> prepare_vmcs02() > >Since nested_vmx_load_cr3() is used in both VMENTER using VMCS12's HOST_CR3 >(VMEXIT to L1) and GUEST_CR3 (VMENTER to L2), and it currently already checks >kvm_vcpu_is_illegal_gpa(vcpu, cr3), changing it to additionally check guest CR3 >active control bits seems just enough. IMO, current KVM relies on hardware to do consistency check on the GUEST_CR3 during VM entry. > >Also, nested_vmx_check_host_state() (i.e. it is called in nested_vmx_run() to >return early if any HOST state is wrong) currently checks >kvm_vcpu_is_illegal_gpa(vcpu, cr3) too so we should also change it. > >That being said, I do find it's not easy to come up with a "concise" changelog >to cover both non-nested and (especially) nested cases, but it seems we can >abstract some from my above typing? > >My suggestion is to focus on the hardware behaviour's perspective as typed >above. But I believe Sean can easily make a "concise" changelog if he wants to >comment here :)