On Tue, May 17, 2022 at 9:02 AM Lai Jiangshan <jiangshanlai@xxxxxxxxx> wrote: > > On Tue, May 17, 2022 at 4:45 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote: > > > > On Fri, Apr 15, 2022, Lai Jiangshan wrote: > > > From: Lai Jiangshan <jiangshan.ljs@xxxxxxxxxxxx> > > > > > > When NPT enabled L1 is PAE paging, vcpu->arch.mmu->get_pdptrs() which > > > is nested_svm_get_tdp_pdptr() reads the guest NPT's PDPTE from memroy > > > unconditionally for each call. > > > > > > The guest PAE root page is not write-protected. > > > > > > The mmu->get_pdptrs() in FNAME(walk_addr_generic) might get different > > > values every time or it is different from the return value of > > > mmu->get_pdptrs() in mmu_alloc_shadow_roots(). > > > > > > And it will cause FNAME(fetch) installs the spte in a wrong sp > > > or links a sp to a wrong parent since FNAME(gpte_changed) can't > > > check these kind of changes. > > > > > > Cache the PDPTEs and the problem is resolved. The guest is responsible > > > to info the host if its PAE root page is updated which will cause > > > nested vmexit and the host updates the cache when next nested run. > > > > Hmm, no, the guest is responsible for invalidating translations that can be > > cached in the TLB, but the guest is not responsible for a full reload of PDPTEs. > > Per the APM, the PDPTEs can be cached like regular PTEs: > > > > Under SVM, however, when the processor is in guest mode with PAE enabled, the > > guest PDPT entries are not cached or validated at this point, but instead are > > loaded and checked on demand in the normal course of address translation, just > > like page directory and page table entries. Any reserved bit violations ared > > etected at the point of use, and result in a page-fault (#PF) exception rather > > than a general-protection (#GP) exception. > > > > So if L1 modifies a PDPTE from !PRESENT (or RESERVED) to PRESENT (and valid), then > > any active L2 vCPUs should recognize the new PDPTE without a nested VM-Exit because > > the old entry can't have been cached in the TLB. > > In this case, it is still !PRESENT in the shadow page, and it will cause > a vmexit when L2 tries to use the translation. I can't see anything wrong > in TLB or vTLB(shadow pages). > > But I think some code is needed to reload the cached PDPTEs > when guest_mmu->get_pdptrs() returns !PRESENT and reload mmu if > the cache is changed. (and add code to avoid infinite loop) > > The patch fails to reload mmu if the cache is changed which > leaves the problem described in the changelog partial resolved > only. > > Maybe we need to add mmu parameter back in load_pdptrs() for it. > It is still too complicated, I will try to do a direct check in FNAME(fetch) instead of (re-)using the cache. > > > > In practice, snapshotting at nested VMRUN would likely work, but architecturally > > it's wrong and could cause problems if L1+L2 are engange in paravirt shenanigans, > > e.g. async #PF comes to mind. > > > > I believe the correct way to fix this is to write-protect nNPT PDPTEs like all other > > shadow pages, which shouldn't be too awful to do as part of your series to route > > PDPTEs through kvm_mmu_get_page(). > > In the one-off special shadow page (will be renamed to one-off local > shadow page) > patchsets, PAE PDPTEs is not write-protected. Wirte-protecting it causing nasty > code.