On Thu, Jun 23, 2011 at 11:19:26AM +0800, Xiao Guangrong wrote: > Marcelo, > > Thanks for your review! > > On 06/23/2011 05:59 AM, Marcelo Tosatti wrote: > > >> static int is_large_pte(u64 pte) > >> @@ -2123,6 +2158,9 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep, > >> u64 spte, entry = *sptep; > >> int ret = 0; > >> > >> + if (set_mmio_spte(sptep, gfn, pfn, pte_access)) > >> + return 0; > >> + > > > > > > Should zap all mmio sptes when creating new memory slots (current qemu > > never exhibits that pattern, but its not an unsupported scenario). > > Easier to zap all mmu in that case. > > > > OK, will do it in the next version. > > > For shadow you need to remove mmio sptes on invlpg and all mmio sptes > > under CR3 (mmu_sync_roots), other than clearing the gva/gpa variables. > > > > Oh, kvm_mmu_pte_write and invlpg do not zap mmio spte, i will > fix these, thanks for your point it out. > > For mmu_sync_roots, we properly do it in FNAME(sync_page) in this patch: > > static bool sync_mmio_spte(u64 *sptep, gfn_t gfn, unsigned access, > int *nr_present) > { > if (unlikely(is_mmio_spte(*sptep))) { > if (gfn != get_mmio_spte_gfn(*sptep)) { > mmu_spte_clear_no_track(sptep); > return true; > } > > (*nr_present)++; > mark_mmio_spte(sptep, gfn, access); > return true; > } > > return false; > } > > > Otherwise you can move the mmio info from an mmio spte back to > > mmio_gva/mmio_gfn after a TLB flush, without rereading the guest > > pagetable. > > > > We do not read the guest page table when mmio page fault occurred, > we just do it as you say: > - Firstly, walk the shadow page table to get the mmio spte > - Then, cache the mmio spte info to mmio_gva/mmio_gfn > > I missed your meaning? I meant that guest pagetables must be read again on CR3 switch or invlpg (GVA->GPA can change in that case), which only happens if mmio sptes are zapped. > >> + struct kvm_shadow_walk_iterator iterator; > >> + u64 spte, ret = 0ull; > >> + > >> + walk_shadow_page_lockless_begin(vcpu); > >> + for_each_shadow_entry_lockless(vcpu, addr, iterator, spte) { > >> + if (iterator.level == 1) { > >> + ret = spte; > >> + break; > >> + } > > > > Have you verified it is safe to walk sptes with parallel modifications > > to them? Other than shadow page deletion which is in theory covered by > > RCU. It would be good to have all cases written down. > > > > Um, i think it is safe, when spte is being write, we can get the old spte > or the new spte, both cases are ok for us: it is just like the page structure > cache on the real machine, the cpu can fetch the old spte even if the page > structure is changed by other cpus. > >> + > >> + if (!is_shadow_present_pte(spte)) > >> + break; > >> + } > >> + walk_shadow_page_lockless_end(vcpu); > >> + > >> + return ret; > >> +} > >> + > >> +/* > >> + * If it is a real mmio page fault, return 1 and emulat the instruction > >> + * directly, return 0 if it needs page fault path to fix it, -1 is > >> + * returned if bug is detected. > >> + */ > >> +int handle_mmio_page_fault_common(struct kvm_vcpu *vcpu, u64 addr, bool direct) > >> +{ > >> + u64 spte; > >> + > >> + if (quickly_check_mmio_pf(vcpu, addr, direct)) > >> + return 1; > >> + > >> + spte = walk_shadow_page_get_mmio_spte(vcpu, addr); > >> + > >> + if (is_mmio_spte(spte)) { > >> + gfn_t gfn = get_mmio_spte_gfn(spte); > >> + unsigned access = get_mmio_spte_access(spte); > >> + > >> + if (direct) > >> + addr = 0; > >> + vcpu_cache_mmio_info(vcpu, addr, gfn, access); > >> + return 1; > >> + } > >> + > >> + /* > >> + * It's ok if the gva is remapped by other cpus on shadow guest, > >> + * it's a BUG if the gfn is not a mmio page. > >> + */ > > > > If the gva is remapped there should be no mmio page fault in the first > > place. > > > > For example, as follow case: > > VCPU 0 VCPU 1 > gva is mapped to a mmio region > access gva > remap gva to other page > emulate mmio access by kvm > (*the point we are discussing*) > flush all tlb > > In theory, it can happen. > > >> + if (direct && is_shadow_present_pte(spte)) > >> + return -1; > > > > An spte does not have to contain the present bit to generate a valid EPT > > misconfiguration (and an spte dump is still required in that case). > > Use !is_mmio_spte() instead. > > > > We can not use !is_mmio_spte() here, since the shadow page can be zapped anytime, > for example: > > sp.spt[i] = mmio-spte > > VCPU 0 VCPU 1 > Access sp.spte[i], ept misconfig is occurred > delete sp > (if the number of shadow page is out of the limit > or page shrink is required, and other events...) > > Walk shadow page out of the lock and get the > non-present spte > (*the point we are discussing*) Then is_mmio_spte(non-present spte) == false, right? Point is that it only sptes with precise mmio spte pattern should be considered mmio sptes, otherwise consider a genuine EPT misconfiguration error (which must be reported). What about using fault code instead of spte as Avi suggested instead? > So, the bug we can detect is: it is the mmio access but the spte is point to the normal > page. > > > > >> + > >> + /* > >> + * If the page table is zapped by other cpus, let the page > >> + * fault path to fix it. > >> + */ > >> + return 0; > >> +} > > > > I don't understand when would this happen, can you please explain? > > > > The case is above :-) No need to jump to page fault handler, can let CPU fault again on non present spte. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html