On Thu, Aug 09, 2012 at 03:16:12PM -0300, Marcelo Tosatti wrote:
> The !memslot->npages case is handled in __kvm_set_memory_region
> (please read that part, before kvm_arch_prepare_memory_region() call).
>
> kvm_arch_flush_shadow should be implemented.

Book3S HV doesn't have shadow page tables per se; rather, the hardware
page table is under the control of the hypervisor (i.e. KVM), and
entries are added and removed by the guest using hypercalls.  On recent
machines (POWER7) the hypervisor can choose whether or not to have the
hardware PTE point to a real page of memory; if it doesn't, access by
the guest will trap to the hypervisor.  On older machines (PPC970) we
don't have that flexibility, and we have to provide a real page of
memory (i.e. RAM or I/O) behind every hardware PTE.  (This is because
PPC970 provides no way for page faults in the guest to go to the
hypervisor.)

I could implement kvm_arch_flush_shadow to remove the backing pages
behind every hardware PTE, but that would be very slow and inefficient
on POWER7, and would break the guest on PPC970, particularly in the
case where userspace is removing a small memory slot containing some
I/O device and leaving the memory slot for system RAM untouched.  So
the reason for unmapping the hardware PTEs in
kvm_arch_prepare_memory_region rather than in kvm_arch_flush_shadow is
that that way we know which memslot is going away.

What exactly are the semantics of kvm_arch_flush_shadow?  I presume
that on x86 with NPT/EPT it basically does nothing - is that right?

> > +	if (old->npages) {
> > +		/* modifying guest_phys or flags */
> > +		if (old->base_gfn != memslot->base_gfn)
> > +			kvmppc_unmap_memslot(kvm, old);
>
> This case is also handled generically by the last kvm_arch_flush_shadow
> call in __kvm_set_memory_region.

Again, to use this we would need to know which memslot we're flushing.
If we could change __kvm_set_memory_region to pass the memslot for
these kvm_arch_flush_shadow calls, then I could do as you suggest.
(Though I would need to think carefully about what would happen with
guest invalidations of hardware PTEs in the interval between the
rcu_assign_pointer(kvm->memslots, slots) and the kvm_arch_flush_shadow,
and whether the invalidation would find the correct location in the
rmap array, given that we have updated the base_gfn in the memslot
without first getting rid of any references to those pages in the
hardware page table.)

> > +		if (memslot->dirty_bitmap &&
> > +		    old->dirty_bitmap != memslot->dirty_bitmap)
> > +			kvmppc_hv_get_dirty_log(kvm, old);
> > +		return 0;
> > +	}
>
> Better clear dirty log unconditionally on kvm_arch_commit_memory_region,
> similarly to x86 (just so it's consistent).

OK.

> > --- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
> > +++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
> > @@ -81,7 +81,7 @@ static void remove_revmap_chain(struct kvm *kvm, long pte_index,
> >  		ptel = rev->guest_rpte |= rcbits;
> >  		gfn = hpte_rpn(ptel, hpte_page_size(hpte_v, ptel));
> >  		memslot = __gfn_to_memslot(kvm_memslots(kvm), gfn);
> > -		if (!memslot || (memslot->flags & KVM_MEMSLOT_INVALID))
> > +		if (!memslot)
> >  			return;
>
> Why remove this check?  (i don't know why it was there in the first
> place, just checking).

This is where we are removing the page backing a hardware PTE and thus
removing the hardware PTE from the reverse-mapping list for the page.
We want to be able to do that properly even if the memslot is in the
process of going away.
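
To make the distinction concrete, here is a sketch (not code from the
patch; the sketch_* helpers are invented for illustration): a path that
creates a new reference to a memslot page should honour
KVM_MEMSLOT_INVALID, whereas a teardown path like remove_revmap_chain()
only needs the slot to still be findable.

/*
 * Sketch only: the sketch_* names are invented for illustration.
 * An insertion path creates a new reference, so it must honour
 * KVM_MEMSLOT_INVALID and refuse to touch a slot that is going away.
 */
static long sketch_insert_hpte(struct kvm *kvm, unsigned long gfn)
{
	struct kvm_memory_slot *memslot;

	memslot = __gfn_to_memslot(kvm_memslots(kvm), gfn);
	if (!memslot || (memslot->flags & KVM_MEMSLOT_INVALID))
		return H_PARAMETER;	/* create no new references */
	/* ... set up the HPTE and add it to the rmap chain ... */
	return H_SUCCESS;
}

/*
 * A removal path is dropping an existing reference; that has to keep
 * working while the slot is INVALID but not yet gone, hence only the
 * NULL check survives.
 */
static void sketch_remove_hpte(struct kvm *kvm, unsigned long gfn)
{
	struct kvm_memory_slot *memslot;

	memslot = __gfn_to_memslot(kvm_memslots(kvm), gfn);
	if (!memslot)
		return;
	/* ... unlink the HPTE from the memslot's rmap array ... */
}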
I had the flags check in there originally because other places that
used a memslot had that check.  But when I read __kvm_set_memory_region
more carefully, I realized that the KVM_MEMSLOT_INVALID flag indicates
that we should not create any more references to pages in the memslot;
we do still need to be able to handle references going away, i.e. pages
in the memslot getting unmapped.

Paul.