On Fri, Aug 10, 2012 at 10:34:39AM +1000, Paul Mackerras wrote:
> On Thu, Aug 09, 2012 at 03:16:12PM -0300, Marcelo Tosatti wrote:
>
> > The !memslot->npages case is handled in __kvm_set_memory_region
> > (please read that part, before the kvm_arch_prepare_memory_region() call).
> >
> > kvm_arch_flush_shadow should be implemented.
>
> Book3S HV doesn't have shadow page tables per se; rather, the hardware
> page table is under the control of the hypervisor (i.e. KVM), and
> entries are added and removed by the guest using hypercalls.  On
> recent machines (POWER7) the hypervisor can choose whether or not to
> have the hardware PTE point to a real page of memory; if it doesn't,
> access by the guest will trap to the hypervisor.  On older machines
> (PPC970) we don't have that flexibility, and we have to provide a real
> page of memory (i.e. RAM or I/O) behind every hardware PTE.  (This is
> because PPC970 provides no way for page faults in the guest to go to
> the hypervisor.)
>
> I could implement kvm_arch_flush_shadow to remove the backing pages
> behind every hardware PTE, but that would be very slow and inefficient
> on POWER7, and would break the guest on PPC970, particularly in the
> case where userspace is removing a small memory slot containing some
> I/O device and leaving the memory slot for system RAM untouched.
>
> So the reason for unmapping the hardware PTEs in
> kvm_arch_prepare_memory_region rather than kvm_arch_flush_shadow is
> that that way we know which memslot is going away.
>
> What exactly are the semantics of kvm_arch_flush_shadow?

It removes all translations mapped via memslots. It's used in cases where
the translations become stale, or during shutdown.

> I presume that on x86 with NPT/EPT it basically does nothing - is that
> right?

It does do something: it removes all NPT/EPT ptes (named "sptes" in
arch/x86/kvm/). The translations are rebuilt on demand (when accesses by
the guest fault into the HV).
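The flush-then-refault behaviour described above can be sketched with a
small toy model (names such as shadow_pte, flush_shadow and fault_in are
invented for illustration; this is not the kernel's code):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy model of the kvm_arch_flush_shadow semantics discussed above:
 * flushing drops every translation mapped via memslots, and each
 * translation is rebuilt on demand when the guest next faults on it.
 */

#define NR_PTES 16

static long shadow_pte[NR_PTES];	/* 0 means "not mapped" */

/* Drop every translation; the next guest access must fault it back in. */
static void flush_shadow(void)
{
	for (size_t i = 0; i < NR_PTES; i++)
		shadow_pte[i] = 0;
}

/* On-demand rebuild: a guest access to gfn repopulates the entry. */
static long fault_in(size_t gfn, long host_pfn)
{
	if (gfn >= NR_PTES)
		return -1;
	if (!shadow_pte[gfn])
		shadow_pte[gfn] = host_pfn;
	return shadow_pte[gfn];
}
```

Note how nothing is ever "broken" by a flush: correctness is preserved
because every dropped entry is simply re-established at the next fault,
which is exactly why it is cheap on x86 NPT/EPT but expensive on POWER7
and impossible on PPC970 (where the guest cannot fault to the HV).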
> > > +	if (old->npages) {
> > > +		/* modifying guest_phys or flags */
> > > +		if (old->base_gfn != memslot->base_gfn)
> > > +			kvmppc_unmap_memslot(kvm, old);
> >
> > This case is also handled generically by the last kvm_arch_flush_shadow
> > call in __kvm_set_memory_region.
>
> Again, to use this we would need to know which memslot we're
> flushing.  If we could change __kvm_set_memory_region to pass the
> memslot for these kvm_arch_flush_shadow calls, then I could do as you
> suggest.  (Though I would need to think carefully about what would
> happen with guest invalidations of hardware PTEs in the interval
> between the rcu_assign_pointer(kvm->memslots, slots) and the
> kvm_arch_flush_shadow, and whether the invalidation would find the
> correct location in the rmap array, given that we have updated the
> base_gfn in the memslot without first getting rid of any references to
> those pages in the hardware page table.)

That can be done. I'll send a patch to flush per memslot in the next few
days; you can work out the PPC details in the meantime. To be clear: this
is necessary to have consistent behaviour across arches in the
kvm_set_memory codepath, which is tricky (not nitpicking).

Alternatively, kvm_arch_flush_shadow can be split into two methods (but
that's not necessary if memslot information is sufficient for PPC).

> > > +		if (memslot->dirty_bitmap &&
> > > +		    old->dirty_bitmap != memslot->dirty_bitmap)
> > > +			kvmppc_hv_get_dirty_log(kvm, old);
> > > +		return 0;
> > > +	}
> >
> > Better to clear the dirty log unconditionally in
> > kvm_arch_commit_memory_region, similarly to x86 (just so it's
> > consistent).
>
> OK.
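The per-memslot flush being proposed here amounts to restricting the
shadow teardown to one slot's gfn range rather than all of guest memory.
A minimal sketch, with invented names (toy_memslot, toy_flush_memslot):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical per-memslot flush: instead of dropping every translation
 * (the current kvm_arch_flush_shadow behaviour), drop only those whose
 * gfn falls inside one memslot's [base_gfn, base_gfn + npages) range.
 */

struct toy_memslot {
	unsigned long base_gfn;
	unsigned long npages;
};

#define TOY_NR_PTES 32

static long toy_pte[TOY_NR_PTES];	/* 0 means "not mapped" */

static void toy_map(unsigned long gfn, long pfn)
{
	if (gfn < TOY_NR_PTES)
		toy_pte[gfn] = pfn;
}

/* Flush only the translations covered by one memslot. */
static void toy_flush_memslot(const struct toy_memslot *slot)
{
	unsigned long gfn;

	for (gfn = slot->base_gfn;
	     gfn < slot->base_gfn + slot->npages && gfn < TOY_NR_PTES;
	     gfn++)
		toy_pte[gfn] = 0;
}
```

This is what lets the PPC970 case survive: removing a small I/O slot
tears down only that slot's PTEs, leaving the RAM slot's real-page
mappings intact.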
> > > --- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
> > > +++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
> > > @@ -81,7 +81,7 @@ static void remove_revmap_chain(struct kvm *kvm, long pte_index,
> > >  	ptel = rev->guest_rpte |= rcbits;
> > >  	gfn = hpte_rpn(ptel, hpte_page_size(hpte_v, ptel));
> > >  	memslot = __gfn_to_memslot(kvm_memslots(kvm), gfn);
> > > -	if (!memslot || (memslot->flags & KVM_MEMSLOT_INVALID))
> > > +	if (!memslot)
> > >  		return;
> >
> > Why remove this check? (I don't know why it was there in the first
> > place, just checking.)
>
> This is where we are removing the page backing a hardware PTE and thus
> removing the hardware PTE from the reverse-mapping list for the page.
> We want to be able to do that properly even if the memslot is in the
> process of going away.  I had the flags check in there originally
> because other places that used a memslot had that check, but when I
> read __kvm_set_memory_region more carefully I realized that the
> KVM_MEMSLOT_INVALID flag indicates that we should not create any more
> references to pages in the memslot, but we do still need to be able to
> handle references going away, i.e. pages in the memslot getting
> unmapped.
>
> Paul.

Yes, that's it. kvm_arch_flush_shadow requires functional memslot lookup,
for example.
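The asymmetry Paul describes, where an INVALID slot must refuse new
mappings but still service teardown, can be modelled roughly like this
(all names here, toy_gfn_to_slot, toy_can_map, toy_can_unmap, are
hypothetical, not the kernel's API):

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the KVM_MEMSLOT_INVALID flag discussed above. */
#define TOY_INVALID	0x1u

struct slot {
	unsigned long base_gfn;
	unsigned long npages;
	unsigned int flags;
};

/*
 * Lookup ignores flags entirely, so a slot that is mid-removal is
 * still found; this is the "functional memslot lookup" that paths
 * like remove_revmap_chain depend on.
 */
static struct slot *toy_gfn_to_slot(struct slot *slots, size_t n,
				    unsigned long gfn)
{
	for (size_t i = 0; i < n; i++)
		if (gfn >= slots[i].base_gfn &&
		    gfn < slots[i].base_gfn + slots[i].npages)
			return &slots[i];
	return NULL;
}

/* Fault-in (creating new references) must refuse an INVALID slot... */
static int toy_can_map(const struct slot *s)
{
	return s && !(s->flags & TOY_INVALID);
}

/* ...but unmapping (references going away) must still be allowed. */
static int toy_can_unmap(const struct slot *s)
{
	return s != NULL;
}
```

In other words, INVALID gates only the creation of references; removal
paths check just for the slot's existence, which is exactly why dropping
the flags test in remove_revmap_chain is correct.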