On Thu, Aug 09, 2012 at 10:25:32PM -0300, Marcelo Tosatti wrote:
> On Fri, Aug 10, 2012 at 10:34:39AM +1000, Paul Mackerras wrote:
> > On Thu, Aug 09, 2012 at 03:16:12PM -0300, Marcelo Tosatti wrote:
> >
> > > The !memslot->npages case is handled in __kvm_set_memory_region
> > > (please read that part, before the kvm_arch_prepare_memory_region() call).
> > >
> > > kvm_arch_flush_shadow should be implemented.
> >
> > Book3S HV doesn't have shadow page tables per se; rather, the hardware
> > page table is under the control of the hypervisor (i.e. KVM), and
> > entries are added and removed by the guest using hypercalls.  On
> > recent machines (POWER7) the hypervisor can choose whether or not to
> > have the hardware PTE point to a real page of memory; if it doesn't,
> > access by the guest will trap to the hypervisor.  On older machines
> > (PPC970) we don't have that flexibility, and we have to provide a real
> > page of memory (i.e. RAM or I/O) behind every hardware PTE.  (This is
> > because PPC970 provides no way for page faults in the guest to go to
> > the hypervisor.)
> >
> > I could implement kvm_arch_flush_shadow to remove the backing pages
> > behind every hardware PTE, but that would be very slow and inefficient
> > on POWER7, and would break the guest on PPC970, particularly in the
> > case where userspace is removing a small memory slot containing some
> > I/O device and leaving the memory slot for system RAM untouched.
> >
> > So the reason for unmapping the hardware PTEs in
> > kvm_arch_prepare_memory_region rather than kvm_arch_flush_shadow is
> > that that way we know which memslot is going away.
> >
> > What exactly are the semantics of kvm_arch_flush_shadow?
>
> It removes all translations mapped via memslots. It's used in cases where
> the translations become stale, or during shutdown.
>
> > I presume that on x86 with NPT/EPT it basically does nothing - is that right?
>
> It does: it removes all NPT/EPT ptes (named "sptes" in arch/x86/kvm/).
> The translations are rebuilt on demand (when accesses by the guest fault
> into the HV).
>
> > > > +	if (old->npages) {
> > > > +		/* modifying guest_phys or flags */
> > > > +		if (old->base_gfn != memslot->base_gfn)
> > > > +			kvmppc_unmap_memslot(kvm, old);
> > >
> > > This case is also handled generically by the last kvm_arch_flush_shadow
> > > call in __kvm_set_memory_region.
> >
> > Again, to use this we would need to know which memslot we're
> > flushing.  If we could change __kvm_set_memory_region to pass the
> > memslot for these kvm_arch_flush_shadow calls, then I could do as you
> > suggest.  (Though I would need to think carefully about what would
> > happen with guest invalidations of hardware PTEs in the interval
> > between the rcu_assign_pointer(kvm->memslots, slots) and the
> > kvm_arch_flush_shadow, and whether the invalidation would find the
> > correct location in the rmap array, given that we have updated the
> > base_gfn in the memslot without first getting rid of any references to
> > those pages in the hardware page table.)
>
> That can be done.
>
> I'll send a patch to flush per memslot in the next days; you can work
> out the PPC details in the meantime.
>
> To be clear: this is necessary to have consistent behaviour across
> arches in the kvm_set_memory codepath, which is tricky (not nitpicking).
>
> Alternatively, kvm_arch_flush_shadow can be split into two methods (but
> that's not necessary if memslot information is sufficient for PPC).

What kvm_set_memory requires is that there are no stale translations.
On x86, it is cheaper to nuke all translation entries and let them be
rebuilt on page faults. But, if necessary, we can introduce a new
callback "kvm_arch_sync_shadow", which PPC can use to verify that its
translations are valid, if that makes sense.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html