On Fri, Aug 10, 2012 at 10:34:39AM +1000, Paul Mackerras wrote:
> On Thu, Aug 09, 2012 at 03:16:12PM -0300, Marcelo Tosatti wrote:
>
> > The !memslot->npages case is handled in __kvm_set_memory_region
> > (please read that part, before the kvm_arch_prepare_memory_region() call).
> >
> > kvm_arch_flush_shadow should be implemented.
>
> Book3S HV doesn't have shadow page tables per se; rather, the hardware
> page table is under the control of the hypervisor (i.e. KVM), and
> entries are added and removed by the guest using hypercalls.  On
> recent machines (POWER7) the hypervisor can choose whether or not to
> have the hardware PTE point to a real page of memory; if it doesn't,
> access by the guest will trap to the hypervisor.  On older machines
> (PPC970) we don't have that flexibility, and we have to provide a real
> page of memory (i.e. RAM or I/O) behind every hardware PTE.  (This is
> because PPC970 provides no way for page faults in the guest to go to
> the hypervisor.)
>
> I could implement kvm_arch_flush_shadow to remove the backing pages
> behind every hardware PTE, but that would be very slow and inefficient
> on POWER7, and would break the guest on PPC970, particularly in the
> case where userspace is removing a small memory slot containing some
> I/O device and leaving the memory slot for system RAM untouched.
>
> So the reason for unmapping the hardware PTEs in
> kvm_arch_prepare_memory_region rather than kvm_arch_flush_shadow is
> that that way we know which memslot is going away.
>
> What exactly are the semantics of kvm_arch_flush_shadow?

It removes all translations mapped via memslots. It's used in cases where
the translations become stale, or during shutdown.

> I presume that on x86 with NPT/EPT it basically does nothing - is that
> right?

It does do something: it removes all NPT/EPT ptes (named "sptes" in
arch/x86/kvm/). The translations are rebuilt on demand (when accesses by
the guest fault into the HV).
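The flush-then-refault behaviour described above can be sketched with a
small toy model (names such as shadow_pte, flush_shadow and fault_in are
invented for illustration; this is not the kernel's code):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy model of the kvm_arch_flush_shadow semantics discussed above:
 * flushing drops every translation mapped via memslots, and each
 * translation is rebuilt on demand when the guest next faults on it.
 */

#define NR_PTES 16

static long shadow_pte[NR_PTES];	/* 0 means "not mapped" */

/* Drop every translation; the next guest access must fault it back in. */
static void flush_shadow(void)
{
	for (size_t i = 0; i < NR_PTES; i++)
		shadow_pte[i] = 0;
}

/* On-demand rebuild: a guest access to gfn repopulates the entry. */
static long fault_in(size_t gfn, long host_pfn)
{
	if (gfn >= NR_PTES)
		return -1;
	if (!shadow_pte[gfn])
		shadow_pte[gfn] = host_pfn;
	return shadow_pte[gfn];
}
```

Note how nothing is ever "broken" by a flush: correctness is preserved
because every dropped entry is simply re-established at the next fault,
which is exactly why it is cheap on x86 NPT/EPT but expensive on POWER7
and impossible on PPC970 (where the guest cannot fault to the HV).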
> > > +	if (old->npages) {
> > > +		/* modifying guest_phys or flags */
> > > +		if (old->base_gfn != memslot->base_gfn)
> > > +			kvmppc_unmap_memslot(kvm, old);
> >
> > This case is also handled generically by the last kvm_arch_flush_shadow
> > call in __kvm_set_memory_region.
>
> Again, to use this we would need to know which memslot we're
> flushing.  If we could change __kvm_set_memory_region to pass the
> memslot for these kvm_arch_flush_shadow calls, then I could do as you
> suggest.  (Though I would need to think carefully about what would
> happen with guest invalidations of hardware PTEs in the interval
> between the rcu_assign_pointer(kvm->memslots, slots) and the
> kvm_arch_flush_shadow, and whether the invalidation would find the
> correct location in the rmap array, given that we have updated the
> base_gfn in the memslot without first getting rid of any references to
> those pages in the hardware page table.)

That can be done. I'll send a patch to flush per memslot in the next few
days; you can work out the PPC details in the meantime. To be clear: this
is necessary to have consistent behaviour across arches in the
kvm_set_memory codepath, which is tricky (not nitpicking).

Alternatively, kvm_arch_flush_shadow can be split into two methods (but
that's not necessary if memslot information is sufficient for PPC).

> > > +		if (memslot->dirty_bitmap &&
> > > +		    old->dirty_bitmap != memslot->dirty_bitmap)
> > > +			kvmppc_hv_get_dirty_log(kvm, old);
> > > +		return 0;
> > > +	}
> >
> > Better to clear the dirty log unconditionally in
> > kvm_arch_commit_memory_region, similarly to x86 (just so it's
> > consistent).
>
> OK.
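The per-memslot flush being proposed here amounts to restricting the
shadow teardown to one slot's gfn range rather than all of guest memory.
A minimal sketch, with invented names (toy_memslot, toy_flush_memslot):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical per-memslot flush: instead of dropping every translation
 * (the current kvm_arch_flush_shadow behaviour), drop only those whose
 * gfn falls inside one memslot's [base_gfn, base_gfn + npages) range.
 */

struct toy_memslot {
	unsigned long base_gfn;
	unsigned long npages;
};

#define TOY_NR_PTES 32

static long toy_pte[TOY_NR_PTES];	/* 0 means "not mapped" */

static void toy_map(unsigned long gfn, long pfn)
{
	if (gfn < TOY_NR_PTES)
		toy_pte[gfn] = pfn;
}

/* Flush only the translations covered by one memslot. */
static void toy_flush_memslot(const struct toy_memslot *slot)
{
	unsigned long gfn;

	for (gfn = slot->base_gfn;
	     gfn < slot->base_gfn + slot->npages && gfn < TOY_NR_PTES;
	     gfn++)
		toy_pte[gfn] = 0;
}
```

This is what lets the PPC970 case survive: removing a small I/O slot
tears down only that slot's PTEs, leaving the RAM slot's real-page
mappings intact.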
> > > --- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
> > > +++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
> > > @@ -81,7 +81,7 @@ static void remove_revmap_chain(struct kvm *kvm, long pte_index,
> > >  	ptel = rev->guest_rpte |= rcbits;
> > >  	gfn = hpte_rpn(ptel, hpte_page_size(hpte_v, ptel));
> > >  	memslot = __gfn_to_memslot(kvm_memslots(kvm), gfn);
> > > -	if (!memslot || (memslot->flags & KVM_MEMSLOT_INVALID))
> > > +	if (!memslot)
> > >  		return;
> >
> > Why remove this check? (I don't know why it was there in the first
> > place, just checking.)
>
> This is where we are removing the page backing a hardware PTE and thus
> removing the hardware PTE from the reverse-mapping list for the page.
> We want to be able to do that properly even if the memslot is in the
> process of going away.  I had the flags check in there originally
> because other places that used a memslot had that check, but when I
> read __kvm_set_memory_region more carefully I realized that the
> KVM_MEMSLOT_INVALID flag indicates that we should not create any more
> references to pages in the memslot, but we do still need to be able to
> handle references going away, i.e. pages in the memslot getting
> unmapped.
>
> Paul.

Yes, that's it. kvm_arch_flush_shadow requires functional memslot lookup,
for example.
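The asymmetry Paul describes, where an INVALID slot must refuse new
mappings but still service teardown, can be modelled roughly like this
(all names here, toy_gfn_to_slot, toy_can_map, toy_can_unmap, are
hypothetical, not the kernel's API):

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the KVM_MEMSLOT_INVALID flag discussed above. */
#define TOY_INVALID	0x1u

struct slot {
	unsigned long base_gfn;
	unsigned long npages;
	unsigned int flags;
};

/*
 * Lookup ignores flags entirely, so a slot that is mid-removal is
 * still found; this is the "functional memslot lookup" that paths
 * like remove_revmap_chain depend on.
 */
static struct slot *toy_gfn_to_slot(struct slot *slots, size_t n,
				    unsigned long gfn)
{
	for (size_t i = 0; i < n; i++)
		if (gfn >= slots[i].base_gfn &&
		    gfn < slots[i].base_gfn + slots[i].npages)
			return &slots[i];
	return NULL;
}

/* Fault-in (creating new references) must refuse an INVALID slot... */
static int toy_can_map(const struct slot *s)
{
	return s && !(s->flags & TOY_INVALID);
}

/* ...but unmapping (references going away) must still be allowed. */
static int toy_can_unmap(const struct slot *s)
{
	return s != NULL;
}
```

In other words, INVALID gates only the creation of references; removal
paths check just for the slot's existence, which is exactly why dropping
the flags test in remove_revmap_chain is correct.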