On Thu, Jun 19, 2014 at 03:40:31PM -0300, Marcelo Tosatti wrote:
> On Thu, Jun 19, 2014 at 11:17:19AM +0300, Gleb Natapov wrote:
> > On Wed, Jun 18, 2014 at 08:12:07PM -0300, mtosatti@xxxxxxxxxx wrote:
> > > Reload remote vcpus' MMU from the GET_DIRTY_LOG codepath, before
> > > deleting a pinned spte.
> > >
> > > Signed-off-by: Marcelo Tosatti <mtosatti@xxxxxxxxxx>
> > >
> > > ---
> > >  arch/x86/kvm/mmu.c | 3 +++
> > >  1 file changed, 3 insertions(+)
> > >
> > > Index: kvm.pinned-sptes/arch/x86/kvm/mmu.c
> > > ===================================================================
> > > --- kvm.pinned-sptes.orig/arch/x86/kvm/mmu.c	2014-06-13 16:50:50.040140594 -0300
> > > +++ kvm.pinned-sptes/arch/x86/kvm/mmu.c	2014-06-13 16:51:05.620104451 -0300
> > > @@ -1247,6 +1247,9 @@
> > >  	spte &= ~SPTE_MMU_WRITEABLE;
> > >  	spte = spte & ~PT_WRITABLE_MASK;
> > >
> > > +	if (is_pinned_spte(spte))
> > > +		mmu_reload_pinned_vcpus(kvm);
> > > +
> > Why write protect it at all? mmu_reload_pinned_vcpus() will unprotect
> > it anyway on the next vmentry. Isn't it better to just report all
> > pinned pages as dirty always?
>
> That was the initial plan; however, it is awkward to stop vcpus,
> execute get_dirty_log twice, and have pages marked as dirty on the
> second execution.
Indeed, but I think it may happen today with vhost (or even with other
devices that emulate DMA), so userspace should be able to deal with it
already.

> That is, it is "incorrect" to report pages as dirty when they are
> clean.
In the case of the PEBS area we cannot know. The CPU writes there
directly, without KVM even knowing about it, so the only sane thing is
to assume that after a vmentry the PEBS area is dirty. This is what
happens with this patch anyway, BTW: mmu_reload_pinned_vcpus() forces
->page_fault(FERR_WRITE), which marks all pinned pages as dirty even if
the pages are never written to. You can achieve the same by having
vcpu->pinned_page_dirty, set to true on each vmentry and to false on
each GET_DIRTY_LOG (a rough sketch is below).

> Moreover, if the number of pinned pages is larger than the dirty
> threshold to stop the VM and migrate, you'll never migrate. If vcpus
> are in HLT and don't VM-enter immediately, the pages should not be
> refaulted right away.
We should not allow that many pinned pages for security reasons anyway,
and having a lot of pages to fault in on the next vmentry after each
GET_DIRTY_LOG will slow the guest down considerably during migration.

> Do you think the optimization is worthwhile?
I think it is simpler, but even if we go with your current approach it
should be improved: there is no point sending an IPI to all vcpus in
spte_write_protect() like you do here, since the IPI will be sent
anyway at the end of write protection because of the writable ->
non-writable pte transition. So all you have to do here is set
KVM_REQ_MMU_RELOAD; no need for an IPI. In fact, this makes me think
that write protecting a pinned page here is incorrect in the first
place: the old translation may not be in the TLB, and if the CPU tries
to write a PEBS entry after the pte is write protected but before the
MMU is reloaded, it will encounter a non-writable pte and god knows
what will happen; the SDM does not tell.
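Here is the rough sketch of the vcpu->pinned_page_dirty idea I
mentioned above. Completely untested, and the field,
mark_pinned_pages_dirty() and the hook functions are made-up names for
illustration, not existing KVM code:

/*
 * Untested sketch. pinned_page_dirty and mark_pinned_pages_dirty()
 * do not exist today; the hook points are only illustrative.
 */

/* vmentry path: the CPU may dump PEBS records into the pinned area
 * at any time after this point, so assume it becomes dirty. */
static void mark_pinned_dirty_on_vmentry(struct kvm_vcpu *vcpu)
{
	vcpu->pinned_page_dirty = true;
}

/* GET_DIRTY_LOG path: report the pinned pages of every vcpu that
 * entered the guest since the last call, then rearm the flag. */
static void collect_pinned_dirty(struct kvm *kvm)
{
	struct kvm_vcpu *vcpu;
	int i;

	kvm_for_each_vcpu(i, vcpu, kvm) {
		if (!vcpu->pinned_page_dirty)
			continue;
		/* made-up helper: sets the dirty_bitmap bits for the
		 * vcpu's pinned pages */
		mark_pinned_pages_dirty(vcpu);
		vcpu->pinned_page_dirty = false;
	}
}

With something like this, GET_DIRTY_LOG never has to touch the sptes of
pinned pages at all, and there is nothing to refault on the next
vmentry.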
Going back to your current approach, spte_write_protect() should be
something like this, IMO:

static bool spte_write_protect(struct kvm *kvm, u64 *sptep, bool pt_protect)
{
	u64 spte = *sptep;
	struct kvm_vcpu *vcpu;
	int i;

	if (!is_writable_pte(spte) &&
	      !(pt_protect && spte_is_locklessly_modifiable(spte)))
		return false;

	/*
	 * Never clear the writable bit on a pinned spte: the CPU may
	 * write a PEBS record through it at any moment. Set the reload
	 * request on every vcpu instead; it is processed before the
	 * next vmentry, so no IPI is needed here.
	 */
	if (!pt_protect && is_pinned_spte(spte)) {
		kvm_for_each_vcpu(i, vcpu, kvm)
			kvm_make_request(KVM_REQ_MMU_RELOAD, vcpu);
		return true;
	}

	rmap_printk("rmap_write_protect: spte %p %llx\n", sptep, *sptep);

	if (pt_protect)
		spte &= ~SPTE_MMU_WRITEABLE;
	spte = spte & ~PT_WRITABLE_MASK;

	return mmu_spte_update(sptep, spte);
}

--
			Gleb.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html