Re: [patch 2/5] KVM: MMU: allow pinning spte translations (TDP-only)

On Thu, Jun 19, 2014 at 11:01:06AM +0300, Avi Kivity wrote:
> 
> On 06/19/2014 02:12 AM, mtosatti@xxxxxxxxxx wrote:
> >Allow vcpus to pin spte translations by:
> >
> >1) Creating a per-vcpu list of pinned ranges.
> >2) On mmu reload request:
> >	- Fault ranges.
> >	- Mark sptes with a pinned bit.
> >	- Mark shadow pages as pinned.
> >
> >3) Then modify the following actions:
> >	- Page age => skip spte flush.
> >	- MMU notifiers => force mmu reload request (which kicks cpu out of
> >				guest mode).
> >	- GET_DIRTY_LOG => force mmu reload request.
> >	- SLAB shrinker => skip shadow page deletion.
> >
> >TDP-only.
> >
> >+int kvm_mmu_register_pinned_range(struct kvm_vcpu *vcpu,
> >+				  gfn_t base_gfn, unsigned long npages)
> >+{
> >+	struct kvm_pinned_page_range *p;
> >+
> >+	mutex_lock(&vcpu->arch.pinned_mmu_mutex);
> >+	list_for_each_entry(p, &vcpu->arch.pinned_mmu_pages, link) {
> >+		if (p->base_gfn == base_gfn && p->npages == npages) {
> >+			mutex_unlock(&vcpu->arch.pinned_mmu_mutex);
> >+			return -EEXIST;
> >+		}
> >+	}
> >+	mutex_unlock(&vcpu->arch.pinned_mmu_mutex);
> >+
> >+	if (vcpu->arch.nr_pinned_ranges >=
> >+	    KVM_MAX_PER_VCPU_PINNED_RANGE)
> >+		return -ENOSPC;
> >+
> >+	p = kzalloc(sizeof(struct kvm_pinned_page_range), GFP_KERNEL);
> >+	if (!p)
> >+		return -ENOMEM;
> >+
> >+	vcpu->arch.nr_pinned_ranges++;
> >+
> >+	trace_kvm_mmu_register_pinned_range(vcpu->vcpu_id, base_gfn, npages);
> >+
> >+	INIT_LIST_HEAD(&p->link);
> >+	p->base_gfn = base_gfn;
> >+	p->npages = npages;
> >+	mutex_lock(&vcpu->arch.pinned_mmu_mutex);
> >+	list_add(&p->link, &vcpu->arch.pinned_mmu_pages);
> >+	mutex_unlock(&vcpu->arch.pinned_mmu_mutex);
> >+	kvm_make_request(KVM_REQ_MMU_RELOAD, vcpu);
> >+
> >+	return 0;
> >+}
> >+
> 
> What happens if ranges overlap (within a vcpu, cross-vcpu)?

The page(s) are faulted multiple times if ranges overlap within a vcpu.

I see no reason to disallow overlapping ranges. Do you?
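
(If we ever did want to reject them, the existing duplicate check in
kvm_mmu_register_pinned_range could be widened into an interval-overlap
test -- untested sketch:

	mutex_lock(&vcpu->arch.pinned_mmu_mutex);
	list_for_each_entry(p, &vcpu->arch.pinned_mmu_pages, link) {
		/* [base_gfn, base_gfn+npages) intersects an existing range */
		if (base_gfn < p->base_gfn + p->npages &&
		    p->base_gfn < base_gfn + npages) {
			mutex_unlock(&vcpu->arch.pinned_mmu_mutex);
			return -EEXIST;
		}
	}
	mutex_unlock(&vcpu->arch.pinned_mmu_mutex);

But as said above, I don't see a need for it.)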

> Or if a range overflows and wraps around 0? 

Page fault fails on vm-entry -> KVM_REQ_TRIPLE_FAULT.

Will double check for overflows to make sure.
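
Something like this at the top of kvm_mmu_register_pinned_range should
cover it (sketch):

	/* Reject empty ranges and ranges that wrap around the gfn space. */
	if (npages == 0 || base_gfn + npages < base_gfn)
		return -EINVAL;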

> Or if it does not refer to RAM?

The user should have pinned the page(s) beforehand with gfn_to_page /
get_page, which ensures they are guest RAM? (Hmm, although it might be
good to double check here as well.)
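
For that double check, something along these lines at registration time
would catch non-RAM gfns early (sketch, assuming kvm_is_visible_gfn is
the right predicate here):

	gfn_t offset;

	/* Verify every gfn in the range is backed by a memslot. */
	for (offset = 0; offset < npages; offset++)
		if (!kvm_is_visible_gfn(vcpu->kvm, base_gfn + offset))
			return -EINVAL;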

> Looks like you're limiting the number of ranges, but not the number
> of pages, so a guest can lock all of its memory.

Yes. The page pinning at get_page time can also lock all of
guest memory.
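
If we wanted the interface itself to bound memory, a per-vcpu page
budget next to the range limit would look roughly like this (sketch;
nr_pinned_pages and KVM_MAX_PER_VCPU_PINNED_PAGES are made-up names,
not in this patch):

	/* Hypothetical total-page budget, checked next to the range limit. */
	if (vcpu->arch.nr_pinned_pages + npages > KVM_MAX_PER_VCPU_PINNED_PAGES)
		return -ENOSPC;
	vcpu->arch.nr_pinned_pages += npages;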

> >+
> >+/*
> >+ * Pin KVM MMU page translations. This guarantees, for valid
> >+ * addresses registered by kvm_mmu_register_pinned_range (valid address
> >+ * meaning an address that possesses sufficient information for the fault to
> >+ * be resolved), valid translations exist while in guest mode and
> >+ * therefore no VM-exits due to faults will occur.
> >+ *
> >+ * Failure to instantiate pages will abort guest entry.
> >+ *
> >+ * Page frames should be pinned with get_page in advance.
> >+ *
> >+ * Pinning is not guaranteed while executing as L2 guest.
> 
> Does this undermine security?

PEBS writes should not be enabled when L2 guest is executing.

> >+static void kvm_mmu_pin_pages(struct kvm_vcpu *vcpu)
> >+{
> >+	struct kvm_pinned_page_range *p;
> >+
> >+	if (is_guest_mode(vcpu))
> >+		return;
> >+
> >+	if (!vcpu->arch.mmu.direct_map)
> >+		return;
> >+
> >+	ASSERT(VALID_PAGE(vcpu->arch.mmu.root_hpa));
> >+
> >+	mutex_lock(&vcpu->arch.pinned_mmu_mutex);
> 
> Is the mutex actually needed? It seems it's only taken in vcpu
> context, so the vcpu mutex should be sufficient.

Right. Actually, the list_empty() access from the kicker function might
be unsafe. Will double check.
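
If the kicker only needs a hint, list_empty_careful() would avoid
taking pinned_mmu_mutex from that context (sketch; the kicker itself is
not shown in this patch):

	/* In the kicker: lockless check, tolerating a racy answer. */
	if (!list_empty_careful(&vcpu->arch.pinned_mmu_pages))
		kvm_make_request(KVM_REQ_MMU_RELOAD, vcpu);

Otherwise the kicker should take the mutex around the check.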

> >+	list_for_each_entry(p, &vcpu->arch.pinned_mmu_pages, link) {
> >+		gfn_t gfn_offset;
> >+
> >+		for (gfn_offset = 0; gfn_offset < p->npages; gfn_offset++) {
> >+			gfn_t gfn = p->base_gfn + gfn_offset;
> >+			int r;
> >+			bool pinned = false;
> >+
> >+			r = vcpu->arch.mmu.page_fault(vcpu, gfn << PAGE_SHIFT,
> >+						     PFERR_WRITE_MASK, false,
> >+						     true, &pinned);
> >+			/* MMU notifier sequence window: retry */
> >+			if (!r && !pinned)
> >+				kvm_make_request(KVM_REQ_MMU_RELOAD, vcpu);
> >+			if (r) {
> >+				kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu);
> >+				break;
> >+			}
> >+
> >+		}
> >+	}
> >+	mutex_unlock(&vcpu->arch.pinned_mmu_mutex);
> >+}
> >+
> >  int kvm_mmu_load(struct kvm_vcpu *vcpu)
> >  {
> >  	int r;
> >@@ -3916,6 +4101,7 @@
> >  		goto out;
> >  	/* set_cr3() should ensure TLB has been flushed */
> >  	vcpu->arch.mmu.set_cr3(vcpu, vcpu->arch.mmu.root_hpa);
> >+	kvm_mmu_pin_pages(vcpu);
> >  out:
> >  	return r;
> >  }
> >
> 
> I don't see where  you unpin pages, so even if you limit the number
> of pinned pages, a guest can pin all of memory by iterating over all
> of memory and pinning it a chunk at a time.

The caller should be responsible for limiting the number of pinned
pages, since it is the one pinning the struct pages?

And in that case, we should remove any limiting from this interface, as
that is confusing.

> You might try something similar to guest MTRR handling.

Please be more verbose.

