Re: [patch 2/5] KVM: MMU: allow pinning spte translations (TDP-only)

On 06/19/2014 09:26 PM, Marcelo Tosatti wrote:
On Thu, Jun 19, 2014 at 11:01:06AM +0300, Avi Kivity wrote:
On 06/19/2014 02:12 AM, mtosatti@xxxxxxxxxx wrote:
Allow vcpus to pin spte translations by:

1) Creating a per-vcpu list of pinned ranges.
2) On mmu reload request:
	- Fault ranges.
	- Mark sptes with a pinned bit.
	- Mark shadow pages as pinned.

3) Then modify the following actions:
	- Page age => skip spte flush.
	- MMU notifiers => force mmu reload request (which kicks cpu out of
				guest mode).
	- GET_DIRTY_LOG => force mmu reload request.
	- SLAB shrinker => skip shadow page deletion.

TDP-only.
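
For context, the per-vcpu bookkeeping implied by the hunks below is
roughly the following. This is only a sketch: the struct definition is
not part of the quoted hunks, only its base_gfn, npages and link fields
appear there.

	/* Sketch of one pinned range; lives on vcpu->arch.pinned_mmu_pages. */
	struct kvm_pinned_page_range {
		gfn_t base_gfn;		/* first pinned guest frame number */
		unsigned long npages;	/* number of pages in the range */
		struct list_head link;	/* entry in the per-vcpu list */
	};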

+int kvm_mmu_register_pinned_range(struct kvm_vcpu *vcpu,
+				  gfn_t base_gfn, unsigned long npages)
+{
+	struct kvm_pinned_page_range *p;
+
+	mutex_lock(&vcpu->arch.pinned_mmu_mutex);
+	list_for_each_entry(p, &vcpu->arch.pinned_mmu_pages, link) {
+		if (p->base_gfn == base_gfn && p->npages == npages) {
+			mutex_unlock(&vcpu->arch.pinned_mmu_mutex);
+			return -EEXIST;
+		}
+	}
+	mutex_unlock(&vcpu->arch.pinned_mmu_mutex);
+
+	if (vcpu->arch.nr_pinned_ranges >=
+	    KVM_MAX_PER_VCPU_PINNED_RANGE)
+		return -ENOSPC;
+
+	p = kzalloc(sizeof(struct kvm_pinned_page_range), GFP_KERNEL);
+	if (!p)
+		return -ENOMEM;
+
+	vcpu->arch.nr_pinned_ranges++;
+
+	trace_kvm_mmu_register_pinned_range(vcpu->vcpu_id, base_gfn, npages);
+
+	INIT_LIST_HEAD(&p->link);
+	p->base_gfn = base_gfn;
+	p->npages = npages;
+	mutex_lock(&vcpu->arch.pinned_mmu_mutex);
+	list_add(&p->link, &vcpu->arch.pinned_mmu_pages);
+	mutex_unlock(&vcpu->arch.pinned_mmu_mutex);
+	kvm_make_request(KVM_REQ_MMU_RELOAD, vcpu);
+
+	return 0;
+}
+
What happens if ranges overlap (within a vcpu, cross-vcpu)?
The page(s) are faulted multiple times if ranges overlap within a vcpu.

I see no reason to disallow overlapping ranges. Do you?

Not really.  Just making sure nothing horrible happens.


Or if a range overflows and wraps around 0?
Pagefault fails on vm-entry -> KVM_REQ_TRIPLE_FAULT.

Will double check for overflows to make sure.

Will the loop terminate?
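
A minimal guard of the kind being discussed might look like this at the
top of kvm_mmu_register_pinned_range() (a sketch only, not part of the
posted patch):

	/* Reject empty ranges and ranges whose end wraps past the top of
	 * the gfn space, so the fault loop cannot walk a wrapped range. */
	if (npages == 0 || base_gfn + npages < base_gfn)
		return -EINVAL;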

Looks like you're limiting the number of ranges, but not the number
of pages, so a guest can lock all of its memory.
Yes. The page pinning at get_page time can also lock all of
guest memory.

I'm sure that can't be good. Maybe subject this pinning to the task mlock limit.
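
Charging the pinned pages against RLIMIT_MEMLOCK, as other pinning
paths in the kernel do, might look roughly like this (a sketch, not
part of the patch; the per-vcpu nr_pinned_pages counter is hypothetical
and uncharging on unregister is omitted):

	unsigned long locked, limit;

	/* Account the new range against the task's mlock limit. */
	locked = vcpu->arch.nr_pinned_pages + npages;
	limit  = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
	if (locked > limit && !capable(CAP_IPC_LOCK))
		return -ENOMEM;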


+
+/*
+ * Pin KVM MMU page translations. This guarantees, for valid
+ * addresses registered by kvm_mmu_register_pinned_range (valid address
+ * meaning address which possesses sufficient information for fault to
+ * be resolved), valid translations exist while in guest mode and
+ * therefore no VM-exits due to faults will occur.
+ *
+ * Failure to instantiate pages will abort guest entry.
+ *
+ * Page frames should be pinned with get_page in advance.
+ *
+ * Pinning is not guaranteed while executing as L2 guest.
Does this undermine security?
PEBS writes should not be enabled when L2 guest is executing.

What prevents L1 from setting up PEBS MSRs for L2?

+	list_for_each_entry(p, &vcpu->arch.pinned_mmu_pages, link) {
+		gfn_t gfn_offset;
+
+		for (gfn_offset = 0; gfn_offset < p->npages; gfn_offset++) {
+			gfn_t gfn = p->base_gfn + gfn_offset;
+			int r;
+			bool pinned = false;
+
+			r = vcpu->arch.mmu.page_fault(vcpu, gfn << PAGE_SHIFT,
+						     PFERR_WRITE_MASK, false,
+						     true, &pinned);
+			/* MMU notifier sequence window: retry */
+			if (!r && !pinned)
+				kvm_make_request(KVM_REQ_MMU_RELOAD, vcpu);
+			if (r) {
+				kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu);
+				break;
+			}
+
+		}
+	}
+	mutex_unlock(&vcpu->arch.pinned_mmu_mutex);
+}
+
  int kvm_mmu_load(struct kvm_vcpu *vcpu)
  {
  	int r;
@@ -3916,6 +4101,7 @@
  		goto out;
  	/* set_cr3() should ensure TLB has been flushed */
  	vcpu->arch.mmu.set_cr3(vcpu, vcpu->arch.mmu.root_hpa);
+	kvm_mmu_pin_pages(vcpu);
  out:
  	return r;
  }

I don't see where  you unpin pages, so even if you limit the number
of pinned pages, a guest can pin all of memory by iterating over all
of memory and pinning it a chunk at a time.
The caller should be responsible for limiting the number of pages pinned,
since it is the one pinning the struct pages?

The caller would be the debug store area MSR callbacks. How would they know what the limit is?


And in that case, we should remove any limiting from this interface, as
that is confusing.

You might try something similar to guest MTRR handling.
Please be more verbose.


mtrr_state already provides physical range attributes that are looked up on every fault, so I thought you could get the pinned attribute from there. But I guess that's too late; you want to pre-fault everything.
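
A fault-time lookup in that style (analogous to how MTRR attributes are
resolved per fault) could reuse the list the patch already maintains;
a hypothetical sketch, since the patch pre-faults instead:

	/* Hypothetical helper: is this gfn inside a registered pinned range? */
	static bool gfn_is_pinned(struct kvm_vcpu *vcpu, gfn_t gfn)
	{
		struct kvm_pinned_page_range *p;

		list_for_each_entry(p, &vcpu->arch.pinned_mmu_pages, link)
			if (gfn >= p->base_gfn && gfn - p->base_gfn < p->npages)
				return true;
		return false;
	}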
