Re: [PATCH v5 4/9] mm: Add test_clear_young_fast_only MMU notifier

Sean Christopherson <seanjc@xxxxxxxxxx> · Tue, 9 Jul 2024 10:49:34 -0700

On Mon, Jul 08, 2024, James Houghton wrote:
> On Fri, Jun 28, 2024 at 7:38 PM James Houghton <jthoughton@xxxxxxxxxx> wrote:
> >
> > On Mon, Jun 17, 2024 at 11:37 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> > >
> > > On Mon, Jun 17, 2024, James Houghton wrote:
> > > > On Fri, Jun 14, 2024 at 4:17 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> > > > > Ooh!  Actually, after fiddling a bit to see how feasible fast-aging in the shadow
> > > > > MMU would be, I'm pretty sure we can do straight there for nested TDP.  Or rather,
> > > > > I suspect/hope we can get close enough for an initial merge, which would allow
> > > > > aging_is_fast to be a property of the mmu_notifier, i.e. would simplify things
> > > > > because KVM wouldn't need to communicate MMU_NOTIFY_WAS_FAST for each notification.
> > > > >
> > > > > Walking KVM's rmaps requires mmu_lock because adding/removing rmap entries is done
> > > > > in such a way that a lockless walk would be painfully complex.  But if there is
> > > > > exactly _one_ rmap entry for a gfn, then slot->arch.rmap[...] points directly at
> > > > > that one SPTE.  And with nested TDP, unless L1 is doing something uncommon, e.g.
> > > > > mapping the same page into multiple L2s, that overwhelming vast majority of rmaps
> > > > > have only one entry.  That's not the case for legacy shadow paging because kernels
> > > > > almost always map a pfn using multiple virtual addresses, e.g. Linux's direct map
> > > > > along with any userspace mappings.
> >
> > Hi Sean, sorry for taking so long to get back to you.
> >
> > So just to make sure I have this right: if L1 is using TDP, the gfns
> > in L0 will usually only be mapped by a single spte. If L1 is not using
> > TDP, then all bets are off. Is that true?
> >
> > If that is true, given that we don't really have control over whether
> > or not L1 decides to use TDP, the lockless shadow MMU walk will work,
> > but, if L1 is not using TDP, it will often return false negatives
> > (says "old" for an actually-young gfn). So then I don't really
> > understand conditioning the lockless shadow MMU walk on us (L0) using
> > the TDP MMU[1]. We care about L1, right?
> 
> Ok I think I understand now. If L1 is using shadow paging, L2 is
> accessing memory the same way L1 would, so we use the TDP MMU at L0
> for this case (if tdp_mmu_enabled). If L1 is using TDP, then we must
> use the shadow MMU, so that's the interesting case.

Yep.
 
> > (Maybe you're saying that, when the TDP MMU is enabled, the only cases
> > where the shadow MMU is used are cases where gfns are practically
> > always mapped by a single shadow PTE. This isn't how I understood your
> > mail, but this is what your hack-a-patch[1] makes me think.)
> 
> So it appears that this interpretation is actually what you meant.

Yep.

> > [1] https://lore.kernel.org/linux-mm/ZmzPoW7K5GIitQ8B@xxxxxxxxxx/
> >
> > >
> > > ...
> > >
> > > > Hmm, interesting. I need to spend a little bit more time digesting this.
> > > >
> > > > Would you like to see this included in v6? (It'd be nice to avoid the
> > > > WAS_FAST stuff....) Should we leave it for a later series? I haven't
> > > > formed my own opinion yet.
> > >
> > > I would say it depends on the viability and complexity of my idea.  E.g. if it
> > > pans out more or less like my rough sketch, then it's probably worth taking on
> > > the extra code+complexity in KVM to avoid the whole WAS_FAST goo.
> > >
> > > Note, if we do go this route, the implementation would need to be tweaked to
> > > handle the difference in behavior between aging and last-minute checks for eviction,
> > > which I obviously didn't understand when I threw together that hack-a-patch.
> > >
> > > I need to think more about how best to handle that though, e.g. skipping GFNs with
> > > multiple mappings is probably the worst possible behavior, as we'd risk evicting
> > > hot pages.  But falling back to taking mmu_lock for write isn't all that desirable
> > > either.
> >
> > I think falling back to the write lock is more desirable than evicting
> > a young page.
> >
> > I've attached what I think could work, a diff on top of this series.
> > It builds at least. It uses rcu_read_lock/unlock() for
> > walk_shadow_page_lockless_begin/end(NULL), and it puts a
> > synchronize_rcu() in kvm_mmu_commit_zap_page().
> >
> > It doesn't get rid of the WAS_FAST things because it doesn't do
> > exactly what [1] does. It basically makes three calls now: lockless
> > TDP MMU, lockless shadow MMU, locked shadow MMU. It only calls the
> > locked shadow MMU bits if the lockless bits say !young (instead of
> > being conditioned on tdp_mmu_enabled). My choice is definitely
> > questionable for the clear path.
> 
> I still don't think we should get rid of the WAS_FAST stuff.

I do :-)

> The assumption that the L1 VM will almost never share pages between L2
> VMs is questionable. The real question becomes: do we care to have
> accurate age information for this case? I think so.

I think you're conflating two different things.  WAS_FAST isn't about accuracy,
it's about supporting lookaround in conditionally fast secondary MMUs.

Accuracy only comes into play when we're talking about the last-minute check,
which, IIUC, has nothing to do with WAS_FAST because any potential lookaround has
already been performed.

> It's not completely trivial to get the lockless walking of the shadow
> MMU rmaps correct either (please see the patch I attached here[1]).

Heh, it's not correct.  Invoking synchronize_rcu() in kvm_mmu_commit_zap_page()
is illegal, as mmu_lock (rwlock) is held and synchronize_rcu() might_sleep().

For kvm_test_age_rmap_fast(), KVM can blindly read READ_ONCE(*sptep).  KVM might
read garbage, but that would be an _extremely_ rare scenario, and reporting a
zapped page as being young is acceptable in that 1 in a billion situation.

For kvm_age_rmap_fast(), i.e. where KVM needs to write, I'm pretty sure KVM can
handle that by rechecking the rmap and using CMPXCHG to write the SPTE.  If the
rmap is unchanged, then the old SPTE value is guaranteed to be valid, in the sense
that its value most definitely came from a KVM shadow page table.  Ah, drat, that
won't work, because very theoretically, the page table could be freed, reallocated,
and rewritten with the exact same value by something other than KVM.  Hrm.

Looking more closely, I think we can go straight to supporting rmap walks outside
of mmu_lock.  There will still be a "lock", but it will be a *very* rudimentary
lock, akin to the TDP MMU's REMOVED_SPTE approach.  Bit 0 of rmap_head->val is
used to indicate "many", while bits 63:3/31:2 on 64-bit/32-bit KVM hold the
pointer (to a SPTE or a list).  That means bit 1 is available for shenanigans.

If we use bit 1 to lock the rmap, then the fast mmu_notifier can safely walk the
entire rmap chain.  And with a reader/write scheme, the rmap walks that are
performed under mmu_lock don't need to lock the rmap, which means flows like
kvm_mmu_zap_collapsible_spte() don't need to be modified to avoid recursive
self-deadlock.  Lastly, the locking can be conditioned on the rmap being valid,
i.e. having at least one SPTE.  That way the common case of a gfn not having any
rmaps is a glorified nop.

Adding the locking isn't actually all that difficult, with the *huge* caveat that
the below patch is compile-tested only.  The vast majority of the churn is to make
it so existing code ignores the new KVM_RMAP_LOCKED bit.

I don't know that we should pursue such an approach in this series unless we have
to.  E.g. if we can avoid WAS_FAST or don't have to carry too much intermediate
complexity, then it'd probably be better to land the TDP MMU support first and
then add nested TDP support later.

At the very least, it does make me more confident that a fast walk of the rmaps
is very doable (at least for nested TDP), i.e. makes me even more steadfast
against adding WAS_FAST.

> And the WAS_FAST functionality isn't even that complex to begin with.

I agree the raw code isn't terribly complex, but it's not trivial either.  And the
concept and *behavior* is complex, which is just as much of a maintenance burden
as the code itself.  E.g. it requires knowing that KVM has multiple MMUs buried
behind a single mmu_notifier, and that a "hit" on the fast MMU will trigger
lookaround on the fast MMU, but not the slow MMU.  Understanding and describing
the implications of that behavior isn't easy.  E.g. if GFN=X is young in the TDP
MMU, but X+1..X+N are young only in the shadow MMU, is doing lookaround and making
decisions based purely on the TDP MMU state the "right" behavior?

I also really don't like bleeding KVM details into the mmu_nofitier APIs.  The
need for WAS_FAST is 100% a KVM limitation.  AFAIK, no other secondary MMU has
multiple MMU implementations active behind a single notifier, and other than lack
of support, nothing fundamentally prevents a fast query in the shadow MMU.

---
 arch/x86/kvm/mmu/mmu.c | 163 ++++++++++++++++++++++++++++++++---------
 1 file changed, 128 insertions(+), 35 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 842a3a4cdfe9..bfcfdc0a8600 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -935,9 +935,59 @@ static struct kvm_memory_slot *gfn_to_memslot_dirty_bitmap(struct kvm_vcpu *vcpu
  * About rmap_head encoding:
  *
  * If the bit zero of rmap_head->val is clear, then it points to the only spte
- * in this rmap chain. Otherwise, (rmap_head->val & ~1) points to a struct
+ * in this rmap chain. Otherwise, (rmap_head->val & ~3) points to a struct
  * pte_list_desc containing more mappings.
  */
+#define KVM_RMAP_MANY	BIT(0)
+#define KVM_RMAP_LOCKED	BIT(1)
+
+static unsigned long kvm_rmap_lock(struct kvm_rmap_head *rmap_head)
+{
+	unsigned long old_val, new_val;
+
+	old_val = READ_ONCE(rmap_head->val);
+	if (!old_val)
+		return 0;
+
+	do {
+		while (old_val & KVM_RMAP_LOCKED) {
+			old_val = READ_ONCE(rmap_head->val);
+			cpu_relax();
+		}
+		if (!old_val)
+			return 0;
+
+		new_val = old_val | KVM_RMAP_LOCKED;
+	} while (!try_cmpxchg(&rmap_head->val, &old_val, new_val));
+
+	return old_val;
+}
+
+static unsigned long kvm_rmap_write_lock(struct kvm_rmap_head *rmap_head)
+{
+	return kvm_rmap_lock(rmap_head);
+}
+
+static void kvm_rmap_write_ulock(struct kvm_rmap_head *rmap_head,
+			    unsigned long new_val)
+{
+	WARN_ON_ONCE(new_val & KVM_RMAP_LOCKED);
+	WRITE_ONCE(rmap_head->val, new_val);
+}
+
+static unsigned long kvm_rmap_read_lock(struct kvm_rmap_head *rmap_head)
+{
+	return kvm_rmap_lock(rmap_head);
+}
+
+static void kvm_rmap_read_unlock(struct kvm_rmap_head *rmap_head,
+				 unsigned long old_val)
+{
+	if (!old_val)
+		return;
+
+	WRITE_ONCE(rmap_head->val, old_val & ~KVM_RMAP_LOCKED);
+}
 
 /*
  * Returns the number of pointers in the rmap chain, not counting the new one.
@@ -945,21 +995,24 @@ static struct kvm_memory_slot *gfn_to_memslot_dirty_bitmap(struct kvm_vcpu *vcpu
 static int pte_list_add(struct kvm_mmu_memory_cache *cache, u64 *spte,
 			struct kvm_rmap_head *rmap_head)
 {
+	unsigned long old_val, new_val;
 	struct pte_list_desc *desc;
 	int count = 0;
 
-	if (!rmap_head->val) {
-		rmap_head->val = (unsigned long)spte;
-	} else if (!(rmap_head->val & 1)) {
+	old_val = kvm_rmap_write_lock(rmap_head);
+
+	if (!old_val) {
+		new_val = (unsigned long)spte;
+	} else if (!(old_val & KVM_RMAP_MANY)) {
 		desc = kvm_mmu_memory_cache_alloc(cache);
-		desc->sptes[0] = (u64 *)rmap_head->val;
+		desc->sptes[0] = (u64 *)old_val;
 		desc->sptes[1] = spte;
 		desc->spte_count = 2;
 		desc->tail_count = 0;
-		rmap_head->val = (unsigned long)desc | 1;
+		new_val = (unsigned long)desc | KVM_RMAP_MANY;
 		++count;
 	} else {
-		desc = (struct pte_list_desc *)(rmap_head->val & ~1ul);
+		desc = (struct pte_list_desc *)(old_val & ~KVM_RMAP_MANY);
 		count = desc->tail_count + desc->spte_count;
 
 		/*
@@ -968,21 +1021,25 @@ static int pte_list_add(struct kvm_mmu_memory_cache *cache, u64 *spte,
 		 */
 		if (desc->spte_count == PTE_LIST_EXT) {
 			desc = kvm_mmu_memory_cache_alloc(cache);
-			desc->more = (struct pte_list_desc *)(rmap_head->val & ~1ul);
+			desc->more = (struct pte_list_desc *)(old_val & ~KVM_RMAP_MANY);
 			desc->spte_count = 0;
 			desc->tail_count = count;
-			rmap_head->val = (unsigned long)desc | 1;
+			new_val = (unsigned long)desc | KVM_RMAP_MANY;
+		} else {
+			new_val = old_val;
 		}
 		desc->sptes[desc->spte_count++] = spte;
 	}
+
+	kvm_rmap_write_ulock(rmap_head, new_val);
+
 	return count;
 }
 
-static void pte_list_desc_remove_entry(struct kvm *kvm,
-				       struct kvm_rmap_head *rmap_head,
+static void pte_list_desc_remove_entry(struct kvm *kvm, unsigned long *rmap_val,
 				       struct pte_list_desc *desc, int i)
 {
-	struct pte_list_desc *head_desc = (struct pte_list_desc *)(rmap_head->val & ~1ul);
+	struct pte_list_desc *head_desc = (struct pte_list_desc *)(*rmap_val & ~KVM_RMAP_MANY);
 	int j = head_desc->spte_count - 1;
 
 	/*
@@ -1009,9 +1066,9 @@ static void pte_list_desc_remove_entry(struct kvm *kvm,
 	 * head at the next descriptor, i.e. the new head.
 	 */
 	if (!head_desc->more)
-		rmap_head->val = 0;
+		*rmap_val = 0;
 	else
-		rmap_head->val = (unsigned long)head_desc->more | 1;
+		*rmap_val = (unsigned long)head_desc->more | KVM_RMAP_MANY;
 	mmu_free_pte_list_desc(head_desc);
 }
 
@@ -1019,24 +1076,26 @@ static void pte_list_remove(struct kvm *kvm, u64 *spte,
 			    struct kvm_rmap_head *rmap_head)
 {
 	struct pte_list_desc *desc;
+	unsigned long rmap_val;
 	int i;
 
-	if (KVM_BUG_ON_DATA_CORRUPTION(!rmap_head->val, kvm))
-		return;
+	rmap_val = kvm_rmap_write_lock(rmap_head);
+	if (KVM_BUG_ON_DATA_CORRUPTION(!rmap_val, kvm))
+		goto out;
 
-	if (!(rmap_head->val & 1)) {
-		if (KVM_BUG_ON_DATA_CORRUPTION((u64 *)rmap_head->val != spte, kvm))
-			return;
+	if (!(rmap_val & KVM_RMAP_MANY)) {
+		if (KVM_BUG_ON_DATA_CORRUPTION((u64 *)rmap_val != spte, kvm))
+			goto out;
 
-		rmap_head->val = 0;
+		rmap_val = 0;
 	} else {
-		desc = (struct pte_list_desc *)(rmap_head->val & ~1ul);
+		desc = (struct pte_list_desc *)(rmap_val & ~KVM_RMAP_MANY);
 		while (desc) {
 			for (i = 0; i < desc->spte_count; ++i) {
 				if (desc->sptes[i] == spte) {
-					pte_list_desc_remove_entry(kvm, rmap_head,
+					pte_list_desc_remove_entry(kvm, &rmap_val,
 								   desc, i);
-					return;
+					goto out;
 				}
 			}
 			desc = desc->more;
@@ -1044,6 +1103,9 @@ static void pte_list_remove(struct kvm *kvm, u64 *spte,
 
 		KVM_BUG_ON_DATA_CORRUPTION(true, kvm);
 	}
+
+out:
+	kvm_rmap_write_ulock(rmap_head, rmap_val);
 }
 
 static void kvm_zap_one_rmap_spte(struct kvm *kvm,
@@ -1058,17 +1120,19 @@ static bool kvm_zap_all_rmap_sptes(struct kvm *kvm,
 				   struct kvm_rmap_head *rmap_head)
 {
 	struct pte_list_desc *desc, *next;
+	unsigned long rmap_val;
 	int i;
 
-	if (!rmap_head->val)
+	rmap_val = kvm_rmap_write_lock(rmap_head);
+	if (!rmap_val)
 		return false;
 
-	if (!(rmap_head->val & 1)) {
-		mmu_spte_clear_track_bits(kvm, (u64 *)rmap_head->val);
+	if (!(rmap_val & KVM_RMAP_MANY)) {
+		mmu_spte_clear_track_bits(kvm, (u64 *)rmap_val);
 		goto out;
 	}
 
-	desc = (struct pte_list_desc *)(rmap_head->val & ~1ul);
+	desc = (struct pte_list_desc *)(rmap_val & ~KVM_RMAP_MANY);
 
 	for (; desc; desc = next) {
 		for (i = 0; i < desc->spte_count; i++)
@@ -1078,20 +1142,21 @@ static bool kvm_zap_all_rmap_sptes(struct kvm *kvm,
 	}
 out:
 	/* rmap_head is meaningless now, remember to reset it */
-	rmap_head->val = 0;
+	kvm_rmap_write_ulock(rmap_head, 0);
 	return true;
 }
 
 unsigned int pte_list_count(struct kvm_rmap_head *rmap_head)
 {
+	unsigned long rmap_val = READ_ONCE(rmap_head->val) & ~KVM_RMAP_LOCKED;
 	struct pte_list_desc *desc;
 
-	if (!rmap_head->val)
+	if (!rmap_val)
 		return 0;
-	else if (!(rmap_head->val & 1))
+	else if (!(rmap_val & KVM_RMAP_MANY))
 		return 1;
 
-	desc = (struct pte_list_desc *)(rmap_head->val & ~1ul);
+	desc = (struct pte_list_desc *)(rmap_val & ~KVM_RMAP_MANY);
 	return desc->tail_count + desc->spte_count;
 }
 
@@ -1134,6 +1199,7 @@ static void rmap_remove(struct kvm *kvm, u64 *spte)
  */
 struct rmap_iterator {
 	/* private fields */
+	struct rmap_head *head;
 	struct pte_list_desc *desc;	/* holds the sptep if not NULL */
 	int pos;			/* index of the sptep */
 };
@@ -1148,18 +1214,19 @@ struct rmap_iterator {
 static u64 *rmap_get_first(struct kvm_rmap_head *rmap_head,
 			   struct rmap_iterator *iter)
 {
+	unsigned long rmap_val = READ_ONCE(rmap_head->val) & ~KVM_RMAP_LOCKED;
 	u64 *sptep;
 
-	if (!rmap_head->val)
+	if (!rmap_val)
 		return NULL;
 
-	if (!(rmap_head->val & 1)) {
+	if (!(rmap_val & KVM_RMAP_MANY)) {
 		iter->desc = NULL;
-		sptep = (u64 *)rmap_head->val;
+		sptep = (u64 *)rmap_val;
 		goto out;
 	}
 
-	iter->desc = (struct pte_list_desc *)(rmap_head->val & ~1ul);
+	iter->desc = (struct pte_list_desc *)(rmap_val & ~KVM_RMAP_MANY);
 	iter->pos = 0;
 	sptep = iter->desc->sptes[iter->pos];
 out:
@@ -1553,6 +1620,32 @@ static __always_inline bool kvm_handle_gfn_range(struct kvm *kvm,
 	return ret;
 }
 
+static __always_inline bool kvm_handle_gfn_range_lockless(struct kvm *kvm,
+							  struct kvm_gfn_range *range,
+							  rmap_handler_t handler)
+{
+	struct kvm_rmap_head *rmap_head;
+	unsigned long rmap_val;
+	bool ret = false;
+	gfn_t gfn;
+	int level;
+
+	for (gfn = range->start; gfn < range->end; gfn++) {
+		for (level = PG_LEVEL_4K; level <= KVM_MAX_HUGEPAGE_LEVEL; level++) {
+			rmap_head = gfn_to_rmap(gfn, level, range->slot);
+			rmap_val = kvm_rmap_read_lock(rmap_head);
+
+			if (rmap_val)
+				ret |= handler(kvm, rmap_head, range->slot, gfn, level);
+
+			kvm_rmap_read_unlock(rmap_head, rmap_val);
+		}
+	}
+
+	return ret;
+}
+
+
 bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	bool flush = false;

base-commit: 771df9ffadb8204e61d3e98f36c5067102aab78f
--