Re: [PATCH v5 4/9] mm: Add test_clear_young_fast_only MMU notifier

On Mon, Jun 17, 2024 at 11:37 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>
> On Mon, Jun 17, 2024, James Houghton wrote:
> > On Fri, Jun 14, 2024 at 4:17 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> > > Ooh!  Actually, after fiddling a bit to see how feasible fast-aging in the shadow
> > > MMU would be, I'm pretty sure we can go straight there for nested TDP.  Or rather,
> > > I suspect/hope we can get close enough for an initial merge, which would allow
> > > aging_is_fast to be a property of the mmu_notifier, i.e. would simplify things
> > > because KVM wouldn't need to communicate MMU_NOTIFY_WAS_FAST for each notification.
> > >
> > > Walking KVM's rmaps requires mmu_lock because adding/removing rmap entries is done
> > > in such a way that a lockless walk would be painfully complex.  But if there is
> > > exactly _one_ rmap entry for a gfn, then slot->arch.rmap[...] points directly at
> > > that one SPTE.  And with nested TDP, unless L1 is doing something uncommon, e.g.
> > > mapping the same page into multiple L2s, the overwhelming majority of rmaps
> > > have only one entry.  That's not the case for legacy shadow paging because kernels
> > > almost always map a pfn using multiple virtual addresses, e.g. Linux's direct map
> > > along with any userspace mappings.

Hi Sean, sorry for taking so long to get back to you.

So just to make sure I have this right: if L1 is using TDP, the gfns
in L0 will usually only be mapped by a single spte. If L1 is not using
TDP, then all bets are off. Is that true?

If that is true, then given that we don't really have control over
whether or not L1 decides to use TDP, the lockless shadow MMU walk
will work, but when L1 is not using TDP it will often return false
negatives (report "old" for an actually-young gfn). So I don't really
understand conditioning the lockless shadow MMU walk on us (L0) using
the TDP MMU[1]. We care about L1, right?

(Maybe you're saying that, when the TDP MMU is enabled, the only cases
where the shadow MMU is used are cases where gfns are practically
always mapped by a single shadow PTE. This isn't how I understood your
mail, but this is what your hack-a-patch[1] makes me think.)

[1] https://lore.kernel.org/linux-mm/ZmzPoW7K5GIitQ8B@xxxxxxxxxx/
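
Side note, mostly to check my own understanding of the encoding: bit 0
of rmap_head->val clear means val *is* the lone SPTE pointer, bit 0 set
means it points to a pte_list_desc chain (see pte_list_add() in the
diff further down). A hypothetical helper, purely to illustrate the
single-SPTE check; this is not part of the diff:

/*
 * Illustration only (hypothetical helper, not in the diff below):
 *   rmap_head->val == 0        -> no SPTEs map this gfn at this level
 *   (rmap_head->val & 1) == 0  -> val is the one and only SPTE pointer
 *   (rmap_head->val & 1) == 1  -> val points to a pte_list_desc chain
 */
static u64 *rmap_get_single_spte(struct kvm_rmap_head *rmap_head)
{
	unsigned long val = READ_ONCE(rmap_head->val);

	if (!val || (val & 1))
		return NULL;	/* empty, or more than one SPTE */

	return (u64 *)val;
}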

>
> ...
>
> > Hmm, interesting. I need to spend a little bit more time digesting this.
> >
> > Would you like to see this included in v6? (It'd be nice to avoid the
> > WAS_FAST stuff....) Should we leave it for a later series? I haven't
> > formed my own opinion yet.
>
> I would say it depends on the viability and complexity of my idea.  E.g. if it
> pans out more or less like my rough sketch, then it's probably worth taking on
> the extra code+complexity in KVM to avoid the whole WAS_FAST goo.
>
> Note, if we do go this route, the implementation would need to be tweaked to
> handle the difference in behavior between aging and last-minute checks for eviction,
> which I obviously didn't understand when I threw together that hack-a-patch.
>
> I need to think more about how best to handle that though, e.g. skipping GFNs with
> multiple mappings is probably the worst possible behavior, as we'd risk evicting
> hot pages.  But falling back to taking mmu_lock for write isn't all that desirable
> either.

I think falling back to the write lock is more desirable than evicting
a young page.

I've attached what I think could work: a diff on top of this series.
It builds, at least. It uses rcu_read_lock()/rcu_read_unlock() for
walk_shadow_page_lockless_begin/end(NULL), and it puts a
synchronize_rcu() in kvm_mmu_commit_zap_page().
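
To spell out the pairing (these are just the relevant lines from the
diff below, pulled together; the comments are annotations, not new
code):

	/* Reader side: a non-vCPU lockless walk of the rmaps. */
	walk_shadow_page_lockless_begin(NULL);		/* rcu_read_lock() */
	/* ... READ_ONCE(rmap_head->val), test/clear the Accessed bit ... */
	walk_shadow_page_lockless_end(NULL);		/* rcu_read_unlock() */

	/* Updater side: kvm_mmu_commit_zap_page(). */
	kvm_flush_remote_tlbs(kvm);
	synchronize_rcu();	/* wait out any lockless walkers */
	/* ... then kvm_mmu_free_shadow_page() for everything on invalid_list ... */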

It doesn't get rid of the WAS_FAST things because it doesn't do
exactly what [1] does. It basically makes three calls now: the
lockless TDP MMU walk, the lockless shadow MMU walk, and the locked
shadow MMU walk. It only falls back to the locked shadow MMU walk if
the lockless walks say !young (instead of conditioning it on
tdp_mmu_enabled). My choice is definitely questionable for the clear
path.
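
So the aging path boils down to roughly this (simplified from the
kvm_age_gfn() hunk below, with the report_fast bookkeeping dropped):

	bool young = false;

	/* 1: lockless TDP MMU walk */
	if (tdp_mmu_enabled)
		young |= kvm_tdp_mmu_age_gfn_range(kvm, range);

	/* 2: lockless shadow MMU walk */
	young |= kvm_handle_gfn_range_lockless(kvm, range, kvm_age_rmap_fast);

	/* 3: locked shadow MMU walk, only if 1 and 2 found nothing */
	if (!young && !range->arg.fast_only && kvm_has_shadow_mmu_sptes(kvm)) {
		write_lock(&kvm->mmu_lock);
		young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
		write_unlock(&kvm->mmu_lock);
	}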

Thanks!
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 841cee3f346d..548d1e480de9 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -640,6 +640,37 @@ static bool mmu_spte_age(u64 *sptep)
 	return true;
 }
 
+/*
+ * Similar to mmu_spte_age(), but this one should be used for lockless shadow
+ * page table walks.
+ */
+static bool mmu_spte_age_lockless(u64 *sptep)
+{
+	u64 old_spte = mmu_spte_get_lockless(sptep);
+	u64 new_spte;
+
+	if (!is_accessed_spte(old_spte))
+		return false;
+
+	if (spte_ad_enabled(old_spte)) {
+		clear_bit((ffs(shadow_accessed_mask) - 1),
+			  (unsigned long *)sptep);
+	} else {
+		new_spte = mark_spte_for_access_track(old_spte);
+		if (!try_cmpxchg64(sptep, &old_spte, new_spte))
+			/*
+			 * If the spte changed, it's likely that the gfn
+			 * is young.
+			 */
+			return true;
+
+		if (is_writable_pte(old_spte))
+			kvm_set_pfn_dirty(spte_to_pfn(old_spte));
+	}
+
+	return true;
+}
+
 static inline bool is_tdp_mmu_active(struct kvm_vcpu *vcpu)
 {
 	return tdp_mmu_enabled && vcpu->arch.mmu->root_role.direct;
@@ -647,6 +678,11 @@ static inline bool is_tdp_mmu_active(struct kvm_vcpu *vcpu)
 
 static void walk_shadow_page_lockless_begin(struct kvm_vcpu *vcpu)
 {
+	if (!vcpu) {
+		rcu_read_lock();
+		return;
+	}
+
 	if (is_tdp_mmu_active(vcpu)) {
 		kvm_tdp_mmu_walk_lockless_begin();
 	} else {
@@ -666,6 +702,11 @@ static void walk_shadow_page_lockless_begin(struct kvm_vcpu *vcpu)
 
 static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
 {
+	if (!vcpu) {
+		rcu_read_unlock();
+		return;
+	}
+
 	if (is_tdp_mmu_active(vcpu)) {
 		kvm_tdp_mmu_walk_lockless_end();
 	} else {
@@ -949,14 +990,14 @@ static int pte_list_add(struct kvm_mmu_memory_cache *cache, u64 *spte,
 	int count = 0;
 
 	if (!rmap_head->val) {
-		rmap_head->val = (unsigned long)spte;
+		WRITE_ONCE(rmap_head->val, (unsigned long)spte);
 	} else if (!(rmap_head->val & 1)) {
 		desc = kvm_mmu_memory_cache_alloc(cache);
 		desc->sptes[0] = (u64 *)rmap_head->val;
 		desc->sptes[1] = spte;
 		desc->spte_count = 2;
 		desc->tail_count = 0;
-		rmap_head->val = (unsigned long)desc | 1;
+		WRITE_ONCE(rmap_head->val, (unsigned long)desc | 1);
 		++count;
 	} else {
 		desc = (struct pte_list_desc *)(rmap_head->val & ~1ul);
@@ -971,7 +1012,7 @@ static int pte_list_add(struct kvm_mmu_memory_cache *cache, u64 *spte,
 			desc->more = (struct pte_list_desc *)(rmap_head->val & ~1ul);
 			desc->spte_count = 0;
 			desc->tail_count = count;
-			rmap_head->val = (unsigned long)desc | 1;
+		WRITE_ONCE(rmap_head->val, (unsigned long)desc | 1);
 		}
 		desc->sptes[desc->spte_count++] = spte;
 	}
@@ -1009,9 +1050,10 @@ static void pte_list_desc_remove_entry(struct kvm *kvm,
 	 * head at the next descriptor, i.e. the new head.
 	 */
 	if (!head_desc->more)
-		rmap_head->val = 0;
+		WRITE_ONCE(rmap_head->val, 0);
 	else
-		rmap_head->val = (unsigned long)head_desc->more | 1;
+		WRITE_ONCE(rmap_head->val,
+			   (unsigned long)head_desc->more | 1);
 	mmu_free_pte_list_desc(head_desc);
 }
 
@@ -1028,7 +1070,7 @@ static void pte_list_remove(struct kvm *kvm, u64 *spte,
 		if (KVM_BUG_ON_DATA_CORRUPTION((u64 *)rmap_head->val != spte, kvm))
 			return;
 
-		rmap_head->val = 0;
+		WRITE_ONCE(rmap_head->val, 0);
 	} else {
 		desc = (struct pte_list_desc *)(rmap_head->val & ~1ul);
 		while (desc) {
@@ -1078,7 +1120,7 @@ static bool kvm_zap_all_rmap_sptes(struct kvm *kvm,
 	}
 out:
 	/* rmap_head is meaningless now, remember to reset it */
-	rmap_head->val = 0;
+	WRITE_ONCE(rmap_head->val, 0);
 	return true;
 }
 
@@ -1634,17 +1676,64 @@ static bool kvm_has_shadow_mmu_sptes(struct kvm *kvm)
 	return !tdp_mmu_enabled || READ_ONCE(kvm->arch.indirect_shadow_pages);
 }
 
+static bool kvm_age_rmap_fast(u64 *sptep)
+{
+	return mmu_spte_age_lockless(sptep);
+}
+
+static bool kvm_test_age_rmap_fast(u64 *sptep)
+{
+	return is_accessed_spte(READ_ONCE(*sptep));
+}
+
+typedef bool (*rmap_lockless_handler_t)(u64 *sptep);
+
+static __always_inline bool kvm_handle_gfn_range_lockless(
+		struct kvm *kvm, struct kvm_gfn_range *range,
+		rmap_lockless_handler_t handler)
+{
+	struct kvm_rmap_head *rmap;
+	u64 *sptep;
+	gfn_t gfn;
+	int level;
+	bool ret = false;
+
+	walk_shadow_page_lockless_begin(NULL);
+
+	for (gfn = range->start; gfn < range->end; gfn++) {
+		for (level = PG_LEVEL_4K; level <= KVM_MAX_HUGEPAGE_LEVEL;
+		     level++) {
+			rmap = gfn_to_rmap(gfn, level, range->slot);
+			sptep = (void *)READ_ONCE(rmap->val);
+
+			/* Skip this gfn if multiple SPTEs map it */
+			if ((unsigned long)sptep & 1)
+				continue;
+
+			ret |= handler(sptep);
+		}
+	}
+
+	walk_shadow_page_lockless_end(NULL);
+
+	return ret;
+}
+
 bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
-	bool young = false;
+	bool young = false, shadow_young = false;
 
-	if (tdp_mmu_enabled) {
+	if (tdp_mmu_enabled)
 		young |= kvm_tdp_mmu_age_gfn_range(kvm, range);
-		if (young)
-			range->arg.report_fast = true;
-	}
 
-	if (!range->arg.fast_only && kvm_has_shadow_mmu_sptes(kvm)) {
+	shadow_young = kvm_handle_gfn_range_lockless(kvm, range,
+					       kvm_age_rmap_fast);
+	young |= shadow_young;
+	if (young)
+		range->arg.report_fast = true;
+	else if (!shadow_young && !range->arg.fast_only &&
+		 kvm_has_shadow_mmu_sptes(kvm)) {
 		write_lock(&kvm->mmu_lock);
 		young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
 		write_unlock(&kvm->mmu_lock);
@@ -1657,11 +1746,15 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	bool young = false;
 
-	if (tdp_mmu_enabled) {
+	if (tdp_mmu_enabled)
 		young |= kvm_tdp_mmu_test_age_gfn(kvm, range);
-		if (young)
-			range->arg.report_fast = true;
-	}
+
+	if (!young)
+		young |= kvm_handle_gfn_range_lockless(kvm, range,
+						       kvm_test_age_rmap_fast);
+
+	if (young)
+		range->arg.report_fast = true;
 
 	if (!young && !range->arg.fast_only && kvm_has_shadow_mmu_sptes(kvm)) {
 		write_lock(&kvm->mmu_lock);
@@ -2636,6 +2729,12 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm,
 	 */
 	kvm_flush_remote_tlbs(kvm);
 
+	/*
+	 * Wait for any non-vCPU lockless shadow page table walkers to stop
+	 * using the shadow pages we're about to free.
+	 */
+	synchronize_rcu();
+
 	list_for_each_entry_safe(sp, nsp, invalid_list, link) {
 		WARN_ON_ONCE(!sp->role.invalid || sp->root_count);
 		kvm_mmu_free_shadow_page(sp);
