On Mon, 2024-08-05 at 17:45 -0700, Sean Christopherson wrote:
> On Mon, Aug 05, 2024, David Woodhouse wrote:
> > From: David Woodhouse <dwmw@xxxxxxxxxxxx>
> > 
> > The existing retry loop in hva_to_pfn_retry() is extremely pessimistic.
> > If there is an invalidation running concurrently, it is effectively just
> > a complex busy wait loop because its local mmu_notifier_retry_cache()
> > function will always return true.
> > 
> > It ends up functioning as a very unfair read/write lock. If userspace is
> > acting as a 'writer', performing many unrelated MM changes, then the
> > hva_to_pfn_retry() function acting as the 'reader' just backs off and
> > keeps retrying forever, not making any progress.
> > 
> > Solve this by introducing a separate 'validating' flag to the GPC, so
> > that it can be marked invalid before it's even mapped. This allows the
> > invalidation to be moved to the range_end hook, and the retry loop in
> > hva_to_pfn_retry() can be changed to loop only if its particular uHVA
> > has been affected.
> 
> I think I'm missing something. How does allowing hva_to_pfn_retry() to
> proceed allow KVM as a whole to make forward progress? Doesn't getting
> past hva_to_pfn_retry() just move the problem to kvm_gpc_check()?
> 
> kvm_gpc_refresh() can't be called with gpc->lock held, nor does it return
> with gpc->lock held, so a racing mmu_notifier invalidation can/will acquire
> gpc->lock and clear gpc->active, no?
> 
> Oh, by "unrelated", you mean _completely_ unrelated? As in, KVM happens to do a
> refresh when userspace is blasting MADV_DONTNEED, and gets stuck retrying for
> no good reason?

Right. And again, I have no idea why userspace is doing that, and we need
to make it stop. But that's not really the point; the kernel should be
able to make progress anyway.

> Servicing guest page faults has the same problem, which is why
> mmu_invalidate_retry_gfn() was added. Supporting hva-only GPCs made our lives a
> little harder, but not horrifically so (there are ordering differences regardless).
> 
> Woefully incomplete, but I think this is the gist of what you want:

Hm, maybe. It does mean that migration occurring all through memory
(indeed, even just one invalidation at the top and one at the bottom of
the guest memory space) would perturb GPCs which remain present.

> > @@ -849,6 +837,8 @@ static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
> >  	wake = !kvm->mn_active_invalidate_count;
> >  	spin_unlock(&kvm->mn_invalidate_lock);
> > 
> > +	gfn_to_pfn_cache_invalidate(kvm, range->start, range->end);
> 
> We can't do this. The contract with mmu_notifiers is that secondary MMUs must
> unmap the hva before returning from invalidate_range_start(), and must not create
> new mappings until invalidate_range_end().

But in the context of the GPC, it is only "mapped" when the ->valid bit is
set. Even the invalidation callback just clears the valid bit, and that
means nobody is allowed to dereference the ->khva any more. It doesn't
matter that the underlying (stale) PFN is still kmapped.

Can we not apply the same logic to the hva_to_pfn_retry() loop? Yes, it
might kmap a page that gets removed, but it hasn't actually created a new
mapping if it hasn't set the ->valid bit.

I don't think this version quite meets the constraints, and I might need
to hook *both* the start and end notifiers, and might not like it once I
get there. But I'll have a go...
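
Just to make the shape of it concrete, something like the sketch below is
what I have in mind. It's only a sketch, not the actual patch: the locking
and error handling are heavily simplified (the memremap path, unmapping of
any previously-mapped PFN, etc. are omitted), and the hva_to_pfn()
arguments are from memory, so treat the details as illustrative. The
'validating' flag is the point.

	/* Invoked from the invalidate_range_end() notifier. */
	static void gfn_to_pfn_cache_invalidate(struct kvm *kvm,
						unsigned long start,
						unsigned long end)
	{
		struct gfn_to_pfn_cache *gpc;

		spin_lock(&kvm->gpc_lock);
		list_for_each_entry(gpc, &kvm->gpc_list, list) {
			write_lock_irq(&gpc->lock);
			if (gpc->uhva >= start && gpc->uhva < end) {
				/*
				 * Clearing ->valid is the real "unmap";
				 * nobody may dereference ->khva once it is
				 * clear, even though the stale PFN remains
				 * kmapped for now.
				 */
				gpc->valid = false;

				/*
				 * Tell a concurrent hva_to_pfn_retry() that
				 * *its* uHVA was hit, so that only affected
				 * refreshes retry.
				 */
				gpc->validating = false;
			}
			write_unlock_irq(&gpc->lock);
		}
		spin_unlock(&kvm->gpc_lock);
	}

	/* Called and returns with gpc->lock held for write. */
	static int hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
	{
		kvm_pfn_t new_pfn;

		do {
			gpc->validating = true;
			write_unlock_irq(&gpc->lock);

			/* Resolve the PFN without gpc->lock held. */
			new_pfn = hva_to_pfn(gpc->uhva, false, false, NULL,
					     true, NULL);

			write_lock_irq(&gpc->lock);
			if (is_error_noslot_pfn(new_pfn))
				return -EFAULT;

			/*
			 * Retry only if an invalidation cleared ->validating,
			 * i.e. only if it hit this GPC's uHVA. Unrelated MM
			 * changes no longer force a retry, so this stops
			 * being an unfair rwlock with userspace as the
			 * permanent writer.
			 */
		} while (!gpc->validating);

		gpc->khva = kmap(pfn_to_page(new_pfn));
		gpc->pfn = new_pfn;
		gpc->valid = true;
		gpc->validating = false;
		return 0;
	}

As noted above, this shape probably still needs the range_start hook too,
so that a refresh racing with an in-progress invalidation (between
range_start and range_end) can't set ->valid on a just-invalidated uHVA.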