On Mon, Sep 09, 2024, James Houghton wrote:
> On Fri, Aug 9, 2024 at 12:44 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> > +/*
> > + * rmaps and PTE lists are mostly protected by mmu_lock (the shadow MMU always
> > + * operates with mmu_lock held for write), but rmaps can be walked without
> > + * holding mmu_lock so long as the caller can tolerate SPTEs in the rmap chain
> > + * being zapped/dropped _while the rmap is locked_.
> > + *
> > + * Other than the KVM_RMAP_LOCKED flag, modifications to rmap entries must be
> > + * done while holding mmu_lock for write. This allows a task walking rmaps
> > + * without holding mmu_lock to concurrently walk the same entries as a task
> > + * that is holding mmu_lock but _not_ the rmap lock. Neither task will modify
> > + * the rmaps, thus the walks are stable.
> > + *
> > + * As alluded to above, SPTEs in rmaps are _not_ protected by KVM_RMAP_LOCKED,
> > + * only the rmap chains themselves are protected. E.g. holding an rmap's lock
> > + * ensures all "struct pte_list_desc" fields are stable.
>
> This last sentence makes me think we need to be careful about memory ordering.
>
> > + */
> > +#define KVM_RMAP_LOCKED	BIT(1)
> > +
> > +static unsigned long kvm_rmap_lock(struct kvm_rmap_head *rmap_head)
> > +{
> > +	unsigned long old_val, new_val;
> > +
> > +	old_val = READ_ONCE(rmap_head->val);
> > +	if (!old_val)
> > +		return 0;
> > +
> > +	do {
> > +		/*
> > +		 * If the rmap is locked, wait for it to be unlocked before
> > +		 * trying to acquire the lock, e.g. to avoid bouncing the
> > +		 * cache line.
> > +		 */
> > +		while (old_val & KVM_RMAP_LOCKED) {
> > +			old_val = READ_ONCE(rmap_head->val);
> > +			cpu_relax();
> > +		}
> > +
> > +		/*
> > +		 * Recheck for an empty rmap, it may have been purged by the
> > +		 * task that held the lock.
> > +		 */
> > +		if (!old_val)
> > +			return 0;
> > +
> > +		new_val = old_val | KVM_RMAP_LOCKED;
> > +	} while (!try_cmpxchg(&rmap_head->val, &old_val, new_val));
>
> I think we (technically) need an smp_rmb() here. I think cmpxchg
> implicitly has that on x86 (and this code is x86-only), but should we
> nonetheless document that we need smp_rmb() (if it is indeed required)?
> Perhaps we could/should condition the smp_rmb() on `if (old_val)`.

Hmm, no, not smp_rmb(). If anything, the appropriate barrier here would be
smp_mb__after_spinlock(), but I'm pretty sure even that is misleading, and
arguably even wrong.

For the !old_val case, there is an address/data dependency that can't be
broken by the CPU without violating the x86 memory model (all future actions
with relevant memory loads depend on rmap_head->val being non-zero). And
AIUI, in the Linux kernel memory model, READ_ONCE() is responsible for
ensuring that the address dependency can't be morphed into a control
dependency by the compiler and subsequently reordered by the CPU.

I.e. even if this were arm64, ignoring the LOCK CMPXCHG path for the moment,
I don't _think_ an smp_{r,w}mb() pair would be needed, as arm64's definition
of __READ_ONCE() promotes the operation to an acquire.

Back to the LOCK CMPXCHG path, KVM_RMAP_LOCKED implements a rudimentary
spinlock, hence my smp_mb__after_spinlock() suggestion. Though _because_
it's a spinlock, the rmaps are fully protected by the critical section.

And for the SPTEs, there is no required ordering. The reader (aging thread)
can observe a !PRESENT or a PRESENT SPTE, and must be prepared for either.
I.e. there is no requirement that the reader observe a PRESENT SPTE if there
is a valid rmap.
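FWIW, if there _were_ a real ordering requirement, the barrier would sit
immediately after the CMPXCHG succeeds, i.e. at the point of "acquisition",
a la the real spinlock implementations. Sketch only, to show the placement
I'm arguing against:

	} while (!try_cmpxchg(&rmap_head->val, &old_val, new_val));

	/*
	 * Hypothetical, NOT needed (IMO): this is where the critical
	 * section would be ordered after acquisition of the KVM_RMAP_LOCKED
	 * "spinlock".  A nop on x86, where the LOCK CMPXCHG above is
	 * already a full barrier.
	 */
	smp_mb__after_spinlock();

	/* Return the old value, i.e. _without_ the LOCKED bit set. */
	return old_val;
}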
So, unless I'm missing something, I would prefer to not add an
smp_mb__after_spinlock(), even though it's a nop on x86 (unless
KCSAN_WEAK_MEMORY=y), because it suggests an ordering requirement that
doesn't exist.

> kvm_rmap_lock_readonly() should have an smp_rmb(), but it seems like
> adding it here will do the right thing for the read-only lock side.
>
> > +
> > +	/* Return the old value, i.e. _without_ the LOCKED bit set. */
> > +	return old_val;
> > +}
> > +
> > +static void kvm_rmap_unlock(struct kvm_rmap_head *rmap_head,
> > +			    unsigned long new_val)
> > +{
> > +	WARN_ON_ONCE(new_val & KVM_RMAP_LOCKED);
>
> Same goes with having an smp_wmb() here. Is it necessary? If so,
> should it at least be documented?
>
> And this is *not* necessary for kvm_rmap_unlock_readonly(), IIUC.
>
> > +	WRITE_ONCE(rmap_head->val, new_val);
> > +}
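To tie this together, the usage I expect from a lockless walker is below.
Note, the function and its body are purely illustrative (no such helper
exists in the patch); the contract is simply lock, walk, then feed the
possibly-updated value back to the unlock:

static bool example_rmap_walk(struct kvm_rmap_head *rmap_head)
{
	unsigned long rmap_val = kvm_rmap_lock(rmap_head);

	/* An empty rmap was (and still is) unlocked, bail immediately. */
	if (!rmap_val)
		return false;

	/*
	 * The pte_list_desc chain is stable here, but individual SPTEs can
	 * still be zapped by the mmu_lock holder, i.e. the walker must
	 * tolerate both PRESENT and !PRESENT SPTEs.  Update rmap_val if
	 * the chain itself is modified.
	 */

	/* Writing the value back, sans KVM_RMAP_LOCKED, drops the lock. */
	kvm_rmap_unlock(rmap_head, rmap_val);
	return true;
}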