Re: [PATCH] KVM: pfncache: rework __kvm_gpc_refresh() to fix locking issues

"Woodhouse, David" <dwmw@xxxxxxxxxxxx> · Wed, 17 Jan 2024 16:32:35 +0000

On Wed, 2024-01-17 at 08:09 -0800, Sean Christopherson wrote:
> On Fri, Jan 12, 2024, David Woodhouse wrote:
> > This function can race with kvm_gpc_deactivate(). Since that function
> > does not take the ->refresh_lock, it can wipe and unmap the pfn and
> > khva while hva_to_pfn_retry() has dropped its write lock on gpc->lock.
> > 
> > Then if hva_to_pfn_retry() determines that the PFN hasn't changed and
> > that it can re-use the old pfn and khva, they get assigned back to
> > gpc->pfn and gpc->khva even though the khva was already unmapped by
> > kvm_gpc_deactivate(). This leaves the cache in an apparently valid
> > state but with ->khva pointing to an address which has been unmapped.
> > Which in turn leads to oopses in e.g. __kvm_xen_has_interrupt() and
> > set_shinfo_evtchn_pending().
> > 
> > It may be possible to fix this just by making kvm_gpc_deactivate()
> > take the ->refresh_lock, but that still leaves ->refresh_lock being
> > basically redundant with the write lock on ->lock, which frankly
> > makes my skin itch, with the way that pfn_to_hva_retry() operates on
> > fields in the gpc without holding ->lock.
> > 
> > Instead, fix it by cleaning up the semantis of hva_to_pfn_retry(). It
> > no longer operates on the gpc object at all; it's called with a uhva
> > and returns the corresponding pfn (pinned), and a mapped khva for it.
> > 
> > The calling function __kvm_gpc_refresh() now drops ->lock before calling
> > hva_to_pfn_retry(), then retakes the lock before checking for changes,
> > and discards the new mapping if it lost a race. And will correctly
> > note the old pfn/khva to be unmapped at the right time, instead of
> > preserving them in a local variable while dropping the lock.
> > 
> > The optimisation in hva_to_pfn_retry() where it attempts to use the
> > old mapping if the pfn doesn't change is dropped, since it makes the
> > pinning more complex. It's a pointless optimisation anyway, since the
> > odds of the pfn ending up the same when the uhva has changed (i.e.
> > the odds of the two userspace addresses both pointing to the same
> > underlying physical page) are negligible,
> > 
> > I remain slightly confused because although this is clearly a race in
> > the gfn_to_pfn_cache code, I don't quite know how the Xen support code
> > actually managed to trigger it. We've seen oopses from dereferencing a
> > valid-looking ->khva in both __kvm_xen_has_interrupt() (the vcpu_info)
> > and in set_shinfo_evtchn_pending() (the shared_info). But surely the
> > race shouldn't happen for the vcpu_info gpc because all calls to both
> > refresh and deactivate hold the vcpu mutex, and it shouldn't happen
> 
> FWIW, neither kvm_xen_destroy_vcpu() nor kvm_xen_destroy_vm() holds the appropriate
> mutex.  

Those shouldn't be implicated in the cases where we've seen it happen.
And I think it needs the GPC to be left in !active,valid state due to
the race and then *reactivated*, while still marked 'valid'. Which
can't happen after the destroy paths.

> 
> > for the shared_info gpc because all calls to both will hold the
> > kvm->arch.xen.xen_lock mutex.
> > 
> > Signed-off-by: David Woodhouse <dwmw@xxxxxxxxxxxx>
> > ---
> > 
> > This is based on (and in) my tree at
> > https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/xenfv
> > which has all the other outstanding KVM/Xen fixes.
> > 
> >  virt/kvm/pfncache.c | 181 +++++++++++++++++++++-----------------------
> >  1 file changed, 85 insertions(+), 96 deletions(-)
> 
> NAK, at least as a bug fix.  We've already shuffled deck chairs on the Titanic
> several times, I have zero confidence that doing so one more time is going to
> truly solve the underlying mess.

Agreed, but as it stands, especially with refresh_lock, this is just
overly complex. We should make the rwlock stand alone, and not have
code which drops the lock and then makes assumptions that things won't
change.

> The contract with the gfn_to_pfn_cache, or rather the lack thereof, is all kinds
> of screwed up.  E.g. I added the mutex in commit 93984f19e7bc ("KVM: Fully serialize
> gfn=>pfn cache refresh via mutex") to guard against concurrent unmap(), but the
> unmap() API has since been removed.  We need to define an actual contract instead
> of continuing to throw noodles as the wall in the hope that something sticks.
> 
> As you note above, some other mutex _should_ be held.  I think we should lean
> into that.  E.g.

I don't. I'd like this code to stand alone *without* making the caller
depend on "some other lock" just for its own internal consistency.

>   1. Pass in the guarding mutex to kvm_gpc_init() and assert that said mutex is
>      held for __refresh(), activate(), and deactivate().
>   2. Fix the cases where that doesn't hold true.
>   3. Drop refresh_mutex
> 

I'll go for (3) but I disagree about (1) and (2). Just let the rwlock
work as $DEITY intended, which is what this patch is doing. It's a
cleanup.

(And I didn't drop refresh_lock so far partly because it wants to be
done in a separate commit, but also because it does provide an
optimisation, as noted.
Attachment:
smime.p7s

Description: S/MIME cryptographic signature

Amazon Development Centre (London) Ltd.Registered in England and Wales with registration number 04543232 with its registered office at 1 Principal Place, Worship Street, London EC2A 2FA, United Kingdom.