On Wed, 2024-01-17 at 10:14 -0800, Sean Christopherson wrote:
> On Wed, Jan 17, 2024, David Woodhouse wrote:
> > On Wed, 2024-01-17 at 08:51 -0800, Sean Christopherson wrote:
> > > On Wed, Jan 17, 2024, David Woodhouse wrote:
> > > > On Wed, 2024-01-17 at 08:09 -0800, Sean Christopherson wrote:
> > > > > On Fri, Jan 12, 2024, David Woodhouse wrote:
> > > > > As you note above, some other mutex _should_ be held. I think we should lean
> > > > > into that. E.g.
> > > >
> > > > I don't. I'd like this code to stand alone *without* making the caller
> > > > depend on "some other lock" just for its own internal consistency.
> > >
> > > Hmm, I get where you're coming from, but protecting a per-vCPU asset with
> > > vcpu->mutex is completely sane/reasonable. Xen's refresh from a completely
> > > different task is the oddball.
> >
> > Well yes, because that's what the gfn_to_pfn_cache is *for*, surely?
> >
> > If we were in a context where we could sleep and take mutexes like the
> > vcpu mutex (and without deadlocking with a thread actually running that
> > vCPU), then we could mostly just use kvm_write_guest().
> >
> > The gfn_to_pfn_cache exists specifically to handle that 'oddball' case.
> > No?
>
> As I see it, the main role of the gpc code is to handle the mmu_notifier
> interactions. I am not at all convinced that the common code should take on
> supporting "oh, and by the way, any task can use the cache from any context".

But that really was its raison d'être. It exists to allow interrupts to
be delivered and runstate to be updated from contexts which can't just
access the userspace address. The MMU notifiers are just a means to that
end: a way to know when the cache becomes invalid and we need to take a
slow path to refresh it.

> That's the "oddball" I am referring to. I'm not entirely opposed to making the
> gpc code fully standalone, but I would like to at least get to the point where
> we have very high confidence that arch.xen.xen_lock can be fully excised before
> committing to handling that use case in the common gpc code.

I fully agree that we should have high confidence that arch.xen.xen_lock
isn't needed before we actually elide it. This patch very much moves us
in the right direction.

Even if it doesn't actually fix a bug, it's a cleanup of some fairly
awful locking abuse. We *shouldn't* drop the rwlock and then assume that
its protected fields are unchanged when we take it again, "because our
caller will save us by doing our locking for us".

And while I can't see *how* the caller is failing to do that locking for
us, I'm definitely looking at a trace where the ->khva dereferenced from
an allegedly valid gfn_to_pfn_cache is a valid-looking but not-present
kernel address. That's why I was staring at the code until my brain
hurt, trying to work out how it could possibly happen. And I think the
awful locking abuse in __kvm_gpc_refresh() has to be part of it.

I still want to do a full audit of all the possible parallel code paths
before actually removing xen_lock from the paths where the
gfn_to_pfn_cache is all it protects. But this patch is a very clear
cleanup of some fairly awful locking abuse.

I'm OK with you nacking it as a "bug fix", given the uncertainty. But
let's take it as a cleanup and a simplification, one that makes the API
easier to use correctly.
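
For the avoidance of doubt, the anti-pattern I'm complaining about looks
roughly like this. It's a deliberately simplified sketch, not the real
__kvm_gpc_refresh(); the struct and the map_gfn_slow()/unmap_slow()
helpers are invented purely for illustration:

#include <linux/spinlock.h>
#include <linux/errno.h>

/*
 * Illustrative only, not the real gfn_to_pfn_cache.  map_gfn_slow() and
 * unmap_slow() are hypothetical stand-ins for the sleeping map/unmap work.
 */
struct example_cache {
	rwlock_t lock;
	unsigned long uhva;
	void *khva;
	bool valid;
};

static int example_refresh(struct example_cache *c, unsigned long new_uhva)
{
	void *new_khva;

	write_lock_irq(&c->lock);
	c->valid = false;
	c->uhva = new_uhva;
	write_unlock_irq(&c->lock);

	/* The slow mapping work may sleep, so it has to run unlocked. */
	new_khva = map_gfn_slow(new_uhva);

	write_lock_irq(&c->lock);

	/*
	 * BAD: blindly assuming ->uhva is still new_uhva because "the caller
	 * holds some other mutex".  A concurrent refresh or deactivate may
	 * have changed it while the lock was dropped.
	 *
	 * GOOD: re-check what was set before dropping the lock, and bail
	 * (or retry) if it changed under us.
	 */
	if (c->uhva != new_uhva) {
		write_unlock_irq(&c->lock);
		unmap_slow(new_khva);
		return -EAGAIN;
	}

	c->khva = new_khva;
	c->valid = true;
	write_unlock_irq(&c->lock);

	return 0;
}

The point being that once the write lock has been dropped, nothing that
was read or written under it can be trusted; either re-validate after
reacquiring it, or don't drop it at all, rather than relying on a lock
the caller may or may not be holding.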