On Wed, 2024-01-17 at 08:09 -0800, Sean Christopherson wrote:
> On Fri, Jan 12, 2024, David Woodhouse wrote:
> > This function can race with kvm_gpc_deactivate(). Since that function
> > does not take the ->refresh_lock, it can wipe and unmap the pfn and
> > khva while hva_to_pfn_retry() has dropped its write lock on gpc->lock.
> >
> > Then if hva_to_pfn_retry() determines that the PFN hasn't changed and
> > that it can re-use the old pfn and khva, they get assigned back to
> > gpc->pfn and gpc->khva even though the khva was already unmapped by
> > kvm_gpc_deactivate(). This leaves the cache in an apparently valid
> > state but with ->khva pointing to an address which has been unmapped.
> > Which in turn leads to oopses in e.g. __kvm_xen_has_interrupt() and
> > set_shinfo_evtchn_pending().
> >
> > It may be possible to fix this just by making kvm_gpc_deactivate()
> > take the ->refresh_lock, but that still leaves ->refresh_lock being
> > basically redundant with the write lock on ->lock, which frankly
> > makes my skin itch, with the way that hva_to_pfn_retry() operates on
> > fields in the gpc without holding ->lock.
> >
> > Instead, fix it by cleaning up the semantics of hva_to_pfn_retry(). It
> > no longer operates on the gpc object at all; it's called with a uhva
> > and returns the corresponding pfn (pinned), and a mapped khva for it.
> >
> > The calling function __kvm_gpc_refresh() now drops ->lock before calling
> > hva_to_pfn_retry(), then retakes the lock before checking for changes,
> > and discards the new mapping if it lost a race. And will correctly
> > note the old pfn/khva to be unmapped at the right time, instead of
> > preserving them in a local variable while dropping the lock.
> >
> > The optimisation in hva_to_pfn_retry() where it attempts to use the
> > old mapping if the pfn doesn't change is dropped, since it makes the
> > pinning more complex. It's a pointless optimisation anyway, since the
> > odds of the pfn ending up the same when the uhva has changed (i.e.
> > the odds of the two userspace addresses both pointing to the same
> > underlying physical page) are negligible.
> >
> > I remain slightly confused because although this is clearly a race in
> > the gfn_to_pfn_cache code, I don't quite know how the Xen support code
> > actually managed to trigger it. We've seen oopses from dereferencing a
> > valid-looking ->khva in both __kvm_xen_has_interrupt() (the vcpu_info)
> > and in set_shinfo_evtchn_pending() (the shared_info). But surely the
> > race shouldn't happen for the vcpu_info gpc because all calls to both
> > refresh and deactivate hold the vcpu mutex, and it shouldn't happen
>
> FWIW, neither kvm_xen_destroy_vcpu() nor kvm_xen_destroy_vm() holds the appropriate
> mutex.

Those shouldn't be implicated in the cases where we've seen it happen.
And I think it needs the GPC to be left in !active,valid state due to
the race and then *reactivated*, while still marked 'valid'. Which
can't happen after the destroy paths.

> > for the shared_info gpc because all calls to both will hold the
> > kvm->arch.xen.xen_lock mutex.
> >
> > Signed-off-by: David Woodhouse <dwmw@xxxxxxxxxxxx>
> > ---
> >
> > This is based on (and in) my tree at
> > https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/xenfv
> > which has all the other outstanding KVM/Xen fixes.
> >
> >  virt/kvm/pfncache.c | 181 +++++++++++++++++++++-----------------------
> >  1 file changed, 85 insertions(+), 96 deletions(-)
>
> NAK, at least as a bug fix.  We've already shuffled deck chairs on the Titanic
> several times, I have zero confidence that doing so one more time is going to
> truly solve the underlying mess.

Agreed, but as it stands, especially with refresh_lock, this is just
overly complex. We should make the rwlock stand alone, and not have
code which drops the lock and then makes assumptions that things won't
change.

> The contract with the gfn_to_pfn_cache, or rather the lack thereof, is all kinds
> of screwed up.  E.g. I added the mutex in commit 93984f19e7bc ("KVM: Fully serialize
> gfn=>pfn cache refresh via mutex") to guard against concurrent unmap(), but the
> unmap() API has since been removed.  We need to define an actual contract instead
> of continuing to throw noodles at the wall in the hope that something sticks.
>
> As you note above, some other mutex _should_ be held.  I think we should lean
> into that.  E.g.

I don't. I'd like this code to stand alone *without* making the caller
depend on "some other lock" just for its own internal consistency.

>   1. Pass in the guarding mutex to kvm_gpc_init() and assert that said mutex is
>      held for __refresh(), activate(), and deactivate().
>   2. Fix the cases where that doesn't hold true.
>   3. Drop refresh_mutex
>

I'll go for (3) but I disagree about (1) and (2). Just let the rwlock
work as $DEITY intended, which is what this patch is doing. It's a
cleanup.

(And I didn't drop refresh_lock so far partly because it wants to be
done in a separate commit, but also because it does provide an
optimisation, as noted.)