Re: [PATCH 2/4] KVM: x86/xen: Compatibility fixes for shared runstate area

On Wed, 2022-11-23 at 17:17 +0000, Sean Christopherson wrote:
> On Tue, Nov 22, 2022, David Woodhouse wrote:
> > Yeah, that's tempting. Going back to gfn_to_hva_cache was the first
> > thing I tried. There are a handful of complexifying factors, none of
> > which are insurmountable if we try hard enough.
> > 
> >  • Even if we fix the gfn_to_hva_cache to still use the fast path for
> >    more than the first page, it's still possible for the runstate to
> >    cross from one memslot to an adjacent one. We probably still need
> >    two of them, which is a large part of the ugliness of this patch.
> 
> Hrm.  What if KVM requires that the runstate be contiguous in host virtual address
> space?  That wouldn't violate Xen's ABI since it's a KVM restriction, and it can't
> break backwards compatibility since KVM doesn't even handle page splits, let alone
> memslot splits.  AFAIK, most (all?) userspace VMMs make guest RAM virtually
> contiguous anyways.
> 
> Probably a moot point since I don't see a way around the "page got swapped out"
> issue.

Right.

> >  • Accessing it via the userspace HVA requires coping with the case
> >    where the vCPU is being scheduled out, and it can't sleep. The
> >    previous code before commit 34d41d19e dealt with that by using 
> >    pagefault_disable() and depending on the GHC fast path. But even
> >    so...
> > 
> >  • We also could end up having to touch page B, page A then page B
> >    again, potentially having to abort and leave the runstate with
> >    the XEN_RUNSTATE_UPDATE bit still set. I do kind of prefer the
> >    version which checks that both pages are present before it starts
> >    writing anything to the guest.
> 
> Ugh, and because of the in-atomic case, there's nothing KVM can do to remedy page B
> getting swapped out.  Actually, it doesn't even require a B=>A=>B pattern.  Even
> without a page split, the backing page could get swapped out between setting and
> clearing.

Indeed. Which is why I quite like the "lock the whole lot down and then
do the update" approach in my existing patch.
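
Roughly this shape, glossing over the exact GPC helper names
(kvm_gpc_check() / kvm_gpc_refresh() below are just stand-ins, and the
error handling is abbreviated):

        /*
         * Sketch only: take both GPC locks up front so that once we start
         * writing, nothing can yank a page out from under us mid-update.
         */
        read_lock_irqsave(&gpc1->lock, flags);
        while (!kvm_gpc_check(gpc1, user_len1)) {
                read_unlock_irqrestore(&gpc1->lock, flags);

                /* If we're being scheduled out we can't sleep to refresh. */
                if (atomic)
                        return;

                if (kvm_gpc_refresh(gpc1, user_len1))
                        return;

                read_lock_irqsave(&gpc1->lock, flags);
        }

        if (user_len2) {
                /* Second page, nested under the first lock: same dance. */
                read_lock(&gpc2->lock);
                if (!kvm_gpc_check(gpc2, user_len2)) {
                        read_unlock(&gpc2->lock);
                        read_unlock_irqrestore(&gpc1->lock, flags);
                        /* ... refresh gpc2 (if !atomic) and retry ... */
                        return;
                }
        }

        /* Both mappings held: set XEN_RUNSTATE_UPDATE, write, barrier, clear. */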

> > As I said, they're not insurmountable but that's why I ended up with
> > the patch you see, after first attempting to use a gfn_to_hva_cache
> > again. Happy to entertain suggestions.
> 
> What if we use a gfn_to_pfn_cache for the page containing XEN_RUNSTATE_UPDATE,
> and then use kvm_vcpu_write_guest_atomic() if there's a page split?  That would
> avoid needing to acquire multiple gpc locks, and the update could use a single
> code flow by always constructing an on-stack representation.  E.g. very roughly:

Even in the 'split' case, I don't think we want to be doing the full
memslot lookup for kvm_vcpu_write_guest_inatomic() every time we
schedule the guest vCPU in and out. We do want the cache, and a
kvm_vcpu_write_guest_cached_inatomic() function.
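
Purely as a sketch of the semantics I mean (not a real API; the
generation/validity handling is abbreviated), something like:

static int kvm_vcpu_write_guest_cached_inatomic(struct kvm_vcpu *vcpu,
                                                struct gfn_to_hva_cache *ghc,
                                                const void *data,
                                                unsigned long len)
{
        if (WARN_ON_ONCE(len > ghc->len))
                return -EINVAL;

        /* Stale after a memslot change?  The caller refreshes, not us. */
        if (ghc->generation != kvm_vcpu_memslots(vcpu)->generation)
                return -EAGAIN;

        if (unlikely(!ghc->memslot))
                return -EFAULT;

        /* Must not sleep: punt on any fault and let the caller cope. */
        pagefault_disable();
        if (__copy_to_user_inatomic((void __user *)ghc->hva, data, len)) {
                pagefault_enable();
                return -EFAULT;
        }
        pagefault_enable();

        mark_page_dirty_in_slot(vcpu->kvm, ghc->memslot, ghc->gpa >> PAGE_SHIFT);
        return 0;
}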

So instead of using two GPCs in the split case, that would mean we end
up using a gfn_to_pfn_cache for one page and a gfn_to_hva_cache for the
other?

And with or without that cache, we can *still* end up doing a partial
update if the page goes away. The byte with the XEN_RUNSTATE_UPDATE bit
might still be accessible, but all bets are off about what state the
rest of the structure is in. And those runtimes are supposed to add up,
or the guest is going to get unhappy.

I'm actually OK with locking two GPCs. It wasn't my first choice, but
it's reasonable enough IMO given that none of the alternatives jump out
as being particularly attractive either.

> > I note we have a kvm_read_guest_atomic() and briefly pondered adding a
> > 'write' version of same... but that doesn't seem to cope with crossing
> > memslots either; it also assumes its caller only uses it to access
> > within a single page.
> 
> We should fix the lack of page split handling.

Agreed.

Even if it starts with just KVM_BUG_ON(offset + len > PAGE_SIZE) to
avoid surprises.
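
Eventually it presumably just grows the same per-page segment loop that
kvm_vcpu_write_guest() already has, something like the below (with a
hypothetical __kvm_vcpu_write_guest_page_inatomic() doing the actual
single-page copy under pagefault_disable()):

int kvm_vcpu_write_guest_inatomic(struct kvm_vcpu *vcpu, gpa_t gpa,
                                  const void *data, unsigned long len)
{
        gfn_t gfn = gpa >> PAGE_SHIFT;
        int offset = offset_in_page(gpa);
        int seg, ret;

        /* Walk page-sized segments so callers needn't care about splits. */
        while ((seg = next_segment(len, offset)) != 0) {
                ret = __kvm_vcpu_write_guest_page_inatomic(vcpu, gfn, data,
                                                           offset, seg);
                if (ret < 0)
                        return ret;
                offset = 0;
                len -= seg;
                data += seg;
                ++gfn;
        }
        return 0;
}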

I'd *love* to make the GPC cope with page splits, and just give us a
virtually contiguous mapping of up to N pages for some reasonable value
of N. I just can't see how to do that in the IOMEM case.
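
For pages with struct page backing it's easy enough to imagine: pin the
pages and vmap() them. Hand-waved sketch (npages/uhva are placeholders,
error paths and unpinning omitted):

        struct page *pages[16];        /* assume some sane cap on npages */
        void *khva;

        if (pin_user_pages_unlocked(uhva & PAGE_MASK, npages, pages,
                                    FOLL_WRITE) != npages)
                return -EFAULT;

        khva = vmap(pages, npages, VM_MAP, PAGE_KERNEL);
        if (!khva)
                return -EFAULT;

        gpc->khva = khva + offset_in_page(uhva);

        /*
         * The IOMEM case is the sticking point: those PFNs have no struct
         * page to hand to vmap(), and the physical range needn't even be
         * contiguous.
         */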
