On 2 November 2021 17:19:34 GMT, Paolo Bonzini <pbonzini@xxxxxxxxxx> wrote:
>On 02/11/21 18:11, David Woodhouse wrote:
>> On Tue, 2021-11-02 at 18:01 +0100, Paolo Bonzini wrote:
>>> On 02/11/21 17:38, David Woodhouse wrote:
>>>> This kind of makes a mockery of this repeated map/unmap dance
>>>> which I thought was supposed to avoid pinning the page
>>>
>>> The map/unmap dance is supposed to catch the moment where you'd
>>> look at a stale cache, by giving the non-atomic code a chance to
>>> update the gfn->pfn mapping.
>>
>> It might have a *chance* to do so, but it doesn't actually do it.
>>
>> As noted, a GFN→PFN mapping is really a GFN→HVA→PFN mapping. And the
>> non-atomic code *does* update the GFN→HVA part of that, correctly
>> looking at the memslots generation etc.
>>
>> But it pays absolutely no attention to the *second* part, and
>> assumes that the HVA→PFN mapping in the userspace page tables will
>> never change.
>>
>> Which isn't necessarily true, even if the underlying physical page
>> *is* pinned to avoid most cases (ksm, swap, etc.) of the *kernel*
>> changing it. Userspace still can.
>
>Yes, I agree. What I am saying is that:
>
>- the map/unmap dance is not (entirely) about whether to pin the page
>
>- the map/unmap API is not a bad API, just an incomplete implementation
>
>And I think the above comment confuses both points above.

Sorry, it took me a while to realise that by "above comment" you meant the original commit comment (which you want me to reword) rather than just what I'd said in my previous email.

How about this version? If it's OK like this then I can resubmit later today when I get back to a proper keyboard.

In commit b043138246a4 ("x86/KVM: Make sure KVM_VCPU_FLUSH_TLB flag is not missed") we switched to using a gfn_to_pfn_cache for accessing the guest steal time structure in order to allow for an atomic xchg of the preempted field. This has a couple of problems.
Firstly, kvm_map_gfn() doesn't work at all for IOMEM pages when the atomic flag is set, which it is in kvm_steal_time_set_preempted(). So a guest vCPU using an IOMEM page for its steal time would never have its preempted field set.

Secondly, the gfn_to_pfn_cache is not invalidated in all cases where it should have been. There are two stages to the GFN→PFN conversion; first the GFN is converted to a userspace HVA, and then that HVA is looked up in the process page tables to find the underlying host PFN. Correct invalidation of the latter would require being hooked up to the MMU notifiers, but that doesn't happen, so it just keeps mapping and unmapping the *wrong* PFN after the userspace page tables change.

In the !IOMEM case at least the stale page *is* pinned all the time it's cached, so it won't be freed and reused by anyone else while it's still receiving the steal time updates.

To support Xen event channel delivery I will be fixing this up and using the MMU notifiers to mark the mapping invalid at the appropriate times, giving us a way to use kvm_map_gfn() safely with an atomic fast path via the kernel mapping, and a slow fallback path for when the mapping needs to be refreshed. But for steal time reporting there's no point in a kernel mapping of it anyway, when in all the cases we care about we have a perfectly serviceable (and tautologically not stale) userspace HVA for it. We just need to implement the atomic xchg on the userspace address with appropriate exception handling, which is fairly trivial.

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.