Re: [PATCH 3/4] KVM: Update gfn_to_pfn_cache khva when it moves within the same page

Paolo Bonzini <pbonzini@xxxxxxxxxx> · Thu, 24 Nov 2022 01:31:11 +0100

On Tue, Nov 22, 2022 at 6:25 PM David Woodhouse <dwmw2@xxxxxxxxxxxxx> wrote:
>
> On Tue, 2022-11-22 at 16:49 +0000, Paul Durrant wrote:
> > > Tags need your real name, not just your email address, i.e. this should be:
> > >     Reviewed-by: Paul Durrant <paul@xxxxxxx>
> >
> > Yes indeed it should. Don't know how I managed to screw that up... It's
> > not like haven't type that properly hundreds of times on Xen patch reviews.
>
> All sorted in the tree I'm curating
> https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/gpc-fixes
>
> Of those, three are marked as Cc:stable and want to go into the 6.1 release:
>
>       KVM: x86/xen: Validate port number in SCHEDOP_poll
>       KVM: x86/xen: Only do in-kernel acceleration of hypercalls for guest CPL0
>       KVM: Update gfn_to_pfn_cache khva when it moves within the same page
>
> The rest (including the runstate compatibility fixes) are less
> critical.

I have picked them into both kvm/master and kvm/queue.

The gpc series probably will be left for 6.3. I had already removed Sean's
bits for the gpc and will rebase on top of your runstate compatibility fixes,
which I'm cherry-picking into kvm/queue.

But wow, is that runstate compatibility patch ugly.  Is it really worth it
having the two separate update paths, one which is ugly because of
BUILD_BUG_ON assertions and one which is ugly because of the
two-page stuff?

Like this (sorry about any word-wrapping, I'll repost it properly
after testing):

diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index b246decb53a9..873a0ded3822 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -205,7 +205,7 @@ static void kvm_xen_update_runstate_guest(struct
kvm_vcpu *v, bool atomic)
 #ifdef CONFIG_X86_64
         /*
          * Don't leak kernel memory through the padding in the 64-bit
-         * struct if we take the split-page path.
+         * struct.
          */
         memset(&rs, 0, offsetof(struct vcpu_runstate_info, state_entry_time));
 #endif
@@ -251,83 +251,6 @@ static void kvm_xen_update_runstate_guest(struct
kvm_vcpu *v, bool atomic)
         read_lock_irqsave(&gpc1->lock, flags);
     }

-    /*
-     * The common case is that it all fits on a page and we can
-     * just do it the simple way.
-     */
-    if (likely(!user_len2)) {
-        /*
-         * We use 'int *user_state' to point to the state field, and
-         * 'u64 *user_times' for runstate_entry_time. So the actual
-         * array of time[] in each state starts at user_times[1].
-         */
-        int *user_state = gpc1->khva;
-        u64 *user_times = gpc1->khva + times_ofs;
-
-        /*
-         * The XEN_RUNSTATE_UPDATE bit is the top bit of the state_entry_time
-         * field. We need to set it (and write-barrier) before writing to the
-         * the rest of the structure, and clear it last. Just as Xen does, we
-         * address the single *byte* in which it resides because it might be
-         * in a different cache line to the rest of the 64-bit word, due to
-         * the (lack of) alignment constraints.
-         */
-        BUILD_BUG_ON(sizeof_field(struct vcpu_runstate_info,
state_entry_time) !=
-                 sizeof(uint64_t));
-        BUILD_BUG_ON(sizeof_field(struct compat_vcpu_runstate_info,
state_entry_time) !=
-                 sizeof(uint64_t));
-        BUILD_BUG_ON((XEN_RUNSTATE_UPDATE >> 56) != 0x80);
-
-        update_bit = ((u8 *)(&user_times[1])) - 1;
-        *update_bit = (vx->runstate_entry_time | XEN_RUNSTATE_UPDATE) >> 56;
-        smp_wmb();
-
-        /*
-         * Next, write the new runstate. This is in the *same* place
-         * for 32-bit and 64-bit guests, asserted here for paranoia.
-         */
-        BUILD_BUG_ON(offsetof(struct vcpu_runstate_info, state) !=
-                 offsetof(struct compat_vcpu_runstate_info, state));
-        BUILD_BUG_ON(sizeof_field(struct vcpu_runstate_info, state) !=
-                 sizeof(vx->current_runstate));
-        BUILD_BUG_ON(sizeof_field(struct compat_vcpu_runstate_info, state) !=
-                 sizeof(vx->current_runstate));
-        *user_state = vx->current_runstate;
-
-        /*
-         * Then the actual runstate_entry_time (with the UPDATE bit
-         * still set).
-         */
-        *user_times = vx->runstate_entry_time | XEN_RUNSTATE_UPDATE;
-
-        /*
-         * Write the actual runstate times immediately after the
-         * runstate_entry_time.
-         */
-        BUILD_BUG_ON(offsetof(struct vcpu_runstate_info, state_entry_time) !=
-                 offsetof(struct vcpu_runstate_info, time) - sizeof(u64));
-        BUILD_BUG_ON(offsetof(struct compat_vcpu_runstate_info,
state_entry_time) !=
-                 offsetof(struct compat_vcpu_runstate_info, time) -
sizeof(u64));
-        BUILD_BUG_ON(sizeof_field(struct vcpu_runstate_info, time) !=
-                 sizeof_field(struct compat_vcpu_runstate_info, time));
-        BUILD_BUG_ON(sizeof_field(struct vcpu_runstate_info, time) !=
-                 sizeof(vx->runstate_times));
-        memcpy(user_times + 1, vx->runstate_times, sizeof(vx->runstate_times));
-
-        smp_wmb();
-
-        /*
-         * Finally, clear the 'updating' bit. Don't use &= here because
-         * the compiler may not realise that update_bit and user_times
-         * point to the same place. That's a classic pointer-aliasing
-         * problem.
-         */
-        *update_bit = vx->runstate_entry_time >> 56;
-        smp_wmb();
-
-        goto done_1;
-    }
-
     /*
      * The painful code path. It's split across two pages and we need to
      * hold and validate both GPCs simultaneously. Thankfully we can get
@@ -336,7 +259,7 @@ static void kvm_xen_update_runstate_guest(struct
kvm_vcpu *v, bool atomic)
      */
     read_lock(&gpc2->lock);

-    if (!kvm_gpc_check(v->kvm, gpc2, gpc2->gpa, user_len2)) {
+    if (user_len2 && !kvm_gpc_check(v->kvm, gpc2, gpc2->gpa, user_len2)) {
         read_unlock(&gpc2->lock);
         read_unlock_irqrestore(&gpc1->lock, flags);

@@ -361,6 +284,20 @@ static void kvm_xen_update_runstate_guest(struct
kvm_vcpu *v, bool atomic)
         goto retry;
     }

+    /*
+     * The XEN_RUNSTATE_UPDATE bit is the top bit of the state_entry_time
+     * field. We need to set it (and write-barrier) before writing to the
+     * the rest of the structure, and clear it last. Just as Xen does, we
+     * address the single *byte* in which it resides because it might be
+     * in a different cache line to the rest of the 64-bit word, due to
+     * the (lack of) alignment constraints.
+     */
+    BUILD_BUG_ON(sizeof_field(struct vcpu_runstate_info, state_entry_time) !=
+             sizeof(uint64_t));
+    BUILD_BUG_ON(sizeof_field(struct compat_vcpu_runstate_info,
state_entry_time) !=
+             sizeof(uint64_t));
+    BUILD_BUG_ON((XEN_RUNSTATE_UPDATE >> 56) != 0x80);
+
     /*
      * Work out where the byte containing the XEN_RUNSTATE_UPDATE bit is.
      */
@@ -370,6 +307,17 @@ static void kvm_xen_update_runstate_guest(struct
kvm_vcpu *v, bool atomic)
         update_bit = ((u8 *)gpc2->khva) + times_ofs + sizeof(u64) - 1 -
             user_len1;

+    /*
+     * Next, write the new runstate. This is in the *same* place
+     * for 32-bit and 64-bit guests, asserted here for paranoia.
+     */
+    BUILD_BUG_ON(offsetof(struct vcpu_runstate_info, state) !=
+             offsetof(struct compat_vcpu_runstate_info, state));
+    BUILD_BUG_ON(sizeof_field(struct vcpu_runstate_info, state) !=
+             sizeof(vx->current_runstate));
+    BUILD_BUG_ON(sizeof_field(struct compat_vcpu_runstate_info, state) !=
+             sizeof(vx->current_runstate));
+
     /*
      * Create a structure on our stack with everything in the right place.
      * The rs_state pointer points to the start of it, which in the case
@@ -378,29 +326,44 @@ static void kvm_xen_update_runstate_guest(struct
kvm_vcpu *v, bool atomic)
      */
     *rs_state = vx->current_runstate;
     rs.state_entry_time = vx->runstate_entry_time | XEN_RUNSTATE_UPDATE;
-    memcpy(rs.time, vx->runstate_times, sizeof(vx->runstate_times));
-
     *update_bit = rs.state_entry_time >> 56;
     smp_wmb();

+    /*
+     * Write the actual runstate times immediately after the
+     * runstate_entry_time.
+     */
+    BUILD_BUG_ON(offsetof(struct vcpu_runstate_info, state_entry_time) !=
+             offsetof(struct vcpu_runstate_info, time) - sizeof(u64));
+    BUILD_BUG_ON(offsetof(struct compat_vcpu_runstate_info,
state_entry_time) !=
+             offsetof(struct compat_vcpu_runstate_info, time) - sizeof(u64));
+    BUILD_BUG_ON(sizeof_field(struct vcpu_runstate_info, time) !=
+             sizeof_field(struct compat_vcpu_runstate_info, time));
+    BUILD_BUG_ON(sizeof_field(struct vcpu_runstate_info, time) !=
+             sizeof(vx->runstate_times));
+    memcpy(rs.time, vx->runstate_times, sizeof(vx->runstate_times));
+
     /*
      * Having constructed the structure starting at *rs_state, copy it
      * into the first and second pages as appropriate using user_len1
      * and user_len2.
      */
     memcpy(gpc1->khva, rs_state, user_len1);
-    memcpy(gpc2->khva, ((u8 *)rs_state) + user_len1, user_len2);
+    if (user_len2)
+        memcpy(gpc2->khva, ((u8 *)rs_state) + user_len1, user_len2);
     smp_wmb();

     /*
-     * Finally, clear the XEN_RUNSTATE_UPDATE bit.
+     * Finally, clear the 'updating' bit. Don't use &= here because
+     * the compiler may not realise that update_bit and user_times
+     * point to the same place. That's a classic pointer-aliasing
+     * problem.
      */
     *update_bit = vx->runstate_entry_time >> 56;
     smp_wmb();

     if (user_len2)
         read_unlock(&gpc2->lock);
- done_1:
     read_unlock_irqrestore(&gpc1->lock, flags);

     mark_page_dirty_in_slot(v->kvm, gpc1->memslot, gpc1->gpa >> PAGE_SHIFT);


? Yet another possibility is to introduce a

/* Copy from src to dest_ofs bytes into the combined area pointed to by
 * dest1 (up to dest1_len bytes) and dest2 (the rest). */
void split_memcpy(void *dest1, void *dest2, size_t dest_ofs, size_t
dest1_len, void *src, size_t src_len)

so that the on-stack struct is not needed at all. This makes it possible to
avoid the rs_state hack as well.

It's in kvm/queue only, so there's time to include a new version.

Paolo

> Sean, given that this now includes your patch series which in turn you
> took over from Michal, how would you prefer me to proceed?
>
> David Woodhouse (7):
>       KVM: x86/xen: Validate port number in SCHEDOP_poll
>       KVM: x86/xen: Only do in-kernel acceleration of hypercalls for guest CPL0
>       KVM: x86/xen: Add CPL to Xen hypercall tracepoint
>       MAINTAINERS: Add KVM x86/xen maintainer list
>       KVM: x86/xen: Compatibility fixes for shared runstate area
>       KVM: Update gfn_to_pfn_cache khva when it moves within the same page
>       KVM: x86/xen: Add runstate tests for 32-bit mode and crossing page boundary
>
> Michal Luczaj (6):
>       KVM: Shorten gfn_to_pfn_cache function names
>       KVM: x86: Remove unused argument in gpc_unmap_khva()
>       KVM: Store immutable gfn_to_pfn_cache properties
>       KVM: Use gfn_to_pfn_cache's immutable "kvm" in kvm_gpc_check()
>       KVM: Clean up hva_to_pfn_retry()
>       KVM: Use gfn_to_pfn_cache's immutable "kvm" in kvm_gpc_refresh()
>
> Sean Christopherson (4):
>       KVM: Drop KVM's API to allow temporarily unmapping gfn=>pfn cache
>       KVM: Do not partially reinitialize gfn=>pfn cache during activation
>       KVM: Drop @gpa from exported gfn=>pfn cache check() and refresh() helpers
>       KVM: Skip unnecessary "unmap" if gpc is already valid during refresh
>
> We can reinstate the 'immutable length' thing on top, if we pick one of
> the discussed options for coping with the fact that for the runstate
> area, it *isn't* immutable. I'm slightly leaning towards just setting
> the length to '1' despite it being a lie.