Re: [PATCH] KVM: x86: Use gfn_to_pfn_cache for steal_time

David Woodhouse <dwmw2@xxxxxxxxxxxxx> · Fri, 02 Aug 2024 13:03:16 +0100

On Fri, 2024-08-02 at 11:44 +0000, Carsten Stollmaier wrote:
> On vcpu_run, before entering the guest, the update of the steal time
> information causes a page-fault if the page is not present. In our
> scenario, this gets handled by do_user_addr_fault and successively
> handle_userfault since we have the region registered to that.
> 
> handle_userfault uses TASK_INTERRUPTIBLE, so it is interruptible by
> signals. do_user_addr_fault then busy-retries it if the pending signal
> is non-fatal. This leads to contention of the mmap_lock.

The busy-loop causes so much contention on mmap_lock that post-copy
live migration fails to make progress, and is leading to failures. Yes?

> This patch replaces the use of gfn_to_hva_cache with gfn_to_pfn_cache,
> as gfn_to_pfn_cache ensures page presence for the memory access,
> preventing the contention of the mmap_lock.
> 
> Signed-off-by: Carsten Stollmaier <stollmc@xxxxxxxxxx>

Reviewed-by: David Woodhouse <dwmw@xxxxxxxxxxxx>

I think this makes sense on its own, as it addresses the specific case
where KVM is *likely* to be touching a userfaulted (guest) page. And it
allows us to ditch yet another explicit asm exception handler.

We should note, though, that in terms of the original problem described
above, it's a bit of a workaround. It just means that by using
kvm_gpc_refresh() to obtain the user page, we end up in
handle_userfault() without the FAULT_FLAG_INTERRUPTIBLE flag.

(Note to self: should kvm_gpc_refresh() take fault flags, to allow
interruptible and killable modes to be selected by its caller?)


An alternative workaround (which perhaps we should *also* consider)
looked like this (plus some suitable code comment, of course):

--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1304,6 +1304,8 @@ void do_user_addr_fault(struct pt_regs *regs,
         */
        if (user_mode(regs))
                flags |= FAULT_FLAG_USER;
+       else
+               flags &= ~FAULT_FLAG_INTERRUPTIBLE;
 
 #ifdef CONFIG_X86_64
        /*


That would *also* handle arbitrary copy_to_user/copy_from_user() to
userfault pages, which could theoretically hit the same busy loop.

I'm actually tempted to make user access *interruptible* though, and
either add copy_{from,to}_user_interruptible() or change the semantics
of the existing ones (which I believe are already killable).

That would require each architecture implementing interruptible
exceptions, by doing an extable lookup before the retry. Not overly
complex, but needs to be done for all architectures (although not at
once; we could live with not-yet-done architectures just remaining
killable).

Thoughts?

Attachment:
smime.p7s

Description: S/MIME cryptographic signature