On Fri, 2024-08-02 at 11:44 +0000, Carsten Stollmaier wrote: > On vcpu_run, before entering the guest, the update of the steal time > information causes a page-fault if the page is not present. In our > scenario, this gets handled by do_user_addr_fault and successively > handle_userfault since we have the region registered to that. > > handle_userfault uses TASK_INTERRUPTIBLE, so it is interruptible by > signals. do_user_addr_fault then busy-retries it if the pending signal > is non-fatal. This leads to contention of the mmap_lock. The busy-loop causes so much contention on mmap_lock that post-copy live migration fails to make progress, and is leading to failures. Yes? > This patch replaces the use of gfn_to_hva_cache with gfn_to_pfn_cache, > as gfn_to_pfn_cache ensures page presence for the memory access, > preventing the contention of the mmap_lock. > > Signed-off-by: Carsten Stollmaier <stollmc@xxxxxxxxxx> Reviewed-by: David Woodhouse <dwmw@xxxxxxxxxxxx> I think this makes sense on its own, as it addresses the specific case where KVM is *likely* to be touching a userfaulted (guest) page. And it allows us to ditch yet another explicit asm exception handler. We should note, though, that in terms of the original problem described above, it's a bit of a workaround. It just means that by using kvm_gpc_refresh() to obtain the user page, we end up in handle_userfault() without the FAULT_FLAG_INTERRUPTIBLE flag. (Note to self: should kvm_gpc_refresh() take fault flags, to allow interruptible and killable modes to be selected by its caller?) An alternative workaround (which perhaps we should *also* consider) looked like this (plus some suitable code comment, of course): --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -1304,6 +1304,8 @@ void do_user_addr_fault(struct pt_regs *regs, */ if (user_mode(regs)) flags |= FAULT_FLAG_USER; + else + flags &= ~FAULT_FLAG_INTERRUPTIBLE; #ifdef CONFIG_X86_64 /* That would *also* handle arbitrary copy_to_user/copy_from_user() to userfault pages, which could theoretically hit the same busy loop. I'm actually tempted to make user access *interruptible* though, and either add copy_{from,to}_user_interruptible() or change the semantics of the existing ones (which I believe are already killable). That would require each architecture implementing interruptible exceptions, by doing an extable lookup before the retry. Not overly complex, but needs to be done for all architectures (although not at once; we could live with not-yet-done architectures just remaining killable). Thoughts?
Attachment:
smime.p7s
Description: S/MIME cryptographic signature