Any further comments?

Thanks,

On Wed, Jul 20, 2022 at 08:03:15PM -0400, Peter Xu wrote:
> v2:
> - Added r-b
> - Rewrote the comment in faultin_page() for FOLL_INTERRUPTIBLE [John]
> - Dropped the controversial patch to introduce a flag for
>   __gfn_to_pfn_memslot(), instead used a boolean for now [Sean]
> - Renamed s/is_sigpending_pfn/KVM_PFN_ERR_SIGPENDING/ [Sean]
> - Changed comment in kvm_faultin_pfn() mentioning fatal signals [Sean]
>
> rfc: https://lore.kernel.org/kvm/20220617014147.7299-1-peterx@xxxxxxxxxx
> v1:  https://lore.kernel.org/kvm/20220622213656.81546-1-peterx@xxxxxxxxxx
>
> One issue was reported that libvirt won't be able to stop the virtual
> machine using QMP command "stop" during a paused postcopy migration [1].
>
> It won't work because the "stop the VM" operation requires the hypervisor
> to kick all the vcpu threads out using SIG_IPI in QEMU (which is
> translated to a SIGUSR1).  However, during a paused postcopy the vcpu
> threads are hung in handle_userfault(), so they simply do not respond to
> the kicks.  Further, the "stop" command will then hang the QMP channel.
>
> The mm has a facility to process generic signals
> (FAULT_FLAG_INTERRUPTIBLE), however it's used only in the PF handlers, not
> in GUP.  Unluckily, KVM is a heavy GUP user on guest page faults.  It
> means we won't be able to interrupt a long page fault for KVM fetching
> guest pages with what we have right now.
>
> I think it's reasonable for GUP to only listen to fatal signals, as most
> of the GUP users are not really ready to handle such a case.  But KVM is
> not such a user: KVM has rich infrastructure to handle even generic
> signals, and to properly deliver the signal to userspace.  Then the page
> fault can be retried in the next KVM_RUN.
>
> This patchset adds FOLL_INTERRUPTIBLE to enable FAULT_FLAG_INTERRUPTIBLE,
> and lets KVM be the first one to use it.  KVM and mm/gup have always been
> able to respond to fatal signals, but not to non-fatal ones until this
> patchset.
>
> One thing to mention is that this does not allow all KVM paths to respond
> to non-fatal signals, only x86 slow page faults.  In the future, when more
> code is ready for handling signal interruptions, we can explore the
> possibility of having more gup callers use FOLL_INTERRUPTIBLE.
>
> Tests
> =====
>
> I created a postcopy environment, paused the migration by shutting down
> the network to emulate a network failure (so that handle_userfault() gets
> stuck for a long time), then I tried three things:
>
>   (1) Sending QMP command "stop" to QEMU monitor,
>   (2) Hitting Ctrl-C from QEMU cmdline,
>   (3) GDB attach to the dest QEMU process.
>
> Before this patchset, all three use cases hang.  After the patchset, all
> work just like when there's no network failure at all.
>
> Please have a look, thanks.
>
> [1] https://gitlab.com/qemu-project/qemu/-/issues/1052

-- 
Peter Xu
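
For anyone skimming the thread, the signal policy described in the cover letter can be sketched roughly as the small userspace model below. This is not the code from the series: the enum, the MODEL_FOLL_INTERRUPTIBLE value and gup_should_bail() are hypothetical names made up for illustration. Only the rule itself comes from the text above: fatal signals always interrupt GUP, while non-fatal ones do so only when the caller opted in with FOLL_INTERRUPTIBLE.

```c
#include <stdbool.h>

/* Hypothetical model of what kind of signal is pending on the task. */
enum pending_signal {
	SIG_NONE,
	SIG_NONFATAL,	/* e.g. the SIGUSR1 kick sent to a vcpu thread */
	SIG_FATAL,	/* e.g. SIGKILL */
};

/* Illustrative bit only; NOT the real kernel value of FOLL_INTERRUPTIBLE. */
#define MODEL_FOLL_INTERRUPTIBLE 0x1u

/*
 * Decide whether a GUP-style page fault loop should stop waiting.
 * Fatal signals always interrupt; non-fatal signals interrupt only
 * if the caller passed the interruptible flag.
 */
static bool gup_should_bail(enum pending_signal sig, unsigned int gup_flags)
{
	if (sig == SIG_FATAL)
		return true;
	if (sig == SIG_NONFATAL && (gup_flags & MODEL_FOLL_INTERRUPTIBLE))
		return true;
	return false;
}
```

In this model, a caller that does not pass the flag keeps the old behaviour (only fatal signals end the wait), which matches the statement that most GUP users are not ready to handle generic signals.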