Re: [RFC PATCH 0/6] KVM: x86: async PF user

On 11/02/2025 21:17, Sean Christopherson wrote:
On Mon, Nov 18, 2024, Nikita Kalyazin wrote:
Async PF [1] allows other processes to run on a vCPU while the host
handles a stage-2 fault caused by a process on that vCPU. When using
VM-exit-based stage-2 fault handling [2], async PF functionality is lost
because KVM does not run the vCPU while a fault is being handled so no
other process can execute on the vCPU. This patch series extends
VM-exit-based stage-2 fault handling with async PF support by letting
userspace handle faults instead of the kernel, hence the "async PF user"
name.

I discussed the idea with Paolo, Sean, David H, and James H at LPC,
and the only concern I heard was about injecting the "page not present"
event via #PF exception in the CoCo case, where it may not work. In my
implementation, I reused the existing code for doing that, so the async
PF user implementation is on par with the present async PF
implementation in this regard, and support for the CoCo case can be
added separately.

Please note that this series is applied on top of the VM-exit-based
stage-2 fault handling RFC [2].

...

Nikita Kalyazin (6):
   Documentation: KVM: add userfault KVM exit flag
   Documentation: KVM: add async pf user doc
   KVM: x86: add async ioctl support
   KVM: trace events: add type argument to async pf
   KVM: x86: async_pf_user: add infrastructure
   KVM: x86: async_pf_user: hook to fault handling and add ioctl

  Documentation/virt/kvm/api.rst  |  35 ++++++
  arch/x86/include/asm/kvm_host.h |  12 +-
  arch/x86/kvm/Kconfig            |   7 ++
  arch/x86/kvm/lapic.c            |   2 +
  arch/x86/kvm/mmu/mmu.c          |  68 ++++++++++-
  arch/x86/kvm/x86.c              | 101 +++++++++++++++-
  arch/x86/kvm/x86.h              |   2 +
  include/linux/kvm_host.h        |  30 +++++
  include/linux/kvm_types.h       |   1 +
  include/trace/events/kvm.h      |  50 +++++---
  include/uapi/linux/kvm.h        |  12 +-
  virt/kvm/Kconfig                |   3 +
  virt/kvm/Makefile.kvm           |   1 +
  virt/kvm/async_pf.c             |   2 +-
  virt/kvm/async_pf_user.c        | 197 ++++++++++++++++++++++++++++++++
  virt/kvm/async_pf_user.h        |  24 ++++
  virt/kvm/kvm_main.c             |  14 +++
  17 files changed, 535 insertions(+), 26 deletions(-)

I am supportive of the idea, but there is way too much copy+paste in this series.

Hi Sean,

Yes, as I mentioned in the cover letter, I deliberately kept the new implementation isolated to make the scope of the change clear. There is certainly a lot of duplication that should be removed later on.

And it's not just the code itself, it's all the structures and concepts.  Off the
top of my head, I can't think of any reason there needs to be a separate queue,
separate lock(s), etc.  The only difference between kernel APF and user APF is
what chunk of code is responsible for faulting in the page.

There are two queues involved:
- "queue": stores in-flight faults. APF-kernel uses it to cancel all works if needed. APF-user has no way to "cancel" userspace works, but it uses the queue to look up the struct by its token when userspace reports a completion.
- "ready": stores completed faults until KVM finds a chance to tell the guest about them.

I agree that the "ready" queue can be shared between APF-kernel and -user, as it's used in the same way. As for the "queue" queue, do you think it's ok to process its elements differently based on their type in a single loop [1], instead of having two separate queues?

[1] https://elixir.bootlin.com/linux/v6.13.2/source/virt/kvm/async_pf.c#L120

I suspect a good place to start would be something along the lines of the below
diff, and go from there.  Given that KVM already needs to special case the fake
"wake all" items, I'm guessing it won't be terribly difficult to teach the core
flows about userspace async #PF.

That sounds sensible. I can certainly approach it in a "bottom up" way by sparingly adding handling where it's different in APF-user rather than adding it side by side and trying to merge common parts.

I'm also not sure that injecting async #PF for all userfaults is desirable.  For
in-kernel async #PF, KVM knows that faulting in the memory would sleep.  For
userfaults, KVM has no way of knowing if the userfault will sleep, i.e. should
be handled via async #PF.  The obvious answer is to have userspace only enable
userspace async #PF when it's useful, but "an all or nothing" approach isn't
great uAPI.  On the flip side, adding uAPI for a use case that doesn't exist
doesn't make sense either :-/

I wasn't able to locate the code in APF-kernel that checks whether faulting would sleep. KVM sets up async PF whenever it can ([2]). Please let me know if I'm missing something here.

[2] https://elixir.bootlin.com/linux/v6.13.2/source/arch/x86/kvm/mmu/mmu.c#L4360

Exiting to userspace in vCPU context is also kludgy.  It makes sense for base
userfault, because the vCPU can't make forward progress until the fault is
resolved.  Actually, I'm not even sure it makes sense there.  I'll follow-up in

Even though we exit to userspace, in the APF-user case userspace is supposed to re-enter the VM straight after scheduling the async job, which then executes concurrently with the vCPU.

James' series.  Anyways, it definitely doesn't make sense for async #PF, because
the whole point is to let the vCPU run.  Signalling userspace would definitely
add complexity, but only because of the need to communicate the token and wait
for userspace to consume said token.  I'll think more on that.

By signalling userspace, do you mean a new non-exit-to-userspace mechanism similar to UFFD? What advantage do you see in it over exiting to userspace (which already exists in James' series)?


Thanks,
Nikita


diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
index 0ee4816b079a..fc31b47cf9c5 100644
--- a/virt/kvm/async_pf.c
+++ b/virt/kvm/async_pf.c
@@ -177,7 +177,8 @@ void kvm_check_async_pf_completion(struct kvm_vcpu *vcpu)
   * success, 'false' on failure (page fault has to be handled synchronously).
   */
  bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
-                       unsigned long hva, struct kvm_arch_async_pf *arch)
+                       unsigned long hva, struct kvm_arch_async_pf *arch,
+                       bool userfault)
  {
         struct kvm_async_pf *work;

@@ -202,13 +203,16 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
         work->addr = hva;
         work->arch = *arch;

-       INIT_WORK(&work->work, async_pf_execute);
-
         list_add_tail(&work->queue, &vcpu->async_pf.queue);
         vcpu->async_pf.queued++;
         work->notpresent_injected = kvm_arch_async_page_not_present(vcpu, work);

-       schedule_work(&work->work);
+       if (userfault) {
+               work->userfault = true;
+       } else {
+               INIT_WORK(&work->work, async_pf_execute);
+               schedule_work(&work->work);
+       }

         return true;
  }




