Hi Vitaly,
On 5/5/20 6:16 PM, Vitaly Kuznetsov wrote:
On 4/29/20 7:36 PM, Vitaly Kuznetsov wrote:
If two page ready notifications happen back to back, the second one is not
delivered and the only mechanism we currently have is the
kvm_check_async_pf_completion() check in the vcpu_run() loop. The check will
only be performed on the next vmexit, whenever that happens, and in some cases
it may take a while. With interrupt based page ready notification delivery
the situation is even worse: unlike exceptions, interrupts are not handled
immediately, so we must check if the slot is empty. This is slow and
unnecessary. Introduce a dedicated MSR_KVM_ASYNC_PF_ACK MSR to communicate
the fact that the slot is free and the host should check its notification
queue. Mandate using it for interrupt based type 2 APF event delivery.
Signed-off-by: Vitaly Kuznetsov <vkuznets@xxxxxxxxxx>
---
Documentation/virt/kvm/msr.rst | 16 +++++++++++++++-
arch/x86/include/uapi/asm/kvm_para.h | 1 +
arch/x86/kvm/x86.c | 9 ++++++++-
3 files changed, 24 insertions(+), 2 deletions(-)
diff --git a/Documentation/virt/kvm/msr.rst b/Documentation/virt/kvm/msr.rst
index 7433e55f7184..18db3448db06 100644
--- a/Documentation/virt/kvm/msr.rst
+++ b/Documentation/virt/kvm/msr.rst
@@ -219,6 +219,11 @@ data:
If during pagefault APF reason is 0 it means that this is regular
page fault.
+ For interrupt based delivery, the guest has to write '1' to
+ MSR_KVM_ASYNC_PF_ACK every time it clears 'reason' in the shared
+ 'struct kvm_vcpu_pv_apf_data'; this forces KVM to re-scan its
+ queue and deliver the next pending notification.
+
During delivery of type 1 APF cr2 contains a token that will
be used to notify a guest when missing page becomes
available. When page becomes available type 2 APF is sent with
@@ -340,4 +345,13 @@ data:
To switch to interrupt based delivery of type 2 APF events guests
are supposed to enable asynchronous page faults and set bit 3 in
- MSR_KVM_ASYNC_PF_EN first.
+
+MSR_KVM_ASYNC_PF_ACK:
+ 0x4b564d07
+
+data:
+ Asynchronous page fault acknowledgment. When the guest is done
+ processing a type 2 APF event and the 'reason' field in 'struct
+ kvm_vcpu_pv_apf_data' is cleared, it is supposed to write '1' to
+ bit 0 of the MSR; this causes the host to re-scan its queue and
+ check if there are more notifications pending.
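If I'm reading the documentation right, the guest side of this protocol would
be roughly the sketch below (illustration only: kvm_async_pf_task_wake(),
apf_reason and MSR_KVM_ASYNC_PF_ACK exist, but the handler name and the way
the token reaches it are made up here):

	/* Hypothetical guest 'page ready' handler following the documented protocol. */
	static void kvm_apf_page_ready(u32 token)
	{
		/* Wake the task that went to sleep on this token. */
		kvm_async_pf_task_wake(token);

		/* Free the notification slot shared with the host... */
		__this_cpu_write(apf_reason.reason, 0);

		/*
		 * ...and acknowledge it, so the host can deliver the next pending
		 * notification right away instead of waiting for
		 * kvm_check_async_pf_completion() on some future vmexit.
		 */
		wrmsrl(MSR_KVM_ASYNC_PF_ACK, 1);
	}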
I'm not sure I understood the usage of MSR_KVM_ASYNC_PF_ACK
completely. It seems it's used to trap to the host, to give it a chance
to check/deliver pending page ready events. If there are no pending
events, no work will be done in the trap. If that's true, it might
be a good idea to trap conditionally, i.e. only write to ASYNC_PF_ACK
if there really are pending events?
How does the guest know whether the host has any pending events or not?
One way would be through struct kvm_vcpu_pv_apf_data, which is visible
to both host and guest. In the host, the workers that have completed their
work (GUP) are queued into vcpu->async_pf.done. So we would need a bit in
struct kvm_vcpu_pv_apf_data, written by the host and read by the guest, to
indicate whether there is pending work. I don't even think the writer/reader
need to be synchronized.
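Something like the below is what I had in mind; note that the 'pending' field
does not exist today, it would be a new host-written/guest-read byte in
struct kvm_vcpu_pv_apf_data, and the function name is made up:

	/*
	 * Guest side, assuming a hypothetical 'pending' byte in
	 * struct kvm_vcpu_pv_apf_data which the host sets while
	 * vcpu->async_pf.done is non-empty. 'reason' has already been
	 * cleared at this point.
	 */
	static void kvm_apf_ack_if_needed(void)
	{
		/* Nothing queued on the host side, skip the vmexit entirely. */
		if (!__this_cpu_read(apf_reason.pending))
			return;

		wrmsrl(MSR_KVM_ASYNC_PF_ACK, 1);
	}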
The problem we're trying to address with the ACK msr is the following:
imagine the host has two 'page ready' notifications back to back. It puts
the token for the first one in the slot and raises an IRQ, but how do we know
when the slot becomes free so we can inject the second one? Currently,
we have the kvm_check_async_pf_completion() check in the vcpu_run() loop but
this doesn't guarantee timely delivery of the event, we just hope that
there's going to be a vmexit 'some time soon' and we'll piggyback onto
that. Normally this works, but in some special cases it may take really
long before a vmexit happens. Also, we're penalizing each vmexit with an
unneeded check. The ACK msr is intended to solve these issues.
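Roughly speaking, the host side then boils down to handling the new MSR in
kvm_set_msr_common() and re-running the existing completion check from there
(just a sketch, the actual arch/x86/kvm/x86.c hunk is not quoted above):

	case MSR_KVM_ASYNC_PF_ACK:
		if (data & 1)
			/* The guest freed the slot, try to deliver the next token. */
			kvm_check_async_pf_completion(vcpu);
		break;

i.e. the same check vcpu_run() already does on every iteration, just triggered
immediately by the guest instead of by the next vmexit.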
Thanks for the explanation. I think the vmexit frequency is somewhat proportional
to the workload, meaning more vmexits would be observed if the VM has more
work to do, and async page faults are part of that workload. I agree it makes sense
to proactively poke the pending ('page ready') events, but it would be reasonable
to do so conditionally, to avoid the overhead. However, I'm not sure how much overhead
it would have on x86 :)
Thanks,
Gavin