Re: [PATCH v4 15/15] KVM: arm64: Add async PF document

Eric Auger <eauger@xxxxxxxxxx> · Thu, 11 Nov 2021 11:39:46 +0100

Hi Gavin,

On 8/15/21 2:59 AM, Gavin Shan wrote:
> This adds document to explain the interface for asynchronous page
> fault and how it works in general.
> 
> Signed-off-by: Gavin Shan <gshan@xxxxxxxxxx>
> ---
>  Documentation/virt/kvm/arm/apf.rst   | 143 +++++++++++++++++++++++++++
>  Documentation/virt/kvm/arm/index.rst |   1 +
>  2 files changed, 144 insertions(+)
>  create mode 100644 Documentation/virt/kvm/arm/apf.rst
> 
> diff --git a/Documentation/virt/kvm/arm/apf.rst b/Documentation/virt/kvm/arm/apf.rst
> new file mode 100644
> index 000000000000..4f5c01b6699f
> --- /dev/null
> +++ b/Documentation/virt/kvm/arm/apf.rst
> @@ -0,0 +1,143 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +Asynchronous Page Fault Support for arm64
> +=========================================
> +
> +There are two stages of page faults when KVM module is enabled as accelerator
> +to the guest. The guest is responsible for handling the stage-1 page faults,
> +while the host handles the stage-2 page faults. During the period of handling
> +the stage-2 page faults, the guest is suspended until the requested page is
> +ready. It could take several milliseconds, even hundreds of milliseconds in
 s/It could take several milliseconds, even hundreds of milliseconds/It
could take up to hundreds of milliseconds
> +extreme situations because I/O might be required to move the requested page
> +from disk to DRAM. The guest does not do any work when it is suspended. The
> +feature (Asynchronous Page Fault) is introduced to take advantage of the
s/The feature (Asynchronous Page Fault)/ The Asynchronous Page Fault
feature allows to improve the overall performance by allowing the guest
to reschedule ... ?
> +suspending period and to improve the overall performance.
> +
> +There are two paths in order to fulfil the asynchronous page fault, called
> +as control path and data path.
The asynchronous page fault is implemented upon a control path and a
data path?
 The control path allows the VMM or guest to
> +configure the functionality, while the notifications are delivered in data
> +path. The notifications are classified into page-not-present and page-ready
> +notifications.
> +
> +Data Path
> +---------
> +
> +There are two types of notifications delivered from host to guest in the
> +data path: page-not-present and page-ready notification. They are delivered
> +through SDEI event and (PPI) interrupt separately.
s/separately/respectively
 Besides, there is a shared
> +buffer between host and guest to indicate the reason and sequential token,
s/to indicate/that indicates
Can you clarify 'reason'?
Also a sequential token is used ...
> +which is used to identify the asynchronous page fault. The reason and token
> +resident in the shared buffer is written by host, read and cleared by guest.
s/is/are
> +An asynchronous page fault is delivered and completed as below.
> +
> +(1) When an asynchronous page fault starts, a (workqueue) worker is created
> +    and queued to the vCPU's pending queue. The worker makes the requested
> +    page ready and resident to DRAM in the background. The shared buffer is
> +    updated with reason and sequential token. After that, SDEI event is sent
> +    to guest as page-not-present notification.
This gives the impression the SDEI event is sent after the worker
completes the job. I think you should rephrase.
> +
> +(2) When the SDEI event is received on guest, the current process is tagged
> +    with TIF_ASYNC_PF and associated with a wait queue. The process is ready
> +    to keep rescheduling itself on switching from kernel to user mode. After
above sentence sounds a bit cryptic to me: ~ waits for being rescheduled
later?
> +    that, a reschedule IPI is sent to current CPU and the received SDEI event
> +    is acknowledged. Note that the IPI is delivered when the acknowledgment
> +    on the SDEI event is received on host.
> +
> +(3) On the host, the worker is dequeued from the vCPU's pending queue and
> +    enqueued to its completion queue when the requested page becomes ready.
> +    In the mean while, KVM_REQ_ASYNC_PF request is sent the vCPU if the
in the meanwhile here and below
> +    worker is the first element enqueued to the completion queue.
I think you should remind what is the intent of this KVM_REQ_ASYNC_PF
request, ie. notify that the page is ready.
> +
> +(4) With pending KVM_REQ_ASYNC_PF request, the first worker in the completion
> +    queue is dequeued and destroyed. In the mean while, a (PPI) interrupt is

> +    sent to guest with updated reason and token in the shared buffer.
> +
> +(5) When the (PPI) interrupt is received on guest, the affected process is
> +    located using the token and waken up after its TIF_ASYNC_PF tag is cleared.
> +    After that, the interrupt is acknowledged through SMCCC interface. The
> +    workers in the completion queue is dequeued and destroyed if any workers
the worker
Isn't it destroyed even if no other worker exist?
> +    exist, and another (PPI) interrupt is sent to the guest.

I think you should briefly remind the motivation of SDEI and PPI
mechanism for both synchros. Maybe by doing an analogy with x86
implementation?
> +
> +Control Path
> +------------
> +
> +The configurations are passed through SMCCC or ioctl interface. The SDEI
> +event and (PPI) interrupt are owned by VMM, so the SDEI event and interrupt
> +numbers are configured through ioctl command on per-vCPU basis.
The "owned" terminology looks weird here. Do you mean the SDEI event
number and the PPI ID are defined by the VMM userspace?

 Besides,
> +the functionality might be enabled and configured through ioctl interface
> +by VMM during migration:
> +
> +   * KVM_ARM_ASYNC_PF_CMD_GET_VERSION
> +
> +     Returns the current version of the feature, supported by the host. It is
> +     made up of major, minor and revision fields. Each field is one byte in
> +     length.
> +
> +   * KVM_ARM_ASYNC_PF_CMD_GET_SDEI:
> +
> +     Retrieve the SDEI event number, used for page-not-present notification,
> +     so that it can be configured on destination VM in the scenario of
> +     migration.
> +
> +   * KVM_ARM_ASYNC_PF_GET_IRQ:
> +
> +     Retrieve the IRQ (PPI) number, used for page-ready notification, so that
> +     it can be configured on destination VM in the scenario of migration.
> +
> +   * KVM_ARM_ASYNC_PF_CMD_GET_CONTROL
> +
> +     Retrieve the address of control block, so that it can be configured on
> +     destination VM in the scenario of migration.
> +
> +   * KVM_ARM_ASYNC_PF_CMD_SET_SDEI:
> +
> +     Used by VMM to configure number of SDEI event, which is used to deliver
> +     page-not-present notification by host. This is used when VM is started
> +     or migrated.
> +
> +   * KVM_ARM_ASYNC_PF_CMD_SET_IRQ
> +
> +     Used by VMM to configure number of (PPI) interrupt, which is used to
> +     deliver page-ready notification by host. This is used when VM is started
> +     or migrated.
> +
> +   * KVM_ARM_ASYNC_PF_CMD_SET_CONTROL
> +
> +     Set the control block on the destination VM in the scenario of migration.
What is the size of this control block?
> +
> +The other configurations are passed through SMCCC interface. The host exports
> +the capability through KVM vendor specific service, which is identified by
> +ARM_SMCCC_KVM_FUNC_ASYNC_PF_FUNC_ID. There are several functions defined for
> +this:
> +
> +   * ARM_SMCCC_KVM_FUNC_ASYNC_PF_VERSION
> +
> +     Returns the current version of the feature, supported by the host. It is
> +     made up of major, minor and revision fields. Each field is one byte in
> +     length.
> +
> +   * ARM_SMCCC_KVM_FUNC_ASYNC_PF_SLOTS
> +
> +     Returns the size of the hashed GFN table. It is used by guest to set up
by the guest
> +     the capacity of waiting process table.
> +
> +   * ARM_SMCCC_KVM_FUNC_ASYNC_PF_SDEI
> +   * ARM_SMCCC_KVM_FUNC_ASYNC_PF_IRQ
> +
> +     Used by the guest to retrieve the SDEI event and (PPI) interrupt number
> +     that are configured by VMM.
How does the guest recognize which SDEI event num it shall register.
Same question for PPI? What if we were to expose the guest with several
SDEIs?
> +
> +   * ARM_SMCCC_KVM_FUNC_ASYNC_PF_ENABLE
> +
> +     Used by the guest to enable or disable the feature on the specific vCPU.
> +     The argument is made up of shared buffer and flags. The shared buffer
> +     is written by host to indicate the reason about the delivered asynchronous
> +     page fault and token (sequence number) to identify that. There are two
> +     flags are supported: KVM_ASYNC_PF_ENABLED is used to enable or disable
> +     the feature. KVM_ASYNC_PF_SEND_ALWAYS allows to deliver page-not-present
> +     notification regardless of the guest's state. Otherwise, the notification
> +     is delivered only when the guest is in user mode.
> +
> +   * ARM_SMCCC_KVM_FUNC_ASYNC_PF_IRQ_ACK

How does it compare to x86? I mean there are a huger number of IOCTLs
and SMCCC calls to achieve the functionaly. Was the x86 implementation
as invasive as this implementation?

Thanks

Eric
> +
> +     Used by the guest to acknowledge the completion of page-ready notification.
> diff --git a/Documentation/virt/kvm/arm/index.rst b/Documentation/virt/kvm/arm/index.rst
> index 78a9b670aafe..f43b5fe25f61 100644
> --- a/Documentation/virt/kvm/arm/index.rst
> +++ b/Documentation/virt/kvm/arm/index.rst
> @@ -7,6 +7,7 @@ ARM
>  .. toctree::
>     :maxdepth: 2
>  
> +   apf
>     hyp-abi
>     psci
>     pvtime
>