Hi,
I wanted to share this idea with the KVM community and VMM developers.
If this isn't relevant to you but you know someone who should
participate, please feel free to add them :).

The following is an outline of "ioregionfd", a cross between ioeventfd
and KVM memory regions. This mechanism would be helpful for VMMs that
emulate devices in separate processes, for muser/VFIO, and for existing
use cases that ioeventfd cannot handle.

Background
----------
There are currently two mechanisms for dispatching MMIO/PIO accesses in
KVM: returning KVM_EXIT_MMIO/KVM_EXIT_IO from ioctl(KVM_RUN) and
ioeventfd. Some VMMs also use polling to avoid dispatching
performance-critical MMIO/PIO accesses altogether.

These mechanisms have shortcomings for VMMs that perform device
emulation in separate processes (usually for increased security):

1. Only one process performs ioctl(KVM_RUN) for a vCPU, so that
   mechanism is not available to device emulation processes.

2. ioeventfd does not store the value written. This makes it
   unsuitable, for example, for NVMe Submission Queue Tail Doorbell
   registers, where the value written is needed by the device emulation
   process. ioeventfd also does not support read operations.

3. Polling does not support computed read operations, and only the
   latest value written is available to the device emulation process
   (intermediate values are overwritten if the guest performs multiple
   accesses).

Overview
--------
This proposal aims to address this gap through a wire protocol and a
new KVM API for registering MMIO/PIO regions that use this alternative
dispatch mechanism.

The KVM API is used by the VMM to set up dispatch. The wire protocol is
used to dispatch accesses from KVM to the device emulation process.

This new MMIO/PIO dispatch mechanism eliminates the need to return from
ioctl(KVM_RUN) in the VMM and then exchange messages with a device
emulation process.

Inefficient dispatch to device processes today:

  kvm.ko <---ioctl(KVM_RUN)---> VMM <---messages---> device

Direct dispatch with the new mechanism:

  kvm.ko <---ioctl(KVM_RUN)---> VMM
     ^
     `---new MMIO/PIO mechanism-> device

Even single-process VMMs can take advantage of the new mechanism. For
example, QEMU's emulated NVMe storage controller can implement IOThread
support.

No constraint is placed on the device process architecture. A single
process could emulate all devices belonging to the guest, each device
could be its own process, or something in between.

Both ioeventfd and traditional KVM_EXIT_MMIO/KVM_EXIT_IO emulation
continue to work alongside the new mechanism, but only one of them is
used for any given guest address.

KVM API
-------
The following new KVM ioctl is added:

KVM_SET_IOREGIONFD
Capability: KVM_CAP_IOREGIONFD
Architectures: all
Type: vm ioctl
Parameters: struct kvm_ioregionfd (in)
Returns: 0 on success, !0 on error

This ioctl adds, modifies, or removes MMIO or PIO regions where guest
accesses are dispatched through a given file descriptor instead of
returning from ioctl(KVM_RUN) with KVM_EXIT_MMIO or KVM_EXIT_IO.

  struct kvm_ioregionfd {
      __u64 guest_physical_addr;
      __u64 memory_size; /* bytes */
      __s32 fd;
      __u32 region_id;
      __u32 flags;
      __u8  pad[36];
  };

  /* for kvm_ioregionfd::flags */
  #define KVM_IOREGIONFD_PIO           (1u << 0)
  #define KVM_IOREGIONFD_POSTED_WRITES (1u << 1)

Regions are deleted by passing zero for memory_size.

MMIO is the default. The KVM_IOREGIONFD_PIO flag selects PIO instead.

The region_id is an opaque token that is included as part of the write
to the file descriptor. It is typically a unique identifier for this
region but KVM does not interpret its value.

Both read and write guest accesses wait until an acknowledgement is
received on the file descriptor. The KVM_IOREGIONFD_POSTED_WRITES flag
skips waiting for an acknowledgement on write accesses. This is
suitable for accesses that do not require synchronous emulation, such
as doorbell register writes.
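For illustration, here is a rough sketch of how a VMM might register a
region with this API. Since KVM_SET_IOREGIONFD is only a proposal,
nothing below exists in current kernels; the socketpair() setup and the
register_ioregion() helper are illustrative assumptions, not part of
the proposal:

  /* Sketch only: KVM_SET_IOREGIONFD, KVM_CAP_IOREGIONFD, and
   * struct kvm_ioregionfd are the proposed definitions above and are
   * not in <linux/kvm.h> today. */
  #include <sys/ioctl.h>
  #include <sys/socket.h>
  #include <linux/types.h>
  #include <err.h>

  /* Register [gpa, gpa + size) as an ioregionfd region on vm_fd and
   * return the file descriptor end for the device emulation process. */
  static int register_ioregion(int vm_fd, __u64 gpa, __u64 size,
                               __u32 region_id, __u32 flags)
  {
      int sv[2];

      /* Bidirectional I/O is needed, so a socketpair rather than a pipe. */
      if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
          err(1, "socketpair");

      struct kvm_ioregionfd region = {
          .guest_physical_addr = gpa,
          .memory_size         = size,       /* 0 would delete the region */
          .fd                  = sv[0],      /* end used by kvm.ko */
          .region_id           = region_id,  /* opaque, echoed in each command */
          .flags               = flags,      /* e.g. KVM_IOREGIONFD_POSTED_WRITES */
      };

      if (ioctl(vm_fd, KVM_SET_IOREGIONFD, &region) < 0)
          err(1, "KVM_SET_IOREGIONFD");

      return sv[1];  /* hand this end to the device emulation process */
  }

A doorbell register would typically be registered with
KVM_IOREGIONFD_POSTED_WRITES so the vCPU does not wait for an
acknowledgement, while a region with computed reads would leave flags
at zero.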
Wire protocol
-------------
The protocol spoken over the file descriptor is as follows. The device
reads commands from the file descriptor with the following layout:

  struct ioregionfd_cmd {
      __u32 info;
      __u32 region_id;
      __u64 addr;
      __u64 data;
      __u8  pad[8];
  };

  /* for ioregionfd_cmd::info */
  #define IOREGIONFD_CMD_MASK 0xf
  # define IOREGIONFD_CMD_READ 0
  # define IOREGIONFD_CMD_WRITE 1
  #define IOREGIONFD_SIZE_MASK 0x30
  #define IOREGIONFD_SIZE_SHIFT 4
  # define IOREGIONFD_SIZE_8BIT 0
  # define IOREGIONFD_SIZE_16BIT 1
  # define IOREGIONFD_SIZE_32BIT 2
  # define IOREGIONFD_SIZE_64BIT 3
  #define IOREGIONFD_NEED_PIO (1u << 6)
  #define IOREGIONFD_NEED_RESPONSE (1u << 7)

The command is interpreted by inspecting the info field:

  switch (cmd.info & IOREGIONFD_CMD_MASK) {
  case IOREGIONFD_CMD_READ:
      /* It's a read access */
      break;
  case IOREGIONFD_CMD_WRITE:
      /* It's a write access */
      break;
  default:
      /* Protocol violation, terminate connection */
  }

The access size is interpreted by inspecting the info field:

  unsigned size = (cmd.info & IOREGIONFD_SIZE_MASK) >> IOREGIONFD_SIZE_SHIFT;
  /* where nbytes = pow(2, size) */

The region_id indicates which MMIO/PIO region is being accessed. This
field has no inherent structure but is typically a unique identifier.

The byte offset being accessed within that region is addr.

If the command is IOREGIONFD_CMD_WRITE then data contains the value
being written.

MMIO is the default. The IOREGIONFD_NEED_PIO flag is set on PIO
accesses.

When IOREGIONFD_NEED_RESPONSE is not set on an IOREGIONFD_CMD_WRITE
command, the device must not send a response. This flag has no effect
for IOREGIONFD_CMD_READ commands.

The device sends responses by writing the following structure to the
file descriptor:

  struct ioregionfd_resp {
      __u64 data;
      __u32 info;
      __u8  pad[20];
  };

  /* for ioregionfd_resp::info */
  #define IOREGIONFD_RESP_FAILED (1u << 0)

The info field is zero on success. The IOREGIONFD_RESP_FAILED flag is
set on failure.

The data field contains the value read by an IOREGIONFD_CMD_READ
command. This field is zero for other commands.
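A device emulation process's command loop might then look roughly like
this (a sketch against the definitions above; handle_read() and
handle_write() are hypothetical device-specific callbacks, and reading
exactly one command per read() is an assumption since message framing
is not pinned down here):

  #include <string.h>
  #include <unistd.h>
  #include <linux/types.h>

  /* Hypothetical device-specific handlers */
  __u64 handle_read(__u32 region_id, __u64 addr, unsigned nbytes);
  void handle_write(__u32 region_id, __u64 addr, unsigned nbytes, __u64 data);

  static void serve_ioregion(int fd)
  {
      struct ioregionfd_cmd cmd;
      struct ioregionfd_resp resp;

      while (read(fd, &cmd, sizeof(cmd)) == sizeof(cmd)) {
          unsigned size = (cmd.info & IOREGIONFD_SIZE_MASK) >>
                          IOREGIONFD_SIZE_SHIFT;  /* nbytes = 1 << size */

          memset(&resp, 0, sizeof(resp));  /* info == 0 means success */

          switch (cmd.info & IOREGIONFD_CMD_MASK) {
          case IOREGIONFD_CMD_READ:
              /* Reads always carry the value back in the response. */
              resp.data = handle_read(cmd.region_id, cmd.addr, 1u << size);
              break;
          case IOREGIONFD_CMD_WRITE:
              handle_write(cmd.region_id, cmd.addr, 1u << size, cmd.data);
              if (!(cmd.info & IOREGIONFD_NEED_RESPONSE))
                  continue;  /* posted write, no acknowledgement */
              break;
          default:
              return;  /* protocol violation, terminate connection */
          }

          /* On emulation failure, set resp.info |= IOREGIONFD_RESP_FAILED. */
          if (write(fd, &resp, sizeof(resp)) != sizeof(resp))
              return;
      }
  }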
Does it support polling?
------------------------
Yes, use io_uring's IORING_OP_READ to submit an asynchronous read on
the file descriptor. Poll the io_uring cq ring to detect when the read
has completed.

Although this dispatch mechanism incurs more overhead than polling
directly on guest RAM, it overcomes the limitations of polling: it
supports read accesses as well as capturing written values instead of
overwriting them.

Does it obsolete ioeventfd?
---------------------------
No, although KVM_IOREGIONFD_POSTED_WRITES offers somewhat similar
functionality to ioeventfd, there are differences. The datamatch
functionality of ioeventfd is not available and would need to be
implemented by the device emulation program. Due to the counter
semantics of eventfds there is automatic coalescing of repeated
accesses with ioeventfd. Overall ioeventfd is lighter weight but also
more limited.

How does it scale?
------------------
The protocol is synchronous - only one command/response cycle is in
flight at a time. The vCPU will be blocked until the response has been
processed anyway. If another vCPU accesses an MMIO or PIO region with
the same file descriptor during this time then it will wait too.

In practice this is not a problem since per-queue file descriptors can
be set up for multi-queue devices. It is up to the device emulation
program whether to handle multiple devices over the same file
descriptor or not.

What exactly is the file descriptor (e.g. eventfd, pipe, char device)?
----------------------------------------------------------------------
Any file descriptor that supports bidirectional I/O would do. This
rules out eventfds and pipes. socketpair(AF_UNIX) is a likely
candidate. Maybe a char device will be necessary for improved
performance.

Can this be part of KVM_SET_USER_MEMORY_REGION?
-----------------------------------------------
Maybe. Perhaps everything can be squeezed into struct
kvm_userspace_memory_region, but it's only worth doing if the memory
region code needs to be reused for this in the first place. I'm not
sure.

What do you think?
------------------
I hope this serves as a starting point for improved MMIO/PIO dispatch
in KVM. There are no immediate plans to implement this but I think it
will become necessary within the next year or two.

1. Does it meet your requirements?
2. Are there better alternatives?

Thanks,
Stefan