Hi,
I wanted to share this idea with the KVM community and VMM developers.
If this isn't relevant to you but you know someone who should
participate, please feel free to add them :).

The following is an outline of "ioregionfd", a cross between ioeventfd
and KVM memory regions. This mechanism would be helpful for VMMs that
emulate devices in separate processes, for muser/VFIO, and for existing
use cases that ioeventfd cannot handle.

Background
----------
There are currently two mechanisms for dispatching MMIO/PIO accesses in
KVM: returning KVM_EXIT_MMIO/KVM_EXIT_IO from ioctl(KVM_RUN) and
ioeventfd. Some VMMs also use polling to avoid dispatching
performance-critical MMIO/PIO accesses altogether.

These mechanisms have shortcomings for VMMs that perform device
emulation in separate processes (usually for increased security):

1. Only one process performs ioctl(KVM_RUN) for a vCPU, so that
   mechanism is not available to device emulation processes.

2. ioeventfd does not store the value written. This makes it
   unsuitable, for example, for NVMe Submission Queue Tail Doorbell
   registers, where the value written is needed by the device emulation
   process. ioeventfd also does not support read operations.

3. Polling does not support computed read operations, and only the
   latest value written is available to the device emulation process
   (intermediate values are overwritten if the guest performs multiple
   accesses).

Overview
--------
This proposal aims to address this gap through a wire protocol and a
new KVM API for registering MMIO/PIO regions that use this alternative
dispatch mechanism.

The KVM API is used by the VMM to set up dispatch. The wire protocol is
used to dispatch accesses from KVM to the device emulation process.

This new MMIO/PIO dispatch mechanism eliminates the need to return from
ioctl(KVM_RUN) in the VMM and then exchange messages with a device
emulation process.

Inefficient dispatch to device processes today:

  kvm.ko <---ioctl(KVM_RUN)---> VMM <---messages---> device

Direct dispatch with the new mechanism:

  kvm.ko <---ioctl(KVM_RUN)---> VMM
     ^
     `---new MMIO/PIO mechanism-> device

Even single-process VMMs can take advantage of the new mechanism. For
example, QEMU's emulated NVMe storage controller can implement IOThread
support.

No constraint is placed on the device process architecture. A single
process could emulate all devices belonging to the guest, each device
could be its own process, or something in between.

Both ioeventfd and traditional KVM_EXIT_MMIO/KVM_EXIT_IO emulation
continue to work alongside the new mechanism, but only one of them is
used for any given guest address.

KVM API
-------
The following new KVM ioctl is added:

KVM_SET_IOREGIONFD
Capability: KVM_CAP_IOREGIONFD
Architectures: all
Type: vm ioctl
Parameters: struct kvm_ioregionfd (in)
Returns: 0 on success, !0 on error

This ioctl adds, modifies, or removes MMIO or PIO regions where guest
accesses are dispatched through a given file descriptor instead of
returning from ioctl(KVM_RUN) with KVM_EXIT_MMIO or KVM_EXIT_IO.

  struct kvm_ioregionfd {
      __u64 guest_physical_addr;
      __u64 memory_size; /* bytes */
      __s32 fd;
      __u32 region_id;
      __u32 flags;
      __u8  pad[36];
  };

  /* for kvm_ioregionfd::flags */
  #define KVM_IOREGIONFD_PIO           (1u << 0)
  #define KVM_IOREGIONFD_POSTED_WRITES (1u << 1)

Regions are deleted by passing zero for memory_size.

MMIO is the default. The KVM_IOREGIONFD_PIO flag selects PIO instead.

The region_id is an opaque token that is included as part of the write
to the file descriptor. It is typically a unique identifier for this
region but KVM does not interpret its value.

Both read and write guest accesses wait until an acknowledgement is
received on the file descriptor. The KVM_IOREGIONFD_POSTED_WRITES flag
skips waiting for an acknowledgement on write accesses. This is
suitable for accesses that do not require synchronous emulation, such
as doorbell register writes.
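For illustration, here is a rough sketch of how a VMM might register a
region with this API. Since KVM_SET_IOREGIONFD is only a proposal,
nothing below exists in current kernels; the socketpair() setup and the
register_ioregion() helper are illustrative assumptions, not part of
the proposal:

  /* Sketch only: KVM_SET_IOREGIONFD, KVM_CAP_IOREGIONFD, and
   * struct kvm_ioregionfd are the proposed definitions above and are
   * not in <linux/kvm.h> today. */
  #include <sys/ioctl.h>
  #include <sys/socket.h>
  #include <linux/types.h>
  #include <err.h>

  /* Register [gpa, gpa + size) as an ioregionfd region on vm_fd and
   * return the file descriptor end for the device emulation process. */
  static int register_ioregion(int vm_fd, __u64 gpa, __u64 size,
                               __u32 region_id, __u32 flags)
  {
      int sv[2];

      /* Bidirectional I/O is needed, so a socketpair rather than a pipe. */
      if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
          err(1, "socketpair");

      struct kvm_ioregionfd region = {
          .guest_physical_addr = gpa,
          .memory_size         = size,       /* 0 would delete the region */
          .fd                  = sv[0],      /* end used by kvm.ko */
          .region_id           = region_id,  /* opaque, echoed in each command */
          .flags               = flags,      /* e.g. KVM_IOREGIONFD_POSTED_WRITES */
      };

      if (ioctl(vm_fd, KVM_SET_IOREGIONFD, &region) < 0)
          err(1, "KVM_SET_IOREGIONFD");

      return sv[1];  /* hand this end to the device emulation process */
  }

A doorbell register would typically be registered with
KVM_IOREGIONFD_POSTED_WRITES so the vCPU does not wait for an
acknowledgement, while a region with computed reads would leave flags
at zero.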
Wire protocol
-------------
The protocol spoken over the file descriptor is as follows. The device
reads commands from the file descriptor with the following layout:

  struct ioregionfd_cmd {
      __u32 info;
      __u32 region_id;
      __u64 addr;
      __u64 data;
      __u8  pad[8];
  };

  /* for ioregionfd_cmd::info */
  #define IOREGIONFD_CMD_MASK 0xf
  # define IOREGIONFD_CMD_READ 0
  # define IOREGIONFD_CMD_WRITE 1
  #define IOREGIONFD_SIZE_MASK 0x30
  #define IOREGIONFD_SIZE_SHIFT 4
  # define IOREGIONFD_SIZE_8BIT 0
  # define IOREGIONFD_SIZE_16BIT 1
  # define IOREGIONFD_SIZE_32BIT 2
  # define IOREGIONFD_SIZE_64BIT 3
  #define IOREGIONFD_NEED_PIO (1u << 6)
  #define IOREGIONFD_NEED_RESPONSE (1u << 7)

The command is interpreted by inspecting the info field:

  switch (cmd.info & IOREGIONFD_CMD_MASK) {
  case IOREGIONFD_CMD_READ:
      /* It's a read access */
      break;
  case IOREGIONFD_CMD_WRITE:
      /* It's a write access */
      break;
  default:
      /* Protocol violation, terminate connection */
  }

The access size is interpreted by inspecting the info field:

  unsigned size = (cmd.info & IOREGIONFD_SIZE_MASK) >> IOREGIONFD_SIZE_SHIFT;
  /* where nbytes = pow(2, size) */

The region_id indicates which MMIO/PIO region is being accessed. This
field has no inherent structure but is typically a unique identifier.

The byte offset being accessed within that region is addr.

If the command is IOREGIONFD_CMD_WRITE then data contains the value
being written.

MMIO is the default. The IOREGIONFD_NEED_PIO flag is set on PIO
accesses.

When IOREGIONFD_NEED_RESPONSE is not set on an IOREGIONFD_CMD_WRITE
command, the device must not send a response. This flag has no effect
for IOREGIONFD_CMD_READ commands.

The device sends responses by writing the following structure to the
file descriptor:

  struct ioregionfd_resp {
      __u64 data;
      __u32 info;
      __u8  pad[20];
  };

  /* for ioregionfd_resp::info */
  #define IOREGIONFD_RESP_FAILED (1u << 0)

The info field is zero on success. The IOREGIONFD_RESP_FAILED flag is
set on failure.

The data field contains the value read by an IOREGIONFD_CMD_READ
command. This field is zero for other commands.
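A device emulation process's command loop might then look roughly like
this (a sketch against the definitions above; handle_read() and
handle_write() are hypothetical device-specific callbacks, and reading
exactly one command per read() is an assumption since message framing
is not pinned down here):

  #include <string.h>
  #include <unistd.h>
  #include <linux/types.h>

  /* Hypothetical device-specific handlers */
  __u64 handle_read(__u32 region_id, __u64 addr, unsigned nbytes);
  void handle_write(__u32 region_id, __u64 addr, unsigned nbytes, __u64 data);

  static void serve_ioregion(int fd)
  {
      struct ioregionfd_cmd cmd;
      struct ioregionfd_resp resp;

      while (read(fd, &cmd, sizeof(cmd)) == sizeof(cmd)) {
          unsigned size = (cmd.info & IOREGIONFD_SIZE_MASK) >>
                          IOREGIONFD_SIZE_SHIFT;  /* nbytes = 1 << size */

          memset(&resp, 0, sizeof(resp));  /* info == 0 means success */

          switch (cmd.info & IOREGIONFD_CMD_MASK) {
          case IOREGIONFD_CMD_READ:
              /* Reads always carry the value back in the response. */
              resp.data = handle_read(cmd.region_id, cmd.addr, 1u << size);
              break;
          case IOREGIONFD_CMD_WRITE:
              handle_write(cmd.region_id, cmd.addr, 1u << size, cmd.data);
              if (!(cmd.info & IOREGIONFD_NEED_RESPONSE))
                  continue;  /* posted write, no acknowledgement */
              break;
          default:
              return;  /* protocol violation, terminate connection */
          }

          /* On emulation failure, set resp.info |= IOREGIONFD_RESP_FAILED. */
          if (write(fd, &resp, sizeof(resp)) != sizeof(resp))
              return;
      }
  }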
Does it support polling?
------------------------
Yes, use io_uring's IORING_OP_READ to submit an asynchronous read on
the file descriptor. Poll the io_uring cq ring to detect when the read
has completed.

Although this dispatch mechanism incurs more overhead than polling
directly on guest RAM, it overcomes the limitations of polling: it
supports read accesses as well as capturing written values instead of
overwriting them.

Does it obsolete ioeventfd?
---------------------------
No, although KVM_IOREGIONFD_POSTED_WRITES offers somewhat similar
functionality to ioeventfd, there are differences. The datamatch
functionality of ioeventfd is not available and would need to be
implemented by the device emulation program. Due to the counter
semantics of eventfds there is automatic coalescing of repeated
accesses with ioeventfd. Overall ioeventfd is lighter weight but also
more limited.

How does it scale?
------------------
The protocol is synchronous - only one command/response cycle is in
flight at a time. The vCPU will be blocked until the response has been
processed anyway. If another vCPU accesses an MMIO or PIO region with
the same file descriptor during this time then it will wait too.

In practice this is not a problem since per-queue file descriptors can
be set up for multi-queue devices. It is up to the device emulation
program whether to handle multiple devices over the same file
descriptor or not.

What exactly is the file descriptor (e.g. eventfd, pipe, char device)?
----------------------------------------------------------------------
Any file descriptor that supports bidirectional I/O would do. This
rules out eventfds and pipes. socketpair(AF_UNIX) is a likely
candidate. Maybe a char device will be necessary for improved
performance.

Can this be part of KVM_SET_USER_MEMORY_REGION?
-----------------------------------------------
Maybe. Perhaps everything can be squeezed into struct
kvm_userspace_memory_region, but it's only worth doing if the memory
region code needs to be reused for this in the first place. I'm not
sure.

What do you think?
------------------
I hope this serves as a starting point for improved MMIO/PIO dispatch
in KVM. There are no immediate plans to implement this but I think it
will become necessary within the next year or two.

1. Does it meet your requirements?
2. Are there better alternatives?

Thanks,
Stefan