Re: Proposal for MMIO/PIO dispatch file descriptors (ioregionfd)

Ah, I forgot to ask a few other questions.

> On Feb 25, 2020, at 12:19 PM, Felipe Franciosi <felipe@xxxxxxxxxxx> wrote:
> 
> Hi,
> 
> This looks amazing, Stefan. The lack of such a mechanism troubled us
> during the development of MUSER and resulted in the slow path we have
> today for MMIO with register semantics (used when writes must not be
> overwritten before the device emulator has had a chance to process them).
> 
> I have added some comments inline, but wanted to first link your
> proposal to an idea that I discussed with Maxim Levitsky back in
> Lyon and build on it a little. IIRC/IIUC Maxim was keen on a VT-x
> extension where a CPU could IPI another to handle events which would
> normally cause a VMEXIT. That is probably more applicable to the
> standard ioeventfd model, but it got me thinking about PML.
> 
> Bear with me. :)
> 
> In addition to an fd, which could be used for notifications only, the
> wire protocol could append "struct ioregionfd_cmd"s (probably renamed
> to drop "fd") to one or more pages (perhaps a ring buffer of sorts).
> 
> That would only work for writes; reads would still be synchronous.
> 
> The device emulator therefore doesn't have to respond to each write
> command. It could process the whole lot whenever it gets around to it.
> Most importantly (and linking back to the VT-x extension idea), maybe
> we can avoid the VMEXIT altogether if the CPU could take care of
> appending writes to that buffer. Thoughts?
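> 
> A very rough sketch of the shared structure I have in mind (all names
> hypothetical, assuming a single-producer/single-consumer ring per
> region or per vCPU):
> 
> struct ioregion_write_ring {
>         __u32 head;   /* next slot KVM fills */
>         __u32 tail;   /* next slot the emulator consumes */
>         __u32 size;   /* number of slots, power of two */
>         __u32 pad;
>         struct ioregionfd_cmd cmds[];  /* write records only */
> };
> 
> The emulator would drain [tail, head) whenever convenient; reads (and
> a full ring) would still need the synchronous fd round-trip.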
> 
>> On Feb 22, 2020, at 8:19 PM, Stefan Hajnoczi <stefanha@xxxxxxxxxx> wrote:
>> 
>> Hi,
>> I wanted to share this idea with the KVM community and VMM developers.
>> If this isn't relevant to you but you know someone who should
>> participate, please feel free to add them :).
>> 
>> The following is an outline of "ioregionfd", a cross between ioeventfd
>> and KVM memory regions.  This mechanism would be helpful for VMMs that
>> emulate devices in separate processes, muser/VFIO, and to address
>> existing use cases that ioeventfd cannot handle.
>> 
>> Background
>> ----------
>> There are currently two mechanisms for dispatching MMIO/PIO accesses in
>> KVM: returning KVM_EXIT_MMIO/KVM_EXIT_IO from ioctl(KVM_RUN) and
>> ioeventfd.  Some VMMs also use polling to avoid dispatching
>> performance-critical MMIO/PIO accesses altogether.
>> 
>> These mechanisms have shortcomings for VMMs that perform device
>> emulation in separate processes (usually for increased security):
>> 
>> 1. Only one process performs ioctl(KVM_RUN) for a vCPU, so that
>>  mechanism is not available to device emulation processes.
>> 
>> 2. ioeventfd does not store the value written.  This makes it unsuitable
>>  for NVMe Submission Queue Tail Doorbell registers because the value
>>  written is needed by the device emulation process, for example.
>>  ioeventfd also does not support read operations.
>> 
>> 3. Polling does not support computed read operations and only the latest
>>  value written is available to the device emulation process
>>  (intermediate values are overwritten if the guest performs multiple
>>  accesses).
>> 
>> Overview
>> --------
>> This proposal aims to address this gap through a wire protocol and a new
>> KVM API for registering MMIO/PIO regions that use this alternative
>> dispatch mechanism.
>> 
>> The KVM API is used by the VMM to set up dispatch.  The wire protocol is
>> used to dispatch accesses from KVM to the device emulation process.
>> 
>> This new MMIO/PIO dispatch mechanism eliminates the need to return from
>> ioctl(KVM_RUN) in the VMM and then exchange messages with a device
>> emulation process.
>> 
>> Inefficient dispatch to device processes today:
>> 
>>  kvm.ko  <---ioctl(KVM_RUN)---> VMM <---messages---> device
>> 
>> Direct dispatch with the new mechanism:
>> 
>>  kvm.ko  <---ioctl(KVM_RUN)---> VMM
>>    ^
>>    `---new MMIO/PIO mechanism-> device
>> 
>> Even single-process VMMs can take advantage of the new mechanism.  For
>> example, QEMU's emulated NVMe storage controller can implement IOThread
>> support.
>> 
>> No constraint is placed on the device process architecture.  A single
>> process could emulate all devices belonging to the guest, each device
>> could be its own process, or something in between.
>> 
>> Both ioeventfd and traditional KVM_EXIT_MMIO/KVM_EXIT_IO emulation
>> continue to work alongside the new mechanism, but only one of them is
>> used for any given guest address.
>> 
>> KVM API
>> -------
>> The following new KVM ioctl is added:
>> 
>> KVM_SET_IOREGIONFD
>> Capability: KVM_CAP_IOREGIONFD
>> Architectures: all
>> Type: vm ioctl
>> Parameters: struct kvm_ioregionfd (in)
>> Returns: 0 on success, !0 on error
>> 
>> This ioctl adds, modifies, or removes MMIO or PIO regions where guest
>> accesses are dispatched through a given file descriptor instead of
>> returning from ioctl(KVM_RUN) with KVM_EXIT_MMIO or KVM_EXIT_IO.
>> 
>> struct kvm_ioregionfd {
>>   __u64 guest_physical_addr;
>>   __u64 memory_size; /* bytes */
>>   __s32 fd;
>>   __u32 region_id;
>>   __u32 flags;
>>   __u8  pad[36];
>> };
>> 
>> /* for kvm_ioregionfd::flags */
>> #define KVM_IOREGIONFD_PIO           (1u << 0)
>> #define KVM_IOREGIONFD_POSTED_WRITES (1u << 1)
>> 
>> Regions are deleted by passing zero for memory_size.
>> 
>> MMIO is the default.  The KVM_IOREGIONFD_PIO flag selects PIO instead.
>> 
>> The region_id is an opaque token that is included as part of the write
>> to the file descriptor.  It is typically a unique identifier for this
>> region but KVM does not interpret its value.
>> 
>> Both read and write guest accesses wait until an acknowledgement is
>> received on the file descriptor.  The KVM_IOREGIONFD_POSTED_WRITES flag
>> skips waiting for an acknowledgement on write accesses.  This is
>> suitable for accesses that do not require synchronous emulation, such as
>> doorbell register writes.
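>> 
>> For illustration, VMM-side registration might look roughly like the
>> sketch below (assuming the ioctl and struct proposed above; the BAR
>> address, region_id, and fd handling are hypothetical, headers and
>> error handling are elided, and KVM holds one end of a socketpair
>> while the device emulation process holds the other):
>> 
>> int fds[2];
>> socketpair(AF_UNIX, SOCK_STREAM, 0, fds);  /* fds[1] is passed to the device process */
>> 
>> struct kvm_ioregionfd region = {
>>     .guest_physical_addr = 0xfebf0000,     /* example MMIO BAR address */
>>     .memory_size         = 0x1000,
>>     .fd                  = fds[0],         /* end KVM dispatches through */
>>     .region_id           = 42,             /* opaque token echoed back in each command */
>>     .flags               = 0,              /* MMIO with synchronous writes */
>> };
>> ioctl(vm_fd, KVM_SET_IOREGIONFD, &region);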
>> 
>> Wire protocol
>> -------------
>> The protocol spoken over the file descriptor is as follows.  The device
>> reads commands from the file descriptor with the following layout:
>> 
>> struct ioregionfd_cmd {
>>   __u32 info;
>>   __u32 region_id;
>>   __u64 addr;
>>   __u64 data;
>>   __u8 pad[8];
>> };
>> 
>> /* for ioregionfd_cmd::info */
>> #define IOREGIONFD_CMD_MASK 0xf
>> # define IOREGIONFD_CMD_READ 0
>> # define IOREGIONFD_CMD_WRITE 1
> 
> Why do we need 4 bits for this? I appreciate you want to align the
> next field, but there's SIZE_SHIFT for that; you could have CMD_MASK
> set to 0x1 unless I'm missing something. The reserved space could be
> used for something else in the future.
> 
>> #define IOREGIONFD_SIZE_MASK 0x30
>> #define IOREGIONFD_SIZE_SHIFT 4
>> # define IOREGIONFD_SIZE_8BIT 0
>> # define IOREGIONFD_SIZE_16BIT 1
>> # define IOREGIONFD_SIZE_32BIT 2
>> # define IOREGIONFD_SIZE_64BIT 3
> 
> Christophe already asked about the 64-bit limit. I think that's fine,
> and am assuming that if larger accesses are ever needed they can just
> be split in two commands by KVM?
> 
>> #define IOREGIONFD_NEED_PIO (1u << 6)
>> #define IOREGIONFD_NEED_RESPONSE (1u << 7)
>> 
>> The command is interpreted by inspecting the info field:
>> 
>> switch (cmd.info & IOREGIONFD_CMD_MASK) {
>> case IOREGIONFD_CMD_READ:
>>     /* It's a read access */
>>     break;
>> case IOREGIONFD_CMD_WRITE:
>>     /* It's a write access */
>>     break;
>> default:
>>     /* Protocol violation, terminate connection */
>> }
>> 
>> The access size is interpreted by inspecting the info field:
>> 
>> unsigned size = (cmd.info & IOREGIONFD_SIZE_MASK) >> IOREGIONFD_SIZE_SHIFT;
>> /* where nbytes = pow(2, size) */
>> 
>> The region_id indicates which MMIO/PIO region is being accessed.  This
>> field has no inherent structure but is typically a unique identifier.
>> 
>> The byte offset being accessed within that region is addr.
> 
> It's not clear to me if addr is an absolute GPA or an offset into the
> region. It sounds like the latter, in which case isn't it preferable
> to name this "offset"?
> 
>> 
>> If the command is IOREGIONFD_CMD_WRITE then data contains the value
>> being written.
>> 
>> MMIO is the default.  The IOREGIONFD_NEED_PIO flag is set on PIO
>> accesses.
>> 
>> When IOREGIONFD_NEED_RESPONSE is set on an IOREGIONFD_CMD_WRITE command,
>> no response must be sent.  This flag has no effect for
>> IOREGIONFD_CMD_READ commands.
> 
> Christophe already flagged this, too. :)
> 
> That's all I had for now.
> 
> F.
> 
>> 
>> The device sends responses by writing the following structure to the
>> file descriptor:
>> 
>> struct ioregionfd_resp {
>>   __u64 data;
>>   __u32 info;
>>   __u8 pad[20];
>> };
>> 
>> /* for ioregionfd_resp::info */
>> #define IOREGIONFD_RESP_FAILED (1u << 0)
>> 
>> The info field is zero on success.  The IOREGIONFD_RESP_FAILED flag is
>> set on failure.
>> 
>> The data field contains the value read by an IOREGIONFD_CMD_READ
>> command.  This field is zero for other commands.
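>> 
>> Putting the two structures together, a minimal device-side dispatch
>> loop might look like the following sketch (blocking I/O, no error
>> handling; device_read()/device_write() are hypothetical emulation
>> callbacks, and it assumes IOREGIONFD_NEED_RESPONSE means a response
>> is expected):
>> 
>> static void handle_ioregion(int fd)
>> {
>>     struct ioregionfd_cmd cmd;
>>     struct ioregionfd_resp resp;
>> 
>>     while (read(fd, &cmd, sizeof(cmd)) == sizeof(cmd)) {
>>         unsigned size = 1u << ((cmd.info & IOREGIONFD_SIZE_MASK) >>
>>                                IOREGIONFD_SIZE_SHIFT);
>> 
>>         memset(&resp, 0, sizeof(resp));
>> 
>>         switch (cmd.info & IOREGIONFD_CMD_MASK) {
>>         case IOREGIONFD_CMD_READ:
>>             resp.data = device_read(cmd.region_id, cmd.addr, size);
>>             write(fd, &resp, sizeof(resp));
>>             break;
>>         case IOREGIONFD_CMD_WRITE:
>>             device_write(cmd.region_id, cmd.addr, cmd.data, size);
>>             if (cmd.info & IOREGIONFD_NEED_RESPONSE)
>>                 write(fd, &resp, sizeof(resp));
>>             break;
>>         default:
>>             /* protocol violation: terminate the connection */
>>             close(fd);
>>             return;
>>         }
>>     }
>> }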
>> 
>> Does it support polling?
>> ------------------------
>> Yes, use io_uring's IORING_OP_READ to submit an asynchronous read on the
>> file descriptor.  Poll the io_uring cq ring to detect when the read has
>> completed.
>> 
>> Although this dispatch mechanism incurs more overhead than polling
>> directly on guest RAM, it overcomes the limitations of polling: it
>> supports read accesses as well as capturing written values instead of
>> overwriting them.
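>> 
>> With liburing, the polling setup might look roughly like this sketch
>> (hypothetical; a real implementation would batch commands and re-arm
>> the read after each completion):
>> 
>> struct io_uring ring;
>> struct io_uring_cqe *cqe;
>> struct io_uring_sqe *sqe;
>> struct ioregionfd_cmd cmd;
>> 
>> io_uring_queue_init(8, &ring, 0);
>> 
>> sqe = io_uring_get_sqe(&ring);
>> io_uring_prep_read(sqe, fd, &cmd, sizeof(cmd), 0);
>> io_uring_submit(&ring);
>> 
>> /* Busy-poll the completion queue instead of blocking in read(2) */
>> while (io_uring_peek_cqe(&ring, &cqe) == -EAGAIN)
>>     ;  /* spin, or do other useful work */
>> 
>> io_uring_cqe_seen(&ring, cqe);
>> /* cmd now holds the next guest access; handle it and re-arm the read */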
>> 
>> Does it obsolete ioeventfd?
>> ---------------------------
>> No, although KVM_IOREGIONFD_POSTED_WRITES offers somewhat similar
>> functionality to ioeventfd, there are differences.  The datamatch
>> functionality of ioeventfd is not available and would need to be
>> implemented by the device emulation program.  Due to the counter
>> semantics of eventfds there is automatic coalescing of repeated accesses
>> with ioeventfd.  Overall ioeventfd is lighter weight but also more
>> limited.
>> 
>> How does it scale?
>> ------------------
>> The protocol is synchronous - only one command/response cycle is in
>> flight at a time.  The vCPU will be blocked until the response has been
>> processed anyway.  If another vCPU accesses an MMIO or PIO region with
>> the same file descriptor during this time then it will wait too.

What happens if a vCPU issues an MMIO read access and the kernel task
is blocked reading from the fd, but the userspace counterpart does not respond?

Would the vCPU still respond to SIGIPIs if blocked?

What implications do you see or thoughts do you have for live migration?

Cheers,
Felipe

>> 
>> In practice this is not a problem since per-queue file descriptors can
>> be set up for multi-queue devices.
>> 
>> It is up to the device emulation program whether to handle multiple
>> devices over the same file descriptor or not.
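>> 
>> For a multi-queue device this could mean one socketpair and one
>> KVM_SET_IOREGIONFD call per queue doorbell, for example (hypothetical
>> layout with posted writes for the doorbells):
>> 
>> for (unsigned i = 0; i < nr_queues; i++) {
>>     struct kvm_ioregionfd region = {
>>         .guest_physical_addr = doorbell_base + i * doorbell_stride,
>>         .memory_size         = doorbell_stride,
>>         .fd                  = queue_fd[i],   /* one fd per queue */
>>         .region_id           = i,             /* queue index as the token */
>>         .flags               = KVM_IOREGIONFD_POSTED_WRITES,
>>     };
>>     ioctl(vm_fd, KVM_SET_IOREGIONFD, &region);
>> }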
>> 
>> What exactly is the file descriptor (e.g. eventfd, pipe, char device)?
>> ----------------------------------------------------------------------
>> Any file descriptor that supports bidirectional I/O would do.  This
>> rules out eventfds and pipes.  socketpair(AF_UNIX) is a likely
>> candidate.  Maybe a char device will be necessary for improved
>> performance.
>> 
>> Can this be part of KVM_SET_USER_MEMORY_REGION?
>> -----------------------------------------------
>> Maybe.  Perhaps everything can be squeezed into struct
>> kvm_userspace_memory_region but it's only worth doing if the memory
>> region code needs to be reused for this in the first place.  I'm not
>> sure.
>> 
>> What do you think?
>> ------------------
>> I hope this serves as a starting point for improved MMIO/PIO dispatch in
>> KVM.  There are no immediate plans to implement this but I think it will
>> become necessary within the next year or two.
>> 
>> 1. Does it meet your requirements?
>> 2. Are there better alternatives?
>> 
>> Thanks,
>> Stefan
> 




