Re: Proposal for MMIO/PIO dispatch file descriptors (ioregionfd)

On Mon, Feb 24, 2020 at 05:14:25PM +0100, Christophe de Dinechin wrote:
> 
> Stefan Hajnoczi writes:
> 
> > Hi,
> > I wanted to share this idea with the KVM community and VMM developers.
> > If this isn't relevant to you but you know someone who should
> > participate, please feel free to add them :).
> >
> > The following is an outline of "ioregionfd", a cross between ioeventfd
> > and KVM memory regions.  This mechanism would be helpful for VMMs that
> > emulate devices in separate processes, muser/VFIO, and to address
> > existing use cases that ioeventfd cannot handle.
> 
> Looks interesting.
> 
> > This ioctl adds, modifies, or removes MMIO or PIO regions where guest
> > accesses are dispatched through a given file descriptor instead of
> > returning from ioctl(KVM_RUN) with KVM_EXIT_MMIO or KVM_EXIT_PIO.
> 
> What file descriptors can be used for that? Is there an equivalent to
> eventfd(2)? Your answer at end of mail seems to be that you could use
> socketpair(AF_UNIX) or a char device. But it seems "weird" to me that
> some arbitrary fd could have its behavior overridden by another process
> doing this KVM ioctl.  Are there precedents for that kind of "fd takeover"
> behavior?

Yes, one example is userspace providing a TCP/IP socket to the NBD
kernel driver.

Think of it as asking the kernel to do read(2)/write(2) on an fd on
behalf of the process.

> >
> > struct kvm_ioregionfd {
> >     __u64 guest_physical_addr;
> >     __u64 memory_size; /* bytes */
> >     __s32 fd;
> >     __u32 region_id;
> >     __u32 flags;
> >     __u8  pad[36];
> > };
> >
> > /* for kvm_ioregionfd::flags */
> > #define KVM_IOREGIONFD_PIO           (1u << 0)
> > #define KVM_IOREGIONFD_POSTED_WRITES (1u << 1)
> >
> > Regions are deleted by passing zero for memory_size.
> 
> For delete and modify, this means you have to match on something.
> Is that GPA only?
> 
> What should happen if you define or zero-size something that is in the
> middle of a previously created region? I assume it's an error?

The answer to these should mirror KVM_SET_USER_MEMORY_REGION unless
there is a strong reason to behave differently.

> What about the fd being closed before/after you delete a region?

The kernel will fget() the struct file and hold that reference while the
region is in use, so the file is not freed even if userspace closes the
fd.

Ownership of the fd belongs to userspace, so userspace must close the fd
after deleting the region from KVM.  This is the same as with
KVM_IOEVENTFD.

> How small can the region be? Can it be just the size of a doorbell
> register if all other registers for the device could be efficiently
> implemented using memory writes?

The minimum size is 1 byte.

The recommended way of using this API is one region per QEMU
MemoryRegion or VFIO struct vfio_region_info.  Providing finer-grained
regions to KVM is only useful if they differ in the flags.
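
A minimal sketch of what "one region per MemoryRegion, finer-grained only
when the flags differ" could look like on the VMM side.  The struct and
flag values are copied from the proposal; ioregion_describe() is a
hypothetical helper, and no ioctl is issued because this thread does not
define a request number.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Layout copied from the proposal, spelled with <stdint.h> types so it
 * builds outside the kernel headers. */
struct kvm_ioregionfd {
    uint64_t guest_physical_addr;
    uint64_t memory_size; /* bytes */
    int32_t  fd;
    uint32_t region_id;
    uint32_t flags;
    uint8_t  pad[36];
};

static_assert(sizeof(struct kvm_ioregionfd) == 64, "unexpected layout");

/* for kvm_ioregionfd::flags */
#define KVM_IOREGIONFD_PIO           (1u << 0)
#define KVM_IOREGIONFD_POSTED_WRITES (1u << 1)

/* Hypothetical helper: fill in the descriptor for one region.
 * A zero size is rejected here because the proposal uses
 * memory_size == 0 to mean "delete". */
static struct kvm_ioregionfd
ioregion_describe(uint64_t gpa, uint64_t size, int fd,
                  uint32_t region_id, uint32_t flags)
{
    struct kvm_ioregionfd r;

    assert(size > 0);
    memset(&r, 0, sizeof(r)); /* also zeroes pad[] */
    r.guest_physical_addr = gpa;
    r.memory_size = size;
    r.fd = fd;
    r.region_id = region_id;
    r.flags = flags;
    return r;
}
```

For example, a device whose register block needs synchronous emulation
and whose doorbell page does not could be described as two regions that
share one fd and differ only in KVM_IOREGIONFD_POSTED_WRITES.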

> >
> > MMIO is the default.  The KVM_IOREGIONFD_PIO flag selects PIO instead.
> 
> Just curious: what use case do you see for PIO? Isn't that detrimental
> to your goal for this to be high-performance and cross-platform?

PCI devices have I/O Space BARs so there must be a way to support them.

> > The region_id is an opaque token that is included as part of the write
> > to the file descriptor.  It is typically a unique identifier for this
> > region but KVM does not interpret its value.
> >
> > Both read and write guest accesses wait until an acknowledgement is
> > received on the file descriptor.
> 
> By "acknowledgement", do you mean data has been read or written on the
> other side, or something else?

The response (struct ioregionfd_resp) has been received.

> > The KVM_IOREGIONFD_POSTED_WRITES flag
> > skips waiting for an acknowledgement on write accesses.  This is
> > suitable for accesses that do not require synchronous emulation, such as
> > doorbell register writes.
> >
> > Wire protocol
> > -------------
> > The protocol spoken over the file descriptor is as follows.  The device
> > reads commands from the file descriptor with the following layout:
> >
> > struct ioregionfd_cmd {
> >     __u32 info;
> >     __u32 region_id;
> >     __u64 addr;
> >     __u64 data;
> >     __u8 pad[8];
> > };
> >
> > /* for ioregionfd_cmd::info */
> > #define IOREGIONFD_CMD_MASK 0xf
> > # define IOREGIONFD_CMD_READ 0
> > # define IOREGIONFD_CMD_WRITE 1
> 
> Maybe "GUEST_READ" and "GUEST_WRITE"?

There are use cases beyond virtualization, like testing or maybe
a "vfio-user-loopback" device.  Let's avoid the term "guest" for the
wire protocol (obviously it's fine when talking about the KVM API).

> > #define IOREGIONFD_SIZE_MASK 0x30
> > #define IOREGIONFD_SIZE_SHIFT 4
> > # define IOREGIONFD_SIZE_8BIT 0
> > # define IOREGIONFD_SIZE_16BIT 1
> > # define IOREGIONFD_SIZE_32BIT 2
> > # define IOREGIONFD_SIZE_64BIT 3
> > #define IOREGIONFD_NEED_PIO (1u << 6)
> > #define IOREGIONFD_NEED_RESPONSE (1u << 7)
> >
> > The command is interpreted by inspecting the info field:
> >
> >   switch (cmd.info & IOREGIONFD_CMD_MASK) {
> >   case IOREGIONFD_CMD_READ:
> >       /* It's a read access */
> >       break;
> >   case IOREGIONFD_CMD_WRITE:
> >       /* It's a write access */
> >       break;
> >   default:
> >       /* Protocol violation, terminate connection */
> >   }
> >
> > The access size is interpreted by inspecting the info field:
> >
> >   unsigned size = (cmd.info & IOREGIONFD_SIZE_MASK) >> IOREGIONFD_SIZE_SHIFT;
> >   /* where nbytes = pow(2, size) */
> 
> What about providing an IOREGIONFD_SIZE(cmd) macro to do that?

Good idea.
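
One possible shape for such a macro, as a sketch only.  The name
IOREGIONFD_SIZE(cmd) comes from the suggestion above; whether it should
yield the 2-bit size code or the byte count is not settled in this
thread, so the second macro, IOREGIONFD_NBYTES(cmd), is a hypothetical
addition that returns the byte count.

```c
#include <assert.h>
#include <stdint.h>

struct ioregionfd_cmd {
    uint32_t info;
    uint32_t region_id;
    uint64_t addr;
    uint64_t data;
    uint8_t  pad[8];
};

static_assert(sizeof(struct ioregionfd_cmd) == 32, "unexpected layout");

/* for ioregionfd_cmd::info, values copied from the proposal */
#define IOREGIONFD_CMD_MASK    0xf
#define IOREGIONFD_CMD_READ    0
#define IOREGIONFD_CMD_WRITE   1
#define IOREGIONFD_SIZE_MASK   0x30
#define IOREGIONFD_SIZE_SHIFT  4
#define IOREGIONFD_SIZE_8BIT   0
#define IOREGIONFD_SIZE_16BIT  1
#define IOREGIONFD_SIZE_32BIT  2
#define IOREGIONFD_SIZE_64BIT  3

/* Size code (0..3) of the access described by cmd. */
#define IOREGIONFD_SIZE(cmd) \
    (((cmd).info & IOREGIONFD_SIZE_MASK) >> IOREGIONFD_SIZE_SHIFT)

/* Hypothetical companion: access width in bytes, i.e. 2^size code. */
#define IOREGIONFD_NBYTES(cmd) (1u << IOREGIONFD_SIZE(cmd))
```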

> >
> > The region_id indicates which MMIO/PIO region is being accessed.  This
> > field has no inherent structure but is typically a unique identifier.
> >
> > The byte offset being accessed within that region is addr.
> >
> > If the command is IOREGIONFD_CMD_WRITE then data contains the value
> > being written.
> 
> I assume if the guest writes an 8-bit 42, data contains a 64-bit 42
> irrespective of guest and host endianness.

Yes, the data field is native-endian.

> >
> > MMIO is the default.  The IOREGIONFD_NEED_PIO flag is set on PIO
> > accesses.
> >
> > When IOREGIONFD_NEED_RESPONSE is set on a IOREGIONFD_CMD_WRITE command,
> > no response must be sent.  This flag has no effect for
> > IOREGIONFD_CMD_READ commands.
> 
> I find this paragraph confusing. "NEED_RESPONSE" seems to imply the
> response must be sent. Typo? Or do I misunderstand who is supposed to
> send the response?

This was a typo.  It should be "NO_RESPONSE" :).

> Could you clarify the reason for having both POSTED_WRITES and NEED_RESPONSE?

The NO_RESPONSE bit will be set in struct ioregionfd_cmd when the region
has the POSTED_WRITES flag.

We could eliminate this flag from the wire protocol and assume that the
device emulation program knows that certain writes do not have a
response, but it's more flexible to include it.

Also, commands added to the wire protocol in the future might also want
to skip the response, so I think a general-purpose "NO_RESPONSE" name is
better than calling it "POSTED_WRITES" at the wire protocol level.
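
To make the device side concrete, here is a rough sketch of handling one
command with the flag renamed as described above.  Assumptions:
IOREGIONFD_NO_RESPONSE keeps bit 7, the device is a single hypothetical
64-bit register (dev_reg), the fd is one end of a
socketpair(AF_UNIX, SOCK_STREAM), and short reads/writes are treated as
errors to keep the example small.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

struct ioregionfd_cmd {
    uint32_t info;
    uint32_t region_id;
    uint64_t addr;
    uint64_t data;
    uint8_t  pad[8];
};

struct ioregionfd_resp {
    uint64_t data;
    uint32_t info;
    uint8_t  pad[20];
};

static_assert(sizeof(struct ioregionfd_cmd) == 32, "unexpected layout");
static_assert(sizeof(struct ioregionfd_resp) == 32, "unexpected layout");

#define IOREGIONFD_CMD_MASK    0xf
#define IOREGIONFD_CMD_READ    0
#define IOREGIONFD_CMD_WRITE   1
#define IOREGIONFD_NO_RESPONSE (1u << 7) /* assumed: renamed NEED_RESPONSE bit */
#define IOREGIONFD_RESP_FAILED (1u << 0)

static uint64_t dev_reg; /* hypothetical device state */

/* Hypothetical helper: sv[0] stands in for the KVM side, sv[1] for the device. */
static int ioregionfd_make_channel(int sv[2])
{
    return socketpair(AF_UNIX, SOCK_STREAM, 0, sv);
}

/* Handle one command.  Returns 1 if a response was sent, 0 if the
 * command asked for none, -1 on I/O error or protocol violation. */
static int ioregionfd_handle_one(int fd)
{
    struct ioregionfd_cmd cmd;
    struct ioregionfd_resp resp;

    if (read(fd, &cmd, sizeof(cmd)) != (ssize_t)sizeof(cmd))
        return -1;

    memset(&resp, 0, sizeof(resp)); /* info == 0 means success */

    switch (cmd.info & IOREGIONFD_CMD_MASK) {
    case IOREGIONFD_CMD_READ:
        resp.data = dev_reg; /* reads always get a response */
        break;
    case IOREGIONFD_CMD_WRITE:
        dev_reg = cmd.data;
        if (cmd.info & IOREGIONFD_NO_RESPONSE)
            return 0; /* posted write: nothing to send back */
        break;
    default:
        return -1; /* protocol violation, caller terminates connection */
    }

    if (write(fd, &resp, sizeof(resp)) != (ssize_t)sizeof(resp))
        return -1;
    return 1;
}
```

Because the flag travels with each command, the handler does not need to
know how the region was registered on the KVM side.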

> >
> > The device sends responses by writing the following structure to the
> > file descriptor:
> >
> > struct ioregionfd_resp {
> >     __u64 data;
> >     __u32 info;
> >     __u8 pad[20];
> > };
> 
> I know you manually optimized for intra-padding here, but do we rule
> out 128-bit data forever? :-)

Yeah, I think so :).

> 
> >
> > /* for ioregionfd_resp::info */
> > #define IOREGIONFD_RESP_FAILED (1u << 0)
> 
> What happens when FAILED is set?
> - If the guest still reads data, then how does it know read failed?
> - Otherwise, what happens?

This is a good question.  I don't have a detailed list of errors and how
they would be handled by KVM yet.

> I understand the intent is for the resp to come in the same order as
> the cmd. Is it OK for the same region to be accessed by different vCPUs?
> If so, where do you keep the information about the vCPU that did a cmd
> in order to be able to dispatch the resp back to the vCPU that initiated
> the operation? [Answer below seems to imply that you don't and just
> block the second vCPU in that case]

Yep, the second vCPU waits until the first one is done.

> >
> > The info field is zer oon success.
> 
> typo "zero on"

Thanks!

> 
> > The IOREGIONFD_RESP_FAILED flag is set on failure.
> 
> The device sets it (active voice), or are there other conditions where
> it can be set (maybe state of the fd)?

No other conditions (yet?).

> >
> > The data field contains the value read by an IOREGIONFD_CMD_READ
> > command.  This field is zero for other commands.
> >
> > Does it support polling?
> > ------------------------
> > Yes, use io_uring's IORING_OP_READ to submit an asynchronous read on the
> > file descriptor.  Poll the io_uring cq ring to detect when the read has
> > completed.
> >
> > Although this dispatch mechanism incurs more overhead than polling
> > directly on guest RAM, it overcomes the limitations of polling: it
> > supports read accesses as well as capturing written values instead of
> > overwriting them.
> >
> > Does it obsolete ioeventfd?
> > ---------------------------
> > No, although KVM_IOREGIONFD_POSTED_WRITES offers somewhat similar
> > functionality to ioeventfd, there are differences.  The datamatch
> > functionality of ioeventfd is not available and would need to be
> > implemented by the device emulation program.  Due to the counter
> > semantics of eventfds there is automatic coalescing of repeated accesses
> > with ioeventfd.  Overall ioeventfd is lighter weight but also more
> > limited.
> >
> > How does it scale?
> > ------------------
> > The protocol is synchronous - only one command/response cycle is in
> > flight at a time.  The vCPU will be blocked until the response has been
> > processed anyway.  If another vCPU accesses an MMIO or PIO region with
> > the same file descriptor during this time then it will wait too.
> >
> > In practice this is not a problem since per-queue file descriptors can
> > be set up for multi-queue devices.
> 
> Can a guest write be blocked if user-space is slow reading the fd?

Yes.  vmexits block the vCPU.

POSTED_WRITES avoids this but can only be used when the semantics of
the registers allow it (e.g. doorbell registers).  Also, if the fd
write(2) blocks (the socket sndbuf is full) then even a POSTED_WRITES
vCPU blocks.
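
The sndbuf limit is easy to observe from userspace.  The sketch below is
only an analogy for what the kernel-side write would run into: it queues
32-byte dummy commands on a non-blocking AF_UNIX socketpair that nobody
reads from, and stops when write(2) refuses more.  The helper name
fill_until_would_block() is made up for this example.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Queue dummy 32-byte commands until the socket buffer is full.
 * Returns the number of bytes queued, or -1 on setup failure.
 * *saved_errno receives the errno of the write that failed. */
static long fill_until_would_block(int *saved_errno)
{
    int sv[2];
    int fl;
    char cmd[32] = {0};
    long queued = 0;

    assert(saved_errno != NULL);

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;

    /* Non-blocking so the demo reports EAGAIN instead of sleeping,
     * which is what a blocking writer would do at this point. */
    fl = fcntl(sv[0], F_GETFL);
    if (fl < 0 || fcntl(sv[0], F_SETFL, fl | O_NONBLOCK) < 0) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }

    for (;;) {
        ssize_t n = write(sv[0], cmd, sizeof(cmd));
        if (n < 0) {
            *saved_errno = errno;
            break;
        }
        queued += n;
    }

    close(sv[0]);
    close(sv[1]);
    return queued;
}
```

A device emulator that drains its fd promptly keeps posted writes from
ever reaching this point.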

> What about a guest read? Since the vCPU is blocked anyway, could you
> elaborate how the proposed switch to user-space improves relative to the
> existing one? Seems like a possible win if you have some free CPU that
> can pick up the user-space. If you need to steal a running CPU for your
> user-space, it's less clear to me that there is a win (limit case being
> a single host CPU where you'd just ping-pong between processes).

Today ioctl(KVM_RUN) exits to QEMU, which then has to forward the access
to another process/thread.  That's 2 wakeups.

With ioregionfd the access is directly handled by the device emulator
process.  That's 1 wakeup.

Plus the response needs to make the trip back.

Stefan
