On Thu, Jan 21, 2021 at 04:09:09PM +0000, John Levon wrote: > > Hi, please take a look. For context, this addition is against the current > https://github.com/nutanix/libvfio-user/blob/master/docs/vfio-user.rst > specification. > > kvm@ readers: Stefan suggested this may be of interest from a VFIO point of > view, in case there is any potential cross-over in defining how to wire up these > fds. After talking to Alex Williamson about including these APIs in the kernel VFIO ioctls it became clear to me that there is an architectural difference between classic vfio-pci devices and vfio-user devices: The current model with kernel VFIO is that the VMM sets up ioeventfds using ioctl(VFIO_DEVICE_IOEVENTFD). The VMM has device-specific knowledge that allows it to add ioeventfds for specific hardware registers. In vfio-user the VMM has no knowledge of the specific device. Instead the device tells the VMM how to configure ioeventfds and ioregionfds. There are a few reasons for this: 1. It is not practical to add knowledge of every vfio-user device implementation into the VMM. A cleaner solution is for the device to advertise its desired configuration to the VMM instead. 2. The device implementation may prefer a specific ioeventfd or ioregionfd configuration to fit its multi-threading architecture. A multi-threaded (multi-queue) device implementation may want ioeventfds and ioregionfds to be configured so that the right thread receives certain MMIO/PIO accesses. 3. The device implementation may want to specify ioregionfd's user_data value. This is the application-specific value that is passed along with an MMIO/PIO access. I think mdev fits in somewhere between these two models. I can imagine complex mdev devices wanting to control the ioeventfd and ioregionfd configuration just like vfio-user. But in simple cases it doesn't matter. I suggest including John's ioeventfd/ioregionfd work into the kernel VFIO interface so mdev devices can take advantage of it too. > Note that this is a new message instead of adding a new region capability to > VFIO_USER_DEVICE_GET_REGION_INFO: with a new capability, there's no way for the > server to know if ioeventfd/ioregionfd is actually used/requested by the client > (the client would just have to close those fds if it didn't want to use FDs). An > explicit new call is much clearer for this. > > The ioregionfd feature itself is at proposal stage, so there's a good chance of > further changes depending on that. > > I also have these pending issues so far: > > 1) I'm not familiar with CCW bus, so don't know if > KVM_IOEVENTFD_FLAG_VIRTIO_CCW_NOTIFY flag makes sense or is supportable in > vfio-user context > > 2) if ioregionfd subsumes all ioeventfd use cases, we can remove the distinction > from this API, and the client caller can translate to ioregionfd or ioeventfd as > needed > > 3) is it neccessary for the client to indicate support (e.g. lacking ioregionfd > support) ? > > 4) (ioregionfd issue) region_id is 4 bytes, which seems a little awkward from > the server side: this has to encode both the region ID as well as the offset of > the sub-region within that region. Can this be 64 bits wide? The user_data field in Elena's ioregionfd patches is 64 bits. Does this solve the problem? > > thanks > john > > (NB: I edited the diff so the new sections are more readable.) > > diff --git a/docs/vfio-user.rst b/docs/vfio-user.rst > index e3adc7a..e7274a2 100644 > --- a/docs/vfio-user.rst > +++ b/docs/vfio-user.rst > @@ -161,6 +161,17 @@ in the region info reply of a device-specific region. Such regions are reflected > in ``struct vfio_device_info.num_regions``. Thus, for PCI devices this value can > be equal to, or higher than, VFIO_PCI_NUM_REGIONS. > > +Region I/O via file descriptors > +------------------------------- > + > +For unmapped regions, region I/O from the client is done via > +VFIO_USER_REGION_READ/WRITE. As an optimization, ioeventfds or ioregionfds may > +be configured for sub-regions of some regions. A client may request information > +on these sub-regions via VFIO_USER_DEVICE_GET_REGION_IO_FDS; by configuring the > +returned file descriptors as ioeventfds or ioregionfds, the server can be > +directly notified of I/O (for example, by KVM) without taking a trip through the > +client. > + > Interrupts > ^^^^^^^^^^ > The client uses VFIO_USER_DEVICE_GET_IRQ_INFO messages to query the server for > @@ -293,37 +304,39 @@ Commands > The following table lists the VFIO message command IDs, and whether the > message command is sent from the client or the server. > > -+----------------------------------+---------+-------------------+ > -| Name | Command | Request Direction | > -+==================================+=========+===================+ > -| VFIO_USER_VERSION | 1 | client -> server | > -+----------------------------------+---------+-------------------+ > -| VFIO_USER_DMA_MAP | 2 | client -> server | > -+----------------------------------+---------+-------------------+ > -| VFIO_USER_DMA_UNMAP | 3 | client -> server | > -+----------------------------------+---------+-------------------+ > -| VFIO_USER_DEVICE_GET_INFO | 4 | client -> server | > -+----------------------------------+---------+-------------------+ > -| VFIO_USER_DEVICE_GET_REGION_INFO | 5 | client -> server | > -+----------------------------------+---------+-------------------+ > -| VFIO_USER_DEVICE_GET_IRQ_INFO | 6 | client -> server | > -+----------------------------------+---------+-------------------+ > -| VFIO_USER_DEVICE_SET_IRQS | 7 | client -> server | > -+----------------------------------+---------+-------------------+ > -| VFIO_USER_REGION_READ | 8 | client -> server | > -+----------------------------------+---------+-------------------+ > -| VFIO_USER_REGION_WRITE | 9 | client -> server | > -+----------------------------------+---------+-------------------+ > -| VFIO_USER_DMA_READ | 10 | server -> client | > -+----------------------------------+---------+-------------------+ > -| VFIO_USER_DMA_WRITE | 11 | server -> client | > -+----------------------------------+---------+-------------------+ > -| VFIO_USER_VM_INTERRUPT | 12 | server -> client | > -+----------------------------------+---------+-------------------+ > -| VFIO_USER_DEVICE_RESET | 13 | client -> server | > -+----------------------------------+---------+-------------------+ > -| VFIO_USER_DIRTY_PAGES | 14 | client -> server | > -+----------------------------------+---------+-------------------+ > ++------------------------------------+---------+-------------------+ > +| Name | Command | Request Direction | > ++====================================+=========+===================+ > +| VFIO_USER_VERSION | 1 | client -> server | > ++------------------------------------+---------+-------------------+ > +| VFIO_USER_DMA_MAP | 2 | client -> server | > ++------------------------------------+---------+-------------------+ > +| VFIO_USER_DMA_UNMAP | 3 | client -> server | > ++------------------------------------+---------+-------------------+ > +| VFIO_USER_DEVICE_GET_INFO | 4 | client -> server | > ++------------------------------------+---------+-------------------+ > +| VFIO_USER_DEVICE_GET_REGION_INFO | 5 | client -> server | > ++------------------------------------+---------+-------------------+ > +| VFIO_USER_DEVICE_GET_REGION_IO_FDS | 6 | client -> server | > ++------------------------------------+---------+-------------------+ > +| VFIO_USER_DEVICE_GET_IRQ_INFO | 7 | client -> server | > ++------------------------------------+---------+-------------------+ > +| VFIO_USER_DEVICE_SET_IRQS | 8 | client -> server | > ++------------------------------------+---------+-------------------+ > +| VFIO_USER_REGION_READ | 9 | client -> server | > ++------------------------------------+---------+-------------------+ > +| VFIO_USER_REGION_WRITE | 10 | client -> server | > ++------------------------------------+---------+-------------------+ > +| VFIO_USER_DMA_READ | 11 | server -> client | > ++------------------------------------+---------+-------------------+ > +| VFIO_USER_DMA_WRITE | 12 | server -> client | > ++------------------------------------+---------+-------------------+ > +| VFIO_USER_VM_INTERRUPT | 13 | server -> client | > ++------------------------------------+---------+-------------------+ > +| VFIO_USER_DEVICE_RESET | 14 | client -> server | > ++------------------------------------+---------+-------------------+ > +| VFIO_USER_DIRTY_PAGES | 15 | client -> server | > ++------------------------------------+---------+-------------------+ > > > .. Note:: Some VFIO defines cannot be reused since their values are > @@ -1130,6 +1143,161 @@ client must write data on the same order and transction size as read. > If an error occurs then the server must fail the read or write operation. It is > an implementation detail of the client how to handle errors. > > VFIO_USER_DEVICE_GET_REGION_IO_FDS > ---------------------------------- > > Message format > ^^^^^^^^^^^^^^ > > +--------------+------------------------+ > | Name | Value | > +==============+========================+ > | Message ID | <ID> | > +--------------+------------------------+ > | Command | 6 | > +--------------+------------------------+ > | Message size | 32 + subregion info | > +--------------+------------------------+ > | Flags | Reply bit set in reply | > +--------------+------------------------+ > | Error | 0/errno | > +--------------+------------------------+ > | Region info | Region IO FD info | > +--------------+------------------------+ > > Clients can access regions via VFIO_USER_REGION_READ/WRITE or, if provided, by > mmap()ing a file descriptor provided by the server. > > VFIO_USER_DEVICE_GET_REGION_IO_FDS provides an alternative access mechanism via > file descriptors. This is an optional feature intended for performance > improvements where an underlying sub-system (such as KVM) supports communication > across such file descriptors to the vfio-user server, without needing to > round-trip through the client. > > The server returns an array describing sub-regions of the given region along > with the server specifies a set of sub-regions and the requested file descriptor > notification mechanism to use for that sub-region. Each sub-region in the > response message may choose to use a different method, as defined below. The > two mechanisms supported in this specification are ioeventfds and ioregionfds. > > A client should then hook up the returned file descriptors with the notification > method requested. > > Region IO FD info format > ^^^^^^^^^^^^^^^^^^^^^^^^ > > +------------+--------+------+ > | Name | Offset | Size | > +============+========+======+ > | argsz | 16 | 4 | > +------------+--------+------+ > | flags | 20 | 4 | > +------------+--------+------+ > | index | 24 | 4 | > +------------+--------+------+ > | count | 28 | 4 | > +------------+--------+------+ > | subregions | 32 | ... | > +------------+--------+------+ > > * *argsz* is the size of the region IO FD info structure plus the > total size of the subregion array. Thus, each array entry "i" is at offset > i * ((argsz - 32) / count) > * *flags* must be zero > * *index* is the index of memory region being queried > * *count* is the number of sub-regions in the array > * *subregions* is the array of Sub-Region IO FD info structures > > The client must set ``flags`` to zero and specify the region being queried in > the ``index``. > > The client sets the ``argsz`` field to indicate the maximum size of the response > that the server can send, which must be at least the size of the response header > plus space for the subregion array. If the full response size exceeds ``argsz``, > then the server must respond only with the response header and the Region IO FD > info structure, setting in ``argsz`` the buffer size required to store the full > response. In this case, no file descriptors are passed back. The client then > retries the operation with a larger receive buffer. > > The reply message will additionally include at least one file descriptor in the > ancillary data. Note that more than one subregion may share the same file > descriptor. > > Each sub-region given in the response has one of two possible structures, > depending whether *type* is `VFIO_USER_IO_FD_TYPE_IOEVENTFD` or > `VFIO_USER_IO_FD_TYPE_IOREGIONFD`: > > Sub-Region IO FD info format (ioeventfd) > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > +-----------+--------+------+ > | Name | Offset | Size | > +===========+========+======+ > | offset | 0 | 8 | > +-----------+--------+------+ > | size | 8 | 8 | > +-----------+--------+------+ > | fd_index | 16 | 4 | > +-----------+--------+------+ > | type | 20 | 4 | > +-----------+--------+------+ > | flags | 24 | 4 | > +-----------+--------+------+ > | padding | 28 | 4 | > +-----------+--------+------+ > | datamatch | 32 | 8 | > +-----------+--------+------+ > > * *offset* is the offset of the start of the sub-region within the region > requested ("physical address offset" for the region) > * *size* is the length of the sub-region. This may be zero if the access size is > not relevant, which may allow for optimizations > * *fd_index* is the index in the ancillary data of the FD to use for ioeventfd > notification; it may be shared. > * *type* is `VFIO_USER_IO_FD_TYPE_IOEVENTFD` > * *flags* is any of: > * `KVM_IOEVENTFD_FLAG_DATAMATCH` > * `KVM_IOEVENTFD_FLAG_PIO` > * `KVM_IOEVENTFD_FLAG_VIRTIO_CCW_NOTIFY` (FIXME: makes sense?) > * *datamatch* is the datamatch value if needed > > See https://www.kernel.org/doc/Documentation/virtual/kvm/api.txt 4.59 > KVM_IOEVENTFD for further context on the ioeventfd-specific fields. > > Sub-Region IO FD info format (ioregionfd) > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > +-----------+--------+------+ > | Name | Offset | Size | > +===========+========+======+ > | offset | 0 | 8 | > +-----------+--------+------+ > | size | 8 | 8 | > +-----------+--------+------+ > | fd_index | 16 | 4 | > +-----------+--------+------+ > | type | 20 | 4 | > +-----------+--------+------+ > | flags | 24 | 4 | > +-----------+--------+------+ > | region_id | 28 | 4 | > +-----------+--------+------+ > > * *offset* is the offset of the start of the sub-region within the region > requested ("physical address offset" for the region) > * *size* is the length of the sub-region. FIXME: may allow zero? > * *fd_index* is the index in the ancillary data of the FD to use for ioregionfd > messages; it may be shared > * *type* is `VFIO_USER_IO_FD_TYPE_IOREGIONFD` > * *flags* is any of: > * `KVM_IOREGIONFD_FLAG_PIO` > * `KVM_IOREGIONFD_FLAG_POSTED_WRITES` > * *region_id* is an opaque value passed back to the server via a message on the > file descriptor > > See https://www.spinics.net/lists/kvm/msg208139.html (FIXME) for further context > on the ioregionfd-specific fields. > > VFIO_USER_DEVICE_GET_IRQ_INFO > ----------------------------- > > @@ -1141,7 +1309,7 @@ Message format > +==============+========================+ > | Message ID | <ID> | > +--------------+------------------------+ > -| Command | 6 | > +| Command | 7 | > +--------------+------------------------+ > | Message size | 32 | > +--------------+------------------------+ > @@ -1212,7 +1380,7 @@ Message format > +==============+========================+ > | Message ID | <ID> | > +--------------+------------------------+ > -| Command | 7 | > +| Command | 8 | > +--------------+------------------------+ > | Message size | 36 + any data | > +--------------+------------------------+ > @@ -1370,7 +1538,7 @@ Message format > +==============+========================+ > | Message ID | <ID> | > +--------------+------------------------+ > -| Command | 8 | > +| Command | 9 | > +--------------+------------------------+ > | Message size | 32 + data size | > +--------------+------------------------+ > @@ -1397,7 +1565,7 @@ Message format > +==============+========================+ > | Message ID | <ID> | > +--------------+------------------------+ > -| Command | 9 | > +| Command | 10 | > +--------------+------------------------+ > | Message size | 32 + data size | > +--------------+------------------------+ > @@ -1424,7 +1592,7 @@ Message format > +==============+========================+ > | Message ID | <ID> | > +--------------+------------------------+ > -| Command | 10 | > +| Command | 11 | > +--------------+------------------------+ > | Message size | 28 + data size | > +--------------+------------------------+ > @@ -1451,7 +1619,7 @@ Message format > +==============+========================+ > | Message ID | <ID> | > +--------------+------------------------+ > -| Command | 11 | > +| Command | 12 | > +--------------+------------------------+ > | Message size | 28 + data size | > +--------------+------------------------+ > @@ -1478,7 +1646,7 @@ Message format > +================+========================+ > | Message ID | <ID> | > +----------------+------------------------+ > -| Command | 12 | > +| Command | 13 | > +----------------+------------------------+ > | Message size | 20 | > +----------------+------------------------+ > @@ -1515,7 +1683,7 @@ Message format > +==============+========================+ > | Message ID | <ID> | > +--------------+------------------------+ > -| Command | 13 | > +| Command | 14 | > +--------------+------------------------+ > | Message size | 16 | > +--------------+------------------------+ > @@ -1537,7 +1705,7 @@ Message format > +====================+========================+ > | Message ID | <ID> | > +--------------------+------------------------+ > -| Command | 14 | > +| Command | 15 | > +--------------------+------------------------+ > | Message size | 16 | > +--------------------+------------------------+ >
Attachment:
signature.asc
Description: PGP signature