On Mon, 2011-09-26 at 12:04 +0200, Alexander Graf wrote:
> On 26.09.2011, at 09:51, David Gibson <david@xxxxxxxxxxxxxxxxxxxxx> wrote:
> 
> > On Fri, Sep 09, 2011 at 08:11:54AM -0500, Stuart Yoder wrote:
> >> Based on the discussions over the last couple of weeks
> >> I have updated the device fd file layout proposal and
> >> tried to specify it a bit more formally.
> >> 
> >> ===============================================================
> >> 
> >> 1. Overview
> >> 
> >> This specification describes the layout of device files
> >> used in the context of vfio, which gives user space
> >> direct access to I/O devices that have been bound to
> >> vfio.
> >> 
> >> When a device fd is opened and read, offset 0x0 contains
> >> a fixed sized header followed by a number of variable length
> >> records that describe different characteristics
> >> of the device -- addressable regions, interrupts, etc.
> >> 
> >>    0x0  +-------------+-------------+
> >>         | magic       | u32          // identifies this as a vfio device
> >>         +---------------------------+ //   file and identifies the type of bus
> >>         | version     | u32          // specifies the version of this
> >>         +---------------------------+
> >>         | flags       | u32          // encodes any flags
> >>         +---------------------------+
> >>         | dev info record 0         |
> >>         |   type      | u32          // type of record
> >>         |   rec_len   | u32          // length in bytes of record
> >>         |             |              //   (including record header)
> >>         |   flags     | u32          // type specific flags
> >>         |   ...content...           | // record content, which could
> >>         +---------------------------+ //   include sub-records
> >>         | dev info record 1         |
> >>         +---------------------------+
> >>         | dev info record N         |
> >>         +---------------------------+
> > 
> > I really should have chimed in on this earlier, but I've been very
> > busy.
> > 
> > Um, not to put too fine a point on it, this is madness.
> > 
> > Yes, it's very flexible and can thereby cover a very wide range of
> > cases.  But it's much, much too complex.  Userspace has to parse a
> > complex, multilayered data structure, with variable length elements,
> > just to get an address at which to do IO.  I can pretty much guarantee
> > that if we went with this, most userspace programs using this
> > interface would just ignore the metadata and directly map the
> > offsets at which they happen to know the kernel will put things for
> > the type of device they care about.
> > 
> > _At least_ for PCI, I think the original VFIO layout of each BAR at a
> > fixed, well known offset is much better.  Despite its limitations,
> > just advertising a "device type" ID which describes one of a few fixed
> > layouts would be preferable to this.  I'm still hoping that we can do
> > a bit better than that.  But we should try really hard to at the very
> > least force the metadata into a simple array of resources, each with a
> > fixed size record describing it, even if it means some space wastage
> > with occasionally-used fields.  Anything more complex than that and
> > userspace is just never going to use it properly.
> 
> We will have 2 different types of user space. One wants to be as
> generic as possible and needs all this dynamic information. QEMU would
> fall into this category.
> 
> The other one is device specific drivers in user space. Here
> hardcoding might make sense.
> 
> For the generic interface, we need something that is as verbose as
> possible and lets us enumerate all the device properties, so we can
> properly map and forward them to the guest.
> 
> However, nothing keeps us from mapping BARs always at static offsets
> into the file. If you don't need the generic info, then you don't need
> it.
> 
> Also, if you can come up with an interface that does not have variable
> length descriptors but is still able to export all the required
> generic information, please send a proposal to the list :)

Hi,

The other obvious possibility is a pure ioctl interface.  To match
what this proposal is trying to describe, plus the runtime interfaces,
we'd need something like:

/* :0 - PCI devices, :1 - Device tree devices, 63:2 - reserved */
#define VFIO_DEVICE_GET_FLAGS               _IOR(, , u64)

/* Return number of mmio/iop/config regions.
 * For PCI this is always 8 (BAR0-5 + ROM + Config) */
#define VFIO_DEVICE_GET_NUM_REGIONS         _IOR(, , int)

/* Return length for region index (may be zero) */
#define VFIO_DEVICE_GET_REGION_LEN          _IOWR(, , u64)

/* Return flags for region index
 *  :0 - mmap'able, :1 - read-only, 63:2 - reserved */
#define VFIO_DEVICE_GET_REGION_FLAGS        _IOR(, , u64)

/* Return file offset for region index */
#define VFIO_DEVICE_GET_REGION_OFFSET       _IOWR(, , u64)

/* Return physical address for region index - not implemented for PCI */
#define VFIO_DEVICE_GET_REGION_PHYS_ADDR    _IOWR(, , u64)

/* Return number of IRQs (not including MSI/MSI-X for PCI) */
#define VFIO_DEVICE_GET_NUM_IRQ             _IOR(, , int)

/* Set IRQ eventfd for IRQ index: arg[0] = index, arg[1] = fd */
#define VFIO_DEVICE_SET_IRQ_EVENTFD         _IOW(, , int)

/* Unmask IRQ index */
#define VFIO_DEVICE_UNMASK_IRQ              _IOW(, , int)

/* Set unmask eventfd for index: arg[0] = index, arg[1] = fd */
#define VFIO_DEVICE_SET_UNMASK_IRQ_EVENTFD  _IOW(, , int)

/* Return the device tree path for type/index into the user
 * allocated buffer */
struct dtpath {
        u32     type;           /* 0 = region, 1 = IRQ */
        u32     index;
        u32     buf_len;
        char    *buf;
};
#define VFIO_DEVICE_GET_DTPATH              _IOWR(, , struct dtpath)

/* Return the device tree index for type/index */
struct dtindex {
        u32     type;           /* 0 = region, 1 = IRQ */
        u32     index;
        u32     prop_type;
        u32     prop_index;
};
#define VFIO_DEVICE_GET_DTINDEX             _IOWR(, , struct dtindex)

/* Reset the device */
#define VFIO_DEVICE_RESET                   _IO(, ,)

/* PCI MSI setup: arg[0] = #, arg[1-n] = eventfds */
#define VFIO_DEVICE_PCI_SET_MSI_EVENTFDS    _IOW(, , int)
#define VFIO_DEVICE_PCI_SET_MSIX_EVENTFDS   _IOW(, , int)

Hope that covers it.
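For illustration, userspace consumption could look roughly like this.
It's a sketch only: it assumes the blank ioctl numbers above get filled
in, that the per-region _IOWR calls pass the index in through the same
u64 the result comes back in, and it treats GET_REGION_FLAGS the same
way even though it's declared _IOR above:

#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/eventfd.h>
#include <sys/mman.h>

static void probe_device(int fd)        /* fd: an already-open device fd */
{
        int i, num = 0;

        ioctl(fd, VFIO_DEVICE_GET_NUM_REGIONS, &num);

        for (i = 0; i < num; i++) {
                /* index in, value out through the same u64 */
                uint64_t len = i, off = i, flags = i;

                ioctl(fd, VFIO_DEVICE_GET_REGION_LEN, &len);
                ioctl(fd, VFIO_DEVICE_GET_REGION_OFFSET, &off);
                ioctl(fd, VFIO_DEVICE_GET_REGION_FLAGS, &flags);

                if (len && (flags & 1)) /* bit 0: mmap'able */
                        mmap(NULL, len, PROT_READ | PROT_WRITE,
                             MAP_SHARED, fd, off);
                /* else: pread()/pwrite() the fd at 'off' */
        }

        /* wire IRQ index 0 to an eventfd: arg[0] = index, arg[1] = fd */
        int irq_args[2] = { 0, eventfd(0, 0) };
        ioctl(fd, VFIO_DEVICE_SET_IRQ_EVENTFD, irq_args);
}

The dtpath/dtindex and MSI calls would follow the same
fill-in-a-struct / pass-an-array pattern.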
Something I prefer about this interface is that everything can easily
be generated on the fly, whereas reading out a table from the device
means we really need to have that table somewhere in kernel memory to
easily support reading random offsets.

Thoughts?

Thanks,

Alex