On Mon, 2011-09-26 at 12:04 +0200, Alexander Graf wrote:
> On 26.09.2011, at 09:51, David Gibson <david@xxxxxxxxxxxxxxxxxxxxx> wrote:
> 
> > On Fri, Sep 09, 2011 at 08:11:54AM -0500, Stuart Yoder wrote:
> >> Based on the discussions over the last couple of weeks
> >> I have updated the device fd file layout proposal and
> >> tried to specify it a bit more formally.
> >> 
> >> ===============================================================
> >> 
> >> 1. Overview
> >> 
> >> This specification describes the layout of device files
> >> used in the context of vfio, which gives user space
> >> direct access to I/O devices that have been bound to
> >> vfio.
> >> 
> >> When a device fd is opened and read, offset 0x0 contains
> >> a fixed sized header followed by a number of variable length
> >> records that describe different characteristics
> >> of the device -- addressable regions, interrupts, etc.
> >> 
> >>    0x0  +-------------+-------------+
> >>         | magic       | u32          // identifies this as a vfio device
> >>         +---------------------------+ //   file and identifies the type of bus
> >>         | version     | u32          // specifies the version of this
> >>         +---------------------------+
> >>         | flags       | u32          // encodes any flags
> >>         +---------------------------+
> >>         | dev info record 0         |
> >>         |   type      | u32          // type of record
> >>         |   rec_len   | u32          // length in bytes of record
> >>         |             |              //   (including record header)
> >>         |   flags     | u32          // type specific flags
> >>         |   ...content...           | // record content, which could
> >>         +---------------------------+ //   include sub-records
> >>         | dev info record 1         |
> >>         +---------------------------+
> >>         | dev info record N         |
> >>         +---------------------------+
> > 
> > I really should have chimed in on this earlier, but I've been very
> > busy.
> > 
> > Um, not to put too fine a point on it, this is madness.
> > 
> > Yes, it's very flexible and can thereby cover a very wide range of
> > cases.  But it's much, much too complex.  Userspace has to parse a
> > complex, multilayered data structure, with variable length elements,
> > just to get an address at which to do IO.  I can pretty much guarantee
> > that if we went with this, most userspace programs using this
> > interface would just ignore the metadata and directly map the
> > offsets at which they happen to know the kernel will put things for
> > the type of device they care about.
> > 
> > _At least_ for PCI, I think the original VFIO layout of each BAR at a
> > fixed, well known offset is much better.  Despite its limitations,
> > just advertising a "device type" ID which describes one of a few fixed
> > layouts would be preferable to this.  I'm still hoping that we can do
> > a bit better than that.  But we should try really hard to at the very
> > least force the metadata into a simple array of resources, each with a
> > fixed size record describing it, even if it means some space wastage
> > with occasionally-used fields.  Anything more complex than that and
> > userspace is just never going to use it properly.
> 
> We will have 2 different types of user space. One wants to be as
> generic as possible and needs all this dynamic information. QEMU would
> fall into this category.
> 
> The other one is device specific drivers in user space. Here
> hardcoding might make sense.
> 
> For the generic interface, we need something that is as verbose as
> possible and lets us enumerate all the device properties, so we can
> properly map and forward them to the guest.
> 
> However, nothing keeps us from mapping BARs always at static offsets
> into the file. If you don't need the generic info, then you don't need
> it.
> 
> Also, if you can come up with an interface that does not have variable
> length descriptors but is still able to export all the required
> generic information, please send a proposal to the list :)

Hi,

The other obvious possibility is a pure ioctl interface.  To match
what this proposal is trying to describe, plus the runtime interfaces,
we'd need something like:

/* :0 - PCI devices, :1 - Device tree devices, 63:2 - reserved */
#define VFIO_DEVICE_GET_FLAGS               _IOR(, , u64)

/* Return number of mmio/iop/config regions.
 * For PCI this is always 8 (BAR0-5 + ROM + Config) */
#define VFIO_DEVICE_GET_NUM_REGIONS         _IOR(, , int)

/* Return length for region index (may be zero) */
#define VFIO_DEVICE_GET_REGION_LEN          _IOWR(, , u64)

/* Return flags for region index
 *  :0 - mmap'able, :1 - read-only, 63:2 - reserved */
#define VFIO_DEVICE_GET_REGION_FLAGS        _IOR(, , u64)

/* Return file offset for region index */
#define VFIO_DEVICE_GET_REGION_OFFSET       _IOWR(, , u64)

/* Return physical address for region index - not implemented for PCI */
#define VFIO_DEVICE_GET_REGION_PHYS_ADDR    _IOWR(, , u64)

/* Return number of IRQs (not including MSI/MSI-X for PCI) */
#define VFIO_DEVICE_GET_NUM_IRQ             _IOR(, , int)

/* Set IRQ eventfd for IRQ index: arg[0] = index, arg[1] = fd */
#define VFIO_DEVICE_SET_IRQ_EVENTFD         _IOW(, , int)

/* Unmask IRQ index */
#define VFIO_DEVICE_UNMASK_IRQ              _IOW(, , int)

/* Set unmask eventfd for index: arg[0] = index, arg[1] = fd */
#define VFIO_DEVICE_SET_UNMASK_IRQ_EVENTFD  _IOW(, , int)

/* Return the device tree path for type/index into the user
 * allocated buffer */
struct dtpath {
        u32     type;           /* 0 = region, 1 = IRQ */
        u32     index;
        u32     buf_len;
        char    *buf;
};
#define VFIO_DEVICE_GET_DTPATH              _IOWR(, , struct dtpath)

/* Return the device tree index for type/index */
struct dtindex {
        u32     type;           /* 0 = region, 1 = IRQ */
        u32     index;
        u32     prop_type;
        u32     prop_index;
};
#define VFIO_DEVICE_GET_DTINDEX             _IOWR(, , struct dtindex)

/* Reset the device */
#define VFIO_DEVICE_RESET                   _IO(, ,)

/* PCI MSI setup: arg[0] = #, arg[1-n] = eventfds */
#define VFIO_DEVICE_PCI_SET_MSI_EVENTFDS    _IOW(, , int)
#define VFIO_DEVICE_PCI_SET_MSIX_EVENTFDS   _IOW(, , int)

Hope that covers it.
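For illustration, userspace consumption could look roughly like this.
It's a sketch only: it assumes the blank ioctl numbers above get filled
in, that the per-region _IOWR calls pass the index in through the same
u64 the result comes back in, and it treats GET_REGION_FLAGS the same
way even though it's declared _IOR above:

#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/eventfd.h>
#include <sys/mman.h>

static void probe_device(int fd)        /* fd: an already-open device fd */
{
        int i, num = 0;

        ioctl(fd, VFIO_DEVICE_GET_NUM_REGIONS, &num);

        for (i = 0; i < num; i++) {
                /* index in, value out through the same u64 */
                uint64_t len = i, off = i, flags = i;

                ioctl(fd, VFIO_DEVICE_GET_REGION_LEN, &len);
                ioctl(fd, VFIO_DEVICE_GET_REGION_OFFSET, &off);
                ioctl(fd, VFIO_DEVICE_GET_REGION_FLAGS, &flags);

                if (len && (flags & 1)) /* bit 0: mmap'able */
                        mmap(NULL, len, PROT_READ | PROT_WRITE,
                             MAP_SHARED, fd, off);
                /* else: pread()/pwrite() the fd at 'off' */
        }

        /* wire IRQ index 0 to an eventfd: arg[0] = index, arg[1] = fd */
        int irq_args[2] = { 0, eventfd(0, 0) };
        ioctl(fd, VFIO_DEVICE_SET_IRQ_EVENTFD, irq_args);
}

The dtpath/dtindex and MSI calls would follow the same
fill-in-a-struct / pass-an-array pattern.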
Something I prefer about this interface is that everything can easily
be generated on the fly, whereas reading out a table from the device
means we really need to have that table somewhere in kernel memory to
easily support reading random offsets.

Thoughts?

Thanks,

Alex