On Mon, 2010-11-22 at 15:21 -0800, Tom Lyon wrote:
> VFIO "driver" development has moved to a publicly accessible repository
> on github:
>
> 	git://github.com/pugs/vfio-linux-2.6.git
>
> This is a clone of the Linux-2.6 tree with all VFIO changes on the vfio
> branch (which is the default). There is a tag 'vfio-v6' marking the latest
> "release" of VFIO.
>
> In addition, I am open-sourcing my user level code which uses VFIO.
> It is a simple UDP/IP/Ethernet stack supporting 3 different VFIO based
> hardware drivers. This code is available at:
>
> 	git://github.com/pugs/vfio-user-level-drivers.git

So I do have some concerns about this...

First, before I get into the meat of my issues, a quick one about the
interface: why netlink? I find it horrible myself... it just confuses
everything and adds overhead. ioctls would have been a better choice
IMHO.

Now, my actual issues, which in fact extend to the whole "generic" iommu
API that has been added to drivers/pci for "domains", and which in turn
"stains" VFIO in ways that I'm not sure I can use on POWER... I would
appreciate your input on what you think is the best way for me to solve
some of these mismatches between our HW and this design.

Basically, the whole iommu domain infrastructure has been designed
entirely around the idea that you can create "domains", each of which is
an entire address space, and put devices into them. That is sadly not
how the IBM iommus on POWER work today...

I currently have one "shared" DMA address space (per host bridge), but I
can assign regions of it to different devices (and I have limited
filtering capabilities, so basically a bus per region, a device per
region, or a function per region). That essentially means that I cannot
just create a mapping for any DMA address I want; instead, I need some
kind of "allocator" for DMA translations (which we have in the kernel,
i.e. dma_map/unmap use a bitmap allocator).
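To make the constraint concrete, here is a minimal sketch of the kind of
bitmap allocator I mean: DMA addresses are slots carved out of a fixed,
pre-existing window rather than arbitrary addresses the caller picks.
All names and sizes below are hypothetical, purely for illustration;
this is not the actual kernel code.

```c
/*
 * Sketch: bitmap allocator for DMA translation slots inside a fixed,
 * pre-existing DMA window. Hypothetical names/sizes, illustration only.
 */
#include <assert.h>
#include <stdint.h>

#define WINDOW_BASE   0x80000000ULL   /* start of the (small) DMA window */
#define PAGE_SHIFT    12
#define WINDOW_PAGES  64              /* tiny window for the example */

static unsigned long bitmap[WINDOW_PAGES / (8 * sizeof(unsigned long)) + 1];

static int  slot_busy(unsigned n)  { return (bitmap[n / 64] >> (n % 64)) & 1; }
static void slot_set(unsigned n)   { bitmap[n / 64] |=  1UL << (n % 64); }
static void slot_clear(unsigned n) { bitmap[n / 64] &= ~(1UL << (n % 64)); }

/* Find npages contiguous free slots; return a DMA address, or 0 if full. */
static uint64_t dma_window_alloc(unsigned npages)
{
    for (unsigned start = 0; start + npages <= WINDOW_PAGES; start++) {
        unsigned i;
        for (i = 0; i < npages; i++)
            if (slot_busy(start + i))
                break;
        if (i == npages) {
            for (i = 0; i < npages; i++)
                slot_set(start + i);
            return WINDOW_BASE + ((uint64_t)start << PAGE_SHIFT);
        }
    }
    return 0; /* window exhausted */
}

static void dma_window_free(uint64_t dma_addr, unsigned npages)
{
    unsigned start = (unsigned)((dma_addr - WINDOW_BASE) >> PAGE_SHIFT);
    for (unsigned i = 0; i < npages; i++)
        slot_clear(start + i);
}
```

The point being: whoever establishes translations has to go through an
allocator like this (or be told which ranges are legal), rather than
mapping at whatever "virtual" DMA address it fancies.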
I generally have 2 regions per device: one in 32-bit space of quite
limited size (sometimes as small as a 128M window), and one in 64-bit
space that I can make quite large if I need to, enough to map all of
memory if that's really desired, using large pages or something like
that.

Now, that has various consequences for the interfaces between iommu
domains, qemu, and VFIO:

 - I don't quite see how I can translate the concept of domains and of
attaching devices to such domains. The basic idea won't work. The
domains in my case are essentially pre-existing, not created on the fly,
and may contain multiple devices, though I suppose I can assume for now
that we only support KVM pass-through with 1 device == 1 domain. I don't
know how to sort that out if the userspace or kvm code assumes it can
put multiple devices in one domain and have them magically share the
translations... I'm not sure what the right approach is here. I could
make the "Linux" domain some artificial SW construct that contains a
list of the real iommus it is "bound" to and establish translations in
all of them... but that isn't very efficient. If the guest kernel
explicitly uses some iommu PV ops targeting a device, I need to only set
up translations for -that- device, not for everything in the "domain".

 - The code in virt/kvm/iommu.c that assumes it can map the entire guest
memory 1:1 in the IOMMU is just not usable for us that way. We -might-
be able to do that for 64-bit capable devices, since we can create quite
large regions in the 64-bit space, but at the very least we need some
kind of offset, and the guest must know about it...

 - Similar deal with all the code that currently assumes it can pick a
"virtual" address and create a mapping for it. Either we provide an
allocator, or, if we want to keep the flexibility of userspace/kvm
choosing the virtual addresses (preferable), we need to convey some
"ranges" information down to the user.

 - Finally, my guests are always paravirt.
There are well-defined H-calls for inserting/removing DMA translations,
and we are implementing those, since existing kernels already know how
to use them. That means that, overall, I might simply not need to use
any of the above. I.e. I could have my own infrastructure for the iommu,
with my H-calls populating the target iommu directly from the kernel
(kvm), or from qemu (via ioctls in the non-kvm case). That might be the
best option... but it would mean somewhat disentangling VFIO from
uiommu...

Any suggestions? Great ideas?

Cheers,
Ben.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html