On Mon, Oct 11, 2021 at 05:02:01PM +1100, David Gibson wrote:

> > This means we cannot define an input that has a magic HW specific
> > value.
>
> I'm not entirely sure what you mean by that.

I mean if you make a general property 'foo' that userspace must
specify correctly then your API isn't general anymore. Userspace must
know if it is A or B HW to set foo=A or foo=B.

Supported IOVA ranges are easily like that, as every IOMMU is
different. So DPDK shouldn't provide such specific or binding
information.

> No, I don't think that needs to be a condition. I think it's
> perfectly reasonable for a constraint to be given, and for the host
> IOMMU to just say "no, I can't do that". But that does mean that each
> of these values has to have an explicit way of userspace specifying "I
> don't care", so that the kernel will select a suitable value for those
> instead - that's what DPDK or other userspace would use nearly all the
> time.

My feeling is that qemu should be dealing with the host != target
case, not the kernel.

The kernel's job should be to expose the IOMMU HW it has, with all
features accessible, to userspace.

Qemu's job should be to have a userspace driver for each kernel IOMMU
and the internal infrastructure to make accelerated emulations for all
supported target IOMMUs.

In other words, it is not the kernel's job to provide target IOMMU
emulation.

The kernel should provide a truly generic "works everywhere" interface
that qemu/etc can rely on to implement the least accelerated emulation
path.

So when I see proposals to have "generic" interfaces that actually
require very HW specific setup, and cannot be used by a generic qemu
userspace driver, I think it breaks this model. If qemu needs to know
it is on PPC (as it does today with VFIO's PPC specific API) then it
may as well speak PPC specific language and forget about pretending to
be generic.

This approach is grounded in 15 years of trying to build these
user/kernel split HW subsystems (particularly RDMA) where it has
become painfully obvious that the kernel is the worst place to try and
wrangle really divergent HW into a "common" uAPI.

This is because the kernel/user boundary is fixed. Introducing
anything generic here requires a lot of time, thought, arguing and
risk. Usually it ends up being done wrong (like the PPC specific
ioctls, for instance) and when this happens we can't learn and adapt,
we are stuck with stable uABI forever.

Exposing a device's native programming interface is much simpler. Each
device is fixed, defined and someone can sit down and figure out how
to expose it. Then that is it, it doesn't need revisiting, it doesn't
need harmonizing with a future slightly different device, it just
stays as is.

The cost is that there must be a userspace driver component for each
HW piece - which we are already paying here!

> Ideally the host /dev/iommu will say "ok!", since both those ranges
> are within the 0..2^60 translated range of the host IOMMU, and don't
> touch the IO hole. When the guest calls the IO mapping hypercalls,
> qemu translates those into DMA_MAP operations, and since they're all
> within the previously verified windows, they should work fine.

For instance, we are going to see HW with nested page tables, user
space owned page tables and even kernel-bypass fast IOTLB
invalidation. In that world does it even make sense for qemu to use
slow DMA_MAP ioctls for emulation?

A userspace framework in qemu can make these optimizations and is also
necessarily HW specific, as the host page table is HW specific.

Jason