On Tue, Apr 27, 2021 at 02:24:32PM -0300, Jason Gunthorpe wrote:
> On Tue, Apr 27, 2021 at 02:50:45PM +1000, David Gibson wrote:
>
> > > > I say this because the SPAPR looks quite a lot like PASID when it has
> > > > APIs for allocating multiple tables and other things. I would be
> > > > interested to hear someone from IBM talk about what it is doing and
> > > > how it doesn't fit into today's IOMMU API.
> >
> > Hm. I don't think it's really like PASID. Just like Type1, the TCE
> > backend represents a single DMA address space which all devices in the
> > container will see at all times. The difference is that there can be
> > multiple (well, 2) "windows" of valid IOVAs within that address space.
> > Each window can have a different TCE (page table) layout. For kernel
> > drivers, a smallish translated window at IOVA 0 is used for 32-bit
> > devices, and a large direct mapped (no page table) window is created
> > at a high IOVA for better performance with 64-bit DMA capable devices.
> >
> > With the VFIO backend we create (but don't populate) a similar
> > smallish 32-bit window; userspace can create its own secondary window
> > if it likes, though obviously for userspace use there will always be a
> > page table. Userspace can choose the total size (but not address),
> > page size and to an extent the page table format of the created
> > window. Note that the TCE page table format is *not* the same as the
> > POWER CPU core's page table format. Userspace can also remove the
> > default small window and create its own.
>
> So what do you need from the generic API? I'd suggest if userspace
> passes in the required IOVA range it would benefit all the IOMMU
> drivers to setup properly sized page tables and PPC could use that to
> drive a single window. I notice this is all DPDK did to support TCE.

Yes. My proposed model for a unified interface would be that when you
create a new container/IOASID, *no* IOVAs are valid. Before you can
map anything you would have to create a window with specified base,
size, pagesize (probably some flags for extension, too). That could
fail if the backend IOMMU can't handle that IOVA range, it could be a
backend no-op if the requested window lies within a fixed IOVA range
the backend supports, or it could actually reprogram the backend for
the new window (such as for POWER TCEs). Regardless of the hardware,
attempts to map outside the created window(s) would be rejected by
software.

I expect we'd need some kind of query operation to expose limitations
on the number of windows, addresses for them, available pagesizes etc.

> > The second wrinkle is pre-registration. That lets userspace register
> > certain userspace VA ranges (*not* IOVA ranges) as being the only ones
> > allowed to be mapped into the IOMMU. This is a performance
> > optimization, because on pre-registration we also pre-account memory
> > that will be effectively locked by DMA mappings, rather than doing it
> > at DMA map and unmap time.
>
> This feels like nesting IOASIDs to me, much like a vPASID.
>
> The pre-registered VA range would be the root of the tree and the
> vIOMMU created ones would be children of the tree. This could allow
> the map operations of the child to refer to already prepped physical
> memory held in the root IOASID avoiding the GUP/etc cost.

Huh... I never thought of it that way, but yeah, that sounds like it
could work. More elegantly than the current system in fact.
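To make that a bit more concrete, here's a very rough userspace sketch
of how I picture the explicit-window model combined with that
parent/child arrangement. None of these structs, ioctls or numbers
exist anywhere - they're entirely made up just to illustrate the flow:

#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>

struct ioasid_create_window {           /* hypothetical */
        uint64_t iova_base;             /* base of the new IOVA window */
        uint64_t size;                  /* window size in bytes */
        uint64_t pagesize;              /* IOMMU page size for this window */
        uint32_t flags;
        uint32_t pad;
};

struct ioasid_map {                     /* hypothetical */
        uint64_t iova;                  /* must lie inside a created window */
        uint64_t vaddr;                 /* process VA to map */
        uint64_t length;
        int32_t  parent_fd;             /* < 0: GUP from process VA;
                                           >= 0: reuse pages already prepped
                                           in that parent (root) IOASID */
        uint32_t pad;
};

/* made-up request numbers, purely for illustration */
#define IOASID_CREATE_WINDOW    _IOW('I', 0x40, struct ioasid_create_window)
#define IOASID_MAP              _IOW('I', 0x41, struct ioasid_map)

static int example(void *buf, size_t len)
{
        struct ioasid_create_window win = {
                .iova_base = 0,
                .size      = 1ULL << 30,        /* e.g. a 1GiB 32-bit window */
                .pagesize  = 64 * 1024,         /* 64kiB IOMMU pages */
        };
        struct ioasid_map map = {
                .iova   = 0,
                .vaddr  = (uintptr_t)buf,
                .length = len,
        };
        int parent, child;

        /* Root IOASID: VA ranges pre-registered here get accounted and
         * prepped once, up front. */
        parent = open("/dev/ioasid", O_RDWR);
        /* ... pre-register [buf, buf+len) into 'parent' here ... */

        /* Child IOASID: starts with *no* valid IOVAs at all. */
        child = open("/dev/ioasid", O_RDWR);

        /* Carve out a window.  Depending on the backend this is a no-op,
         * a real (re)programming of the hardware (e.g. a new TCE table),
         * or a failure if the range simply can't be done. */
        if (ioctl(child, IOASID_CREATE_WINDOW, &win) < 0)
                return -1;

        /* Maps outside any created window are rejected in software;
         * maps inside it can lean on memory already prepped in 'parent'
         * instead of GUPing it again. */
        map.parent_fd = parent;
        return ioctl(child, IOASID_MAP, &map);
}

Obviously the details would shake out differently, but the properties I
care about are that nothing is mappable until a window exists, and that
the child's maps can reuse memory already prepped in the parent.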
> Seems fairly genericish, though I'm not sure about the kvm linkage..

I think it should be doable. We'd basically need to give KVM a handle
on the parent AS, the child AS, and the guest side handle (what PAPR
calls a "Logical IO Bus Number" - liobn). KVM would then translate
H_PUT_TCE etc. hypercalls on that liobn into calls into the IOMMU
subsystem to map bits of the parent AS into the child.

We'd probably have to have some requirements that either the parent AS
is identity-mapped to a subset of the userspace AS (effectively what we
have now) or that the parent AS is the same as the guest physical
address space. Not sure which would work better.

> > I like the idea of a common DMA/IOMMU handling system across
> > platforms. However in order to be efficiently usable for POWER it
> > will need to include multiple windows, allowing the user to change
> > those windows and something like pre-registration to amortize
> > accounting costs for heavy vIOMMU load.
>
> I have a feeling /dev/ioasid is going to end up with some HW specific
> escape hatch to create some HW specific IOASID types and operate on
> them in a HW specific way.
>
> However, what I would like to see is that something simple like DPDK
> can have a single implementation - POWER should implement the standard
> operations and map them to something that will work for it.
>
> As an ideal, only things like the HW specific qemu vIOMMU driver
> should be reaching for all the special stuff.

I'm hoping we can even avoid that, usually. With the explicitly
created windows model I propose above, it should be possible: qemu
will create the windows according to the IOVA windows the guest
platform expects to see, and they either will or won't work on the
host platform IOMMU. If they do, generic maps/unmaps should be
sufficient. If they don't, well, the host IOMMU simply cannot emulate
the vIOMMU so you're out of luck anyway.

> In this way the kernel IOMMU driver and the qemu user vIOMMU driver
> would form something of a classical split user/kernel driver pattern.
>
> Jason

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson