On Tue, May 10, 2022 at 04:00:09PM -0300, Jason Gunthorpe wrote:
> On Tue, May 10, 2022 at 05:12:04PM +1000, David Gibson wrote:
> > On Mon, May 09, 2022 at 11:00:41AM -0300, Jason Gunthorpe wrote:
> > > On Mon, May 09, 2022 at 04:01:52PM +1000, David Gibson wrote:
> > >
> > > > > The default iommu_domain that the iommu driver creates will be used
> > > > > here, it is up to the iommu driver to choose something reasonable for
> > > > > use by applications like DPDK. ie PPC should probably pick its biggest
> > > > > x86-like aperture.
> > > >
> > > > So, using the big aperture means a very high base IOVA
> > > > (1<<59)... which means that it won't work at all if you want to attach
> > > > any devices that aren't capable of 64-bit DMA.
> > >
> > > I'd expect to include the 32 bit window too..
> >
> > I'm not entirely sure what you mean. Are you working on the
> > assumption that we've extended to allowing multiple apertures, so we'd
> > default to advertising both a small/low aperture and a large/high
> > aperture?
>
> Yes

Ok, that works assuming we can advertise multiple windows.

> > > No, this just makes it fragile in the other direction because now
> > > userspace has to know what platform specific things to ask for *or it
> > > doesn't work at all*. This is not an improvement for the DPDK cases.
> >
> > Um.. no. The idea is that userspace requests *what it needs*, not
> > anything platform specific. In the case of DPDK that would be nothing
> > more than the (minimum) aperture size. Nothing platform specific
> > about that.
>
> Except a 32 bit platform can only maybe do a < 4G aperture, a 64 bit
> platform can do more, but it varies how much more, etc.
>
> There is no constant value DPDK could stuff in this request, unless it
> needs a really small amount of IOVA, like 1G or something.

Well, my assumption was that DPDK always wanted an IOVA window to
cover its hugepage buffer space. So not "constant" exactly, but a
value it will know at start up time. But I think we cover that more
closely below.

> > > It isn't like there is some hard coded value we can put into DPDK that
> > > will work on every platform. So kernel must pick for DPDK, IMHO. I
> > > don't see any feasible alternative.
> >
> > Yes, hence *optionally specified* base address only.
>
> Okay, so imagine we've already done this and DPDK is not optionally
> specifying anything :)
>
> The structs can be extended so we can add this as an input to creation
> when a driver can implement it.
>
> > > The ppc specific driver would be on the generic side of qemu in its
> > > viommu support framework. There is lots of host driver optimization
> > > possible here with knowledge of the underlying host iommu HW. It
> > > should not be connected to the qemu target.
> >
> > Thinking through this...
> >
> > So, I guess we could have basically the same logic I'm suggesting be
> > in the qemu backend iommu driver instead. So the target side (machine
> > type, strictly speaking) would request of the host side the apertures
> > it needs, and the host side driver would see if it can do that, based
> > on both specific knowledge of that driver and the query responses.
>
> Yes, this is what I'm thinking
>
> > ppc on x86 should work with that.. at least if the x86 aperture is
> > large enough to reach up to ppc's high window. I guess we'd have the
> > option here of using either the generic host driver or the
> > x86-specific driver. The latter would mean qemu maintaining an
> > x86-format shadow of the io pagetables; mildly tedious, but doable.
>
> The appeal of having userspace page tables is performance, so it is
> tedious to shadow, but it should run faster.

I doubt the difference is meaningful in the context of an emulated
guest, though.
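To put some flesh on what I mean by the generic path: the shadowing
qemu would do on each guest IOMMU map/unmap event is roughly the
below. Sketch only - I'm writing it against the uapi names from your
series (<linux/iommufd.h>), so names and fields may have drifted, and
the caller is assumed to have already done the usual guest-physical to
host-virtual translation:

/* Shadow one guest IOMMU mapping into the IOAS with plain MAP/UNMAP.
 * Sketch against the series' uapi header; struct layouts are from
 * memory and may not match exactly.
 */
#include <stdbool.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/iommufd.h>

static int shadow_guest_map(int iommufd, uint32_t ioas_id, uint64_t giova,
                            void *hva, uint64_t len, bool writeable)
{
        struct iommu_ioas_map map = {
                .size = sizeof(map),
                /* guest IOVA is used 1:1 as host IOVA, hence FIXED_IOVA */
                .flags = IOMMU_IOAS_MAP_FIXED_IOVA | IOMMU_IOAS_MAP_READABLE |
                         (writeable ? IOMMU_IOAS_MAP_WRITEABLE : 0),
                .ioas_id = ioas_id,
                .user_va = (uintptr_t)hva,  /* hva = already-translated guest PA */
                .length = len,
                .iova = giova,
        };

        return ioctl(iommufd, IOMMU_IOAS_MAP, &map);
}

static int shadow_guest_unmap(int iommufd, uint32_t ioas_id,
                              uint64_t giova, uint64_t len)
{
        struct iommu_ioas_unmap unmap = {
                .size = sizeof(unmap),
                .ioas_id = ioas_id,
                .iova = giova,
                .length = len,
        };

        return ioctl(iommufd, IOMMU_IOAS_UNMAP, &unmap);
}

That's one ioctl per guest map or invalidate, which is certainly
slower than handing the hardware a pagetable format it can walk
natively, but it's dwarfed by the cost of trapping out of the guest in
the first place.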
> > So... is there any way of backing out of this gracefully. We could
> > detach the device, but in the meantime ongoing DMA maps from
> > previous devices might have failed.
>
> This sounds like a good use case for qemu to communicate ranges - but
> as I mentioned before Alex said qemu didn't know the ranges..

Yeah, I'm a bit baffled by that, and I don't know the context. Note
that there are at least two very different users of the host IOMMU
backends in qemu: one is for emulation of guest DMA (with or without a
vIOMMU). In that case the details of the guest platform should let
qemu know the ranges. There's also a VFIO based NVME backend; that
one's much more like a "normal" userspace driver, where it doesn't
care about the address ranges (because they're not guest visible).

> > We could pre-attach the new device to a new IOAS and check the
> > apertures there - but when we move it to the final IOAS is it
> > guaranteed that the apertures will be (at least) the intersection of
> > the old and new apertures, or is that just the probable outcome.
>
> Should be guaranteed

Ok; that would need to be documented.

> > Ok.. you convinced me. As long as we have some way to handle the
> > device hotplug case, we can work with this.
>
> I like the communicate ranges for hotplug, so long as we can actually
> implement it in qemu - I'm a bit unclear on that honestly.

> > Ok, I see. That can certainly be done. I was really hoping we could
> > have a working, though non-optimized, implementation using just the
> > generic interface.
>
> Oh, sure that should largely work as well too, this is just an
> additional direction people may find interesting and helps explain why
> qemu should have an iommu layer inside.

> > "holes" versus "windows". We can choose either one; I think "windows"
> > rather than "holes" makes more sense, but it doesn't really matter.
>
> Yes, I picked windows aka ranges for the uAPI - we translate the holes
> from the groups into windows and intersect them with the apertures.

Ok.

> > > > Another approach would be to give the required apertures / pagesizes
> > > > in the initial creation of the domain/IOAS. In that case they would
> > > > be static for the IOAS, as well as the underlying iommu_domains: any
> > > > ATTACH which would be incompatible would fail.
> > >
> > > This is the device-specific iommu_domain creation path. The domain can
> > > have information defining its aperture.
> >
> > But that also requires managing the pagetables yourself; I think tying
> > these two concepts together is inflexible.
>
> Oh, no, those need to be independent for HW nesting already
>
> One device-specific creation path will create the kernel owned
> map/unmap iommu_domain with device-specific parameters to allow it to
> be the root of a nest - ie specify S2 on ARM.
>
> The second device-specific creation path will create the user owned
> iommu_domain with device-specific parameters, with the first as a
> parent.
>
> So you get to do both.

Ah! Good to know.

> > > Which is why the current scheme is fully automatic and we rely on the
> > > iommu driver to automatically select something sane for DPDK/etc
> > > today.
> >
> > But the cost is that the allowable addresses can change implicitly
> > with every ATTACH.
>
> Yes, dpdk/etc don't care.

Well... as long as nothing they've already mapped goes away.
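(For an app that does care, the check after an ATTACH is cheap enough:
re-read the allowed ranges and make sure everything it has already
mapped still falls inside one of them. Something like the below - I'm
transcribing the IOVA_RANGES struct from memory of your series, same
shape as your require_iova example further down, so the exact names
and layout may not match:)

/* Sketch: after an ATTACH, check that an existing mapping is still
 * inside one of the allowed IOVA ranges. The struct below is a local
 * copy of what I remember IOMMUFD_CMD_IOAS_IOVA_RANGES looking like in
 * the series; the real definition and the IOMMU_IOAS_IOVA_RANGES ioctl
 * number come from the series' uapi header.
 */
#include <stdbool.h>
#include <sys/ioctl.h>
#include <linux/types.h>
#include <linux/iommufd.h>

#define MAX_RANGES 32

struct iova_ranges_buf {
        __u32 size;
        __u32 ioas_id;
        __u32 out_num_iovas;
        __u32 __reserved;
        struct {
                __aligned_u64 start;
                __aligned_u64 last;
        } out_valid_iovas[MAX_RANGES];
};

static bool mapping_still_allowed(int iommufd, __u32 ioas_id,
                                  __u64 map_start, __u64 map_last)
{
        struct iova_ranges_buf buf = {
                .size = sizeof(buf),
                .ioas_id = ioas_id,
        };
        __u32 i, n;

        if (ioctl(iommufd, IOMMU_IOAS_IOVA_RANGES, &buf) < 0)
                return false;

        n = buf.out_num_iovas < MAX_RANGES ? buf.out_num_iovas : MAX_RANGES;
        for (i = 0; i < n; i++)
                if (map_start >= buf.out_valid_iovas[i].start &&
                    map_last <= buf.out_valid_iovas[i].last)
                        return true;
        return false;
}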
> > I see the problem if you have an app where there's a difference
> > between the smallest window it can cope with versus the largest window
> > it can take advantage of. Not sure if that's likely in practice.
> >
> > AFAIK, DPDK will always require its hugepage memory pool mapped, can't
> > deal with less, can't benefit from more. But maybe there's some use
> > case for this I haven't thought of.
>
> Other apps I've seen don't even have a fixed memory pool, they just
> malloc and can't really predict how much IOVA they
> need. "approximately the same amount as a process VA" is a reasonable
> goal for the kernel to default to.

Hm, ok, I guess that makes sense.

> A tunable to allow efficiency from smaller allocations sounds great -
> but let's have driver support before adding the uAPI for
> it. Intel/AMD/ARM support to have fewer page table levels for instance
> would be perfect.

> > Ok... here's a revised version of my proposal which I think addresses
> > your concerns and simplifies things.
> >
> > - No new operations, but IOAS_MAP gets some new flags (and IOAS_COPY
> >   will probably need matching changes)
> >
> > - By default the IOVA given to IOAS_MAP is a hint only, and the IOVA
> >   is chosen by the kernel within the aperture(s). This is closer to
> >   how mmap() operates, and DPDK and similar shouldn't care about
> >   having specific IOVAs, even at the individual mapping level.
> >
> > - IOAS_MAP gets an IOMAP_FIXED flag, analogous to mmap()'s MAP_FIXED,
> >   for when you really do want to control the IOVA (qemu, maybe some
> >   special userspace driver cases)
>
> We already did both of these, the flag is called
> IOMMU_IOAS_MAP_FIXED_IOVA - if it is not specified then kernel will
> select the IOVA internally.

Ok, great.

> > - ATTACH will fail if the new device would shrink the aperture to
> >   exclude any already established mappings (I assume this is already
> >   the case)
>
> Yes

Good to know.

> > - IOAS_MAP gets an IOMAP_RESERVE flag, which operates a bit like a
> >   PROT_NONE mmap(). It reserves that IOVA space, so other (non-FIXED)
> >   MAPs won't use it, but doesn't actually put anything into the IO
> >   pagetables.
> >
> > - Like a regular mapping, ATTACHes that are incompatible with an
> >   IOMAP_RESERVEed region will fail
> >
> > - An IOMAP_RESERVEed area can be overmapped with an IOMAP_FIXED
> >   mapping
>
> Yeah, this seems OK, I'm thinking a new API might make sense because
> you don't really want mmap replacement semantics but a permanent
> record of what IOVA must always be valid.
>
> IOMMU_IOAS_REQUIRE_IOVA perhaps, similar signature to
> IOMMUFD_CMD_IOAS_IOVA_RANGES:
>
> struct iommu_ioas_require_iova {
>         __u32 size;
>         __u32 ioas_id;
>         __u32 num_iovas;
>         __u32 __reserved;
>         struct iommu_required_iovas {
>                 __aligned_u64 start;
>                 __aligned_u64 last;
>         } required_iovas[];
> };

Sounds reasonable.

> > So, for DPDK the sequence would be:
> >
> > 1. Create IOAS
> > 2. ATTACH devices
> > 3. IOAS_MAP some stuff
> > 4. Do DMA with the IOVAs that IOAS_MAP returned
> >
> > (Note, not even any need for QUERY in simple cases)
>
> Yes, this is done already
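Just to be sure we mean the same thing by "done already", the
userspace side I'm picturing for that sequence is roughly the below.
Sketch only: the attach step is just a comment since its VFIO-side
ioctl has moved around between revisions, struct names are from your
series' uapi header and may have drifted, and error handling is mostly
omitted:

/* DPDK-style flow: create an IOAS, attach the device(s), and let the
 * kernel pick the IOVA on MAP. Assumes the series' uapi header
 * (<linux/iommufd.h>).
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/iommufd.h>

int main(void)
{
        /* Stand-in for the app's buffer pool (real DPDK would use hugepages) */
        size_t pool_size = 2 * 1024 * 1024;
        void *pool = mmap(NULL, pool_size, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        int iommufd = open("/dev/iommu", O_RDWR);

        /* 1. Create IOAS */
        struct iommu_ioas_alloc alloc = { .size = sizeof(alloc) };
        if (ioctl(iommufd, IOMMU_IOAS_ALLOC, &alloc) < 0)
                return 1;

        /* 2. ATTACH devices: bind the VFIO device to the iommufd and
         * attach it to alloc.out_ioas_id (elided here - the exact
         * ioctl name differs between revisions of the series). */

        /* 3. IOAS_MAP the buffer pool. No IOMMU_IOAS_MAP_FIXED_IOVA,
         * so the kernel picks an IOVA inside the aperture(s). */
        struct iommu_ioas_map map = {
                .size = sizeof(map),
                .flags = IOMMU_IOAS_MAP_READABLE | IOMMU_IOAS_MAP_WRITEABLE,
                .ioas_id = alloc.out_ioas_id,
                .user_va = (uintptr_t)pool,
                .length = pool_size,
        };
        if (ioctl(iommufd, IOMMU_IOAS_MAP, &map) < 0)
                return 1;

        /* 4. Do DMA using the kernel-chosen IOVA returned in map.iova */
        printf("buffer mapped at IOVA 0x%llx\n",
               (unsigned long long)map.iova);
        return 0;
}

If that matches what's already there, then great - DPDK never needs to
know or care what the aperture actually is.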
> > For (unoptimized) qemu it would be:
> >
> > 1. Create IOAS
> > 2. IOAS_MAP(IOMAP_FIXED|IOMAP_RESERVE) the valid IOVA regions of the
> >    guest platform
> > 3. ATTACH devices (this will fail if they're not compatible with the
> >    reserved IOVA regions)
> > 4. Boot the guest
> >
> > (on guest map/invalidate) -> IOAS_MAP(IOMAP_FIXED) to overmap part of
> >    the reserved regions
> > (on dev hotplug) -> ATTACH (which might fail, if it conflicts with the
> >    reserved regions)
> > (on vIOMMU reconfiguration) -> UNMAP/MAP reserved regions as
> >    necessary (which might fail)
>
> OK, I will take care of it

Hooray! Long contentious thread eventually reaches productive
resolution :).

--
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you. NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson