On Wed, May 11, 2022 at 03:15:22AM +0000, Tian, Kevin wrote: > > From: Jason Gunthorpe <jgg@xxxxxxxxxx> > > Sent: Wednesday, May 11, 2022 3:00 AM > > > > On Tue, May 10, 2022 at 05:12:04PM +1000, David Gibson wrote: > > > Ok... here's a revised version of my proposal which I think addresses > > > your concerns and simplfies things. > > > > > > - No new operations, but IOAS_MAP gets some new flags (and IOAS_COPY > > > will probably need matching changes) > > > > > > - By default the IOVA given to IOAS_MAP is a hint only, and the IOVA > > > is chosen by the kernel within the aperture(s). This is closer to > > > how mmap() operates, and DPDK and similar shouldn't care about > > > having specific IOVAs, even at the individual mapping level. > > > > > > - IOAS_MAP gets an IOMAP_FIXED flag, analagous to mmap()'s MAP_FIXED, > > > for when you really do want to control the IOVA (qemu, maybe some > > > special userspace driver cases) > > > > We already did both of these, the flag is called > > IOMMU_IOAS_MAP_FIXED_IOVA - if it is not specified then kernel will > > select the IOVA internally. > > > > > - ATTACH will fail if the new device would shrink the aperture to > > > exclude any already established mappings (I assume this is already > > > the case) > > > > Yes > > > > > - IOAS_MAP gets an IOMAP_RESERVE flag, which operates a bit like a > > > PROT_NONE mmap(). It reserves that IOVA space, so other (non-FIXED) > > > MAPs won't use it, but doesn't actually put anything into the IO > > > pagetables. > > > - Like a regular mapping, ATTACHes that are incompatible with an > > > IOMAP_RESERVEed region will fail > > > - An IOMAP_RESERVEed area can be overmapped with an IOMAP_FIXED > > > mapping > > > > Yeah, this seems OK, I'm thinking a new API might make sense because > > you don't really want mmap replacement semantics but a permanent > > record of what IOVA must always be valid. > > > > IOMMU_IOA_REQUIRE_IOVA perhaps, similar signature to > > IOMMUFD_CMD_IOAS_IOVA_RANGES: > > > > struct iommu_ioas_require_iova { > > __u32 size; > > __u32 ioas_id; > > __u32 num_iovas; > > __u32 __reserved; > > struct iommu_required_iovas { > > __aligned_u64 start; > > __aligned_u64 last; > > } required_iovas[]; > > }; > > As a permanent record do we want to enforce that once the required > range list is set all FIXED and non-FIXED allocations must be within the > list of ranges? No, I would just use this as a guarntee that going forward any get_ranges will always return ranges that cover the listed required ranges. Ie any narrowing of the ranges will be refused. map/unmap should only be restricted to the get_ranges output. Wouldn't burn CPU cycles to nanny userspace here. > If yes we can take the end of the last range as the max size of the iova > address space to optimize the page table layout. I think this API should not interact with the driver. Its only job is to prevent devices from attaching that would narrow the ranges. If we also use it to adjust the aperture of the created iommu_domain then it looses its usefullness as guard since something like qemu would have to leave room for hotplug as well. I suppose optimizing the created iommu_domains should be some other API, with a different set of ranges and the '# of bytes of IOVA' hint as well. > > > For (unoptimized) qemu it would be: > > > > > > 1. Create IOAS > > > 2. IOAS_MAP(IOMAP_FIXED|IOMAP_RESERVE) the valid IOVA regions of > > the > > > guest platform > > > 3. ATTACH devices (this will fail if they're not compatible with the > > > reserved IOVA regions) > > > 4. Boot the guest > > I suppose above is only the sample flow for PPC vIOMMU. For non-PPC > vIOMMUs regular mappings are required before booting the guest and > reservation might be done but not mandatory (at least not what current > Qemu vfio can afford as it simply replays valid ranges in the CPU address > space). I think qemu can always do it, it feels like it would simplify error cases around aperture mismatches. Jason