> From: Jason Gunthorpe <jgg@xxxxxxxxxx>
> Sent: Thursday, May 12, 2022 12:32 AM
>
> On Wed, May 11, 2022 at 03:15:22AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@xxxxxxxxxx>
> > > Sent: Wednesday, May 11, 2022 3:00 AM
> > >
> > > On Tue, May 10, 2022 at 05:12:04PM +1000, David Gibson wrote:
> > > > Ok... here's a revised version of my proposal which I think addresses
> > > > your concerns and simplifies things.
> > > >
> > > > - No new operations, but IOAS_MAP gets some new flags (and IOAS_COPY
> > > >   will probably need matching changes)
> > > >
> > > > - By default the IOVA given to IOAS_MAP is a hint only, and the IOVA
> > > >   is chosen by the kernel within the aperture(s). This is closer to
> > > >   how mmap() operates, and DPDK and similar shouldn't care about
> > > >   having specific IOVAs, even at the individual mapping level.
> > > >
> > > > - IOAS_MAP gets an IOMAP_FIXED flag, analogous to mmap()'s MAP_FIXED,
> > > >   for when you really do want to control the IOVA (qemu, maybe some
> > > >   special userspace driver cases)
> > >
> > > We already did both of these, the flag is called
> > > IOMMU_IOAS_MAP_FIXED_IOVA - if it is not specified then the kernel will
> > > select the IOVA internally.
> > >
> > > > - ATTACH will fail if the new device would shrink the aperture to
> > > >   exclude any already established mappings (I assume this is already
> > > >   the case)
> > >
> > > Yes
> > >
> > > > - IOAS_MAP gets an IOMAP_RESERVE flag, which operates a bit like a
> > > >   PROT_NONE mmap(). It reserves that IOVA space, so other (non-FIXED)
> > > >   MAPs won't use it, but doesn't actually put anything into the IO
> > > >   pagetables.
> > > > - Like a regular mapping, ATTACHes that are incompatible with an
> > > >   IOMAP_RESERVEed region will fail
> > > >
> > > > - An IOMAP_RESERVEed area can be overmapped with an IOMAP_FIXED
> > > >   mapping
> > >
> > > Yeah, this seems OK, I'm thinking a new API might make sense because
> > > you don't really want mmap replacement semantics but a permanent
> > > record of what IOVA must always be valid.
> > >
> > > IOMMU_IOA_REQUIRE_IOVA perhaps, similar signature to
> > > IOMMUFD_CMD_IOAS_IOVA_RANGES:
> > >
> > > struct iommu_ioas_require_iova {
> > >         __u32 size;
> > >         __u32 ioas_id;
> > >         __u32 num_iovas;
> > >         __u32 __reserved;
> > >         struct iommu_required_iovas {
> > >                 __aligned_u64 start;
> > >                 __aligned_u64 last;
> > >         } required_iovas[];
> > > };
> >
> > As a permanent record do we want to enforce that once the required
> > range list is set all FIXED and non-FIXED allocations must be within the
> > list of ranges?
>
> No, I would just use this as a guarantee that going forward any
> get_ranges will always return ranges that cover the listed required
> ranges. Ie any narrowing of the ranges will be refused.
>
> map/unmap should only be restricted to the get_ranges output.
>
> Wouldn't burn CPU cycles to nanny userspace here.

Fair enough.

> > If yes we can take the end of the last range as the max size of the iova
> > address space to optimize the page table layout.
>
> I think this API should not interact with the driver. Its only job is
> to prevent devices from attaching that would narrow the ranges.
>
> If we also use it to adjust the aperture of the created iommu_domain
> then it loses its usefulness as a guard since something like qemu
> would have to leave room for hotplug as well.
>
> I suppose optimizing the created iommu_domains should be some other
> API, with a different set of ranges and the '# of bytes of IOVA' hint
> as well.

Makes sense.

> > > > For (unoptimized) qemu it would be:
> > > >
> > > > 1. Create IOAS
> > > > 2. IOAS_MAP(IOMAP_FIXED|IOMAP_RESERVE) the valid IOVA regions of the
> > > >    guest platform
> > > > 3. ATTACH devices (this will fail if they're not compatible with the
> > > >    reserved IOVA regions)
> > > > 4. Boot the guest
> >
> > I suppose the above is only the sample flow for a PPC vIOMMU. For non-PPC
> > vIOMMUs regular mappings are required before booting the guest, and
> > reservation might be done but is not mandatory (at least not what the
> > current Qemu vfio can afford, as it simply replays valid ranges in the
> > CPU address space).
>
> I think qemu can always do it, it feels like it would simplify error
> cases around aperture mismatches.

It could, but that requires more changes in Qemu to define the required
ranges in its platform logic and then convey them from the Qemu address
space to VFIO. I view it as an optimization, hence not something that must
be done immediately.

Thanks
Kevin
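For reference, the rule discussed above ("any narrowing of the ranges will be refused", i.e. an ATTACH fails if the resulting aperture no longer covers the required ranges) can be sketched in plain C. This is illustrative only, not kernel code; `iova_range` and `apertures_cover_required` are made-up names, with `iova_range` mirroring the `start`/`last` (inclusive) layout of the proposed `struct iommu_required_iovas`:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative range type; 'last' is inclusive, matching the proposed
 * struct iommu_required_iovas above. */
struct iova_range {
	uint64_t start;
	uint64_t last;
};

/* Return true if every required range is fully contained in at least one
 * aperture range, i.e. attaching the device whose apertures these are
 * would not narrow the usable IOVA space below what userspace declared
 * as required. A false return corresponds to refusing the ATTACH. */
static bool apertures_cover_required(const struct iova_range *aperture,
				     size_t n_ap,
				     const struct iova_range *required,
				     size_t n_req)
{
	for (size_t i = 0; i < n_req; i++) {
		bool covered = false;

		for (size_t j = 0; j < n_ap; j++) {
			if (aperture[j].start <= required[i].start &&
			    required[i].last <= aperture[j].last) {
				covered = true;
				break;
			}
		}
		if (!covered)
			return false;	/* this ATTACH would be refused */
	}
	return true;
}
```

Note the check is against the device's aperture only; per the discussion above it deliberately does not constrain where FIXED or non-FIXED maps land, and it does not feed back into the created iommu_domain's layout.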