RE: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility

"Tian, Kevin" <kevin.tian@xxxxxxxxx> · Wed, 11 May 2022 23:23:22 +0000

> From: Jason Gunthorpe <jgg@xxxxxxxxxx>
> Sent: Thursday, May 12, 2022 12:32 AM
> 
> On Wed, May 11, 2022 at 03:15:22AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@xxxxxxxxxx>
> > > Sent: Wednesday, May 11, 2022 3:00 AM
> > >
> > > On Tue, May 10, 2022 at 05:12:04PM +1000, David Gibson wrote:
> > > > Ok... here's a revised version of my proposal which I think addresses
> > > > your concerns and simplfies things.
> > > >
> > > > - No new operations, but IOAS_MAP gets some new flags (and
> IOAS_COPY
> > > >   will probably need matching changes)
> > > >
> > > > - By default the IOVA given to IOAS_MAP is a hint only, and the IOVA
> > > >   is chosen by the kernel within the aperture(s).  This is closer to
> > > >   how mmap() operates, and DPDK and similar shouldn't care about
> > > >   having specific IOVAs, even at the individual mapping level.
> > > >
> > > > - IOAS_MAP gets an IOMAP_FIXED flag, analagous to mmap()'s
> MAP_FIXED,
> > > >   for when you really do want to control the IOVA (qemu, maybe some
> > > >   special userspace driver cases)
> > >
> > > We already did both of these, the flag is called
> > > IOMMU_IOAS_MAP_FIXED_IOVA - if it is not specified then kernel will
> > > select the IOVA internally.
> > >
> > > > - ATTACH will fail if the new device would shrink the aperture to
> > > >   exclude any already established mappings (I assume this is already
> > > >   the case)
> > >
> > > Yes
> > >
> > > > - IOAS_MAP gets an IOMAP_RESERVE flag, which operates a bit like a
> > > >   PROT_NONE mmap().  It reserves that IOVA space, so other (non-
> FIXED)
> > > >   MAPs won't use it, but doesn't actually put anything into the IO
> > > >   pagetables.
> > > >     - Like a regular mapping, ATTACHes that are incompatible with an
> > > >       IOMAP_RESERVEed region will fail
> > > >     - An IOMAP_RESERVEed area can be overmapped with an
> IOMAP_FIXED
> > > >       mapping
> > >
> > > Yeah, this seems OK, I'm thinking a new API might make sense because
> > > you don't really want mmap replacement semantics but a permanent
> > > record of what IOVA must always be valid.
> > >
> > > IOMMU_IOA_REQUIRE_IOVA perhaps, similar signature to
> > > IOMMUFD_CMD_IOAS_IOVA_RANGES:
> > >
> > > struct iommu_ioas_require_iova {
> > >         __u32 size;
> > >         __u32 ioas_id;
> > >         __u32 num_iovas;
> > >         __u32 __reserved;
> > >         struct iommu_required_iovas {
> > >                 __aligned_u64 start;
> > >                 __aligned_u64 last;
> > >         } required_iovas[];
> > > };
> >
> > As a permanent record do we want to enforce that once the required
> > range list is set all FIXED and non-FIXED allocations must be within the
> > list of ranges?
> 
> No, I would just use this as a guarntee that going forward any
> get_ranges will always return ranges that cover the listed required
> ranges. Ie any narrowing of the ranges will be refused.
> 
> map/unmap should only be restricted to the get_ranges output.
> 
> Wouldn't burn CPU cycles to nanny userspace here.

fair enough.

> 
> > If yes we can take the end of the last range as the max size of the iova
> > address space to optimize the page table layout.
> 
> I think this API should not interact with the driver. Its only job is
> to prevent devices from attaching that would narrow the ranges.
> 
> If we also use it to adjust the aperture of the created iommu_domain
> then it looses its usefullness as guard since something like qemu
> would have to leave room for hotplug as well.
> 
> I suppose optimizing the created iommu_domains should be some other
> API, with a different set of ranges and the '# of bytes of IOVA' hint
> as well.

make sense.

> 
> > > > For (unoptimized) qemu it would be:
> > > >
> > > > 1. Create IOAS
> > > > 2. IOAS_MAP(IOMAP_FIXED|IOMAP_RESERVE) the valid IOVA regions of
> > > the
> > > >    guest platform
> > > > 3. ATTACH devices (this will fail if they're not compatible with the
> > > >    reserved IOVA regions)
> > > > 4. Boot the guest
> >
> > I suppose above is only the sample flow for PPC vIOMMU. For non-PPC
> > vIOMMUs regular mappings are required before booting the guest and
> > reservation might be done but not mandatory (at least not what current
> > Qemu vfio can afford as it simply replays valid ranges in the CPU address
> > space).
> 
> I think qemu can always do it, it feels like it would simplify error
> cases around aperture mismatches.
> 

It could, but require more changes in Qemu to define required ranges
in platform logic and then convey it from Qemu address space to VFIO.
I view it as an optimization hence not necessarily to be done immediately.

Thanks
Kevin