> From: Jason Gunthorpe <jgg@xxxxxxxxxx> > Sent: Wednesday, September 22, 2021 10:09 PM > > On Wed, Sep 22, 2021 at 03:40:25AM +0000, Tian, Kevin wrote: > > > From: Jason Gunthorpe <jgg@xxxxxxxxxx> > > > Sent: Wednesday, September 22, 2021 1:45 AM > > > > > > On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote: > > > > This patch adds IOASID allocation/free interface per iommufd. When > > > > allocating an IOASID, userspace is expected to specify the type and > > > > format information for the target I/O page table. > > > > > > > > This RFC supports only one type > (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2), > > > > implying a kernel-managed I/O page table with vfio type1v2 mapping > > > > semantics. For this type the user should specify the addr_width of > > > > the I/O address space and whether the I/O page table is created in > > > > an iommu enfore_snoop format. enforce_snoop must be true at this > point, > > > > as the false setting requires additional contract with KVM on handling > > > > WBINVD emulation, which can be added later. > > > > > > > > Userspace is expected to call IOMMU_CHECK_EXTENSION (see next > patch) > > > > for what formats can be specified when allocating an IOASID. > > > > > > > > Open: > > > > - Devices on PPC platform currently use a different iommu driver in vfio. > > > > Per previous discussion they can also use vfio type1v2 as long as there > > > > is a way to claim a specific iova range from a system-wide address > space. > > > > This requirement doesn't sound PPC specific, as addr_width for pci > > > devices > > > > can be also represented by a range [0, 2^addr_width-1]. This RFC > hasn't > > > > adopted this design yet. We hope to have formal alignment in v1 > > > discussion > > > > and then decide how to incorporate it in v2. > > > > > > I think the request was to include a start/end IO address hint when > > > creating the ios. When the kernel creates it then it can return the > > > > is the hint single-range or could be multiple-ranges? > > David explained it here: > > https://lore.kernel.org/kvm/YMrKksUeNW%2FPEGPM@yekko/ > > qeumu needs to be able to chooose if it gets the 32 bit range or 64 > bit range. > > So a 'range hint' will do the job > > David also suggested this: > > https://lore.kernel.org/kvm/YL6%2FbjHyuHJTn4Rd@yekko/ > > So I like this better: > > struct iommu_ioasid_alloc { > __u32 argsz; > > __u32 flags; > #define IOMMU_IOASID_ENFORCE_SNOOP (1 << 0) > #define IOMMU_IOASID_HINT_BASE_IOVA (1 << 1) > > __aligned_u64 max_iova_hint; > __aligned_u64 base_iova_hint; // Used only if > IOMMU_IOASID_HINT_BASE_IOVA > > // For creating nested page tables > __u32 parent_ios_id; > __u32 format; > #define IOMMU_FORMAT_KERNEL 0 > #define IOMMU_FORMAT_PPC_XXX 2 > #define IOMMU_FORMAT_[..] > u32 format_flags; // Layout depends on format above > > __aligned_u64 user_page_directory; // Used if parent_ios_id != 0 > }; > > Again 'type' as an overall API indicator should not exist, feature > flags need to have clear narrow meanings. currently the type is aimed to differentiate three usages: - kernel-managed I/O page table - user-managed I/O page table - shared I/O page table (e.g. with mm, or ept) we can remove 'type', but is FORMAT_KENREL/USER/SHARED a good indicator? their difference is not about format. > > This does both of David's suggestions at once. If quemu wants the 1G > limited region it could specify max_iova_hint = 1G, if it wants the > extend 64bit region with the hole it can give either the high base or > a large max_iova_hint. format/format_flags allows a further Dave's links didn't answer one puzzle from me. Does PPC needs accurate range information or be ok with a large range including holes (then let the kernel to figure out where the holes locate)? > device-specific escape if more specific customization is needed and is > needed to specify user space page tables anyhow. and I didn't understand the 2nd link. How does user-managed page table jump into this range claim problem? I'm getting confused... > > > > ioas works well here I think. Use ioas_id to refer to the xarray > > > index. > > > > What about when introducing pasid to this uAPI? Then use ioas_id > > for the xarray index > > Yes, ioas_id should always be the xarray index. > > PASID needs to be called out as PASID or as a generic "hw description" > blob. ARM doesn't use PASID. So we need a generic blob, e.g. ioas_hwid? and still we have both ioas_id (iommufd) and ioasid (ioasid.c) in the kernel. Do we want to clear this confusion? Or possibly it's fine because ioas_id is never used outside of iommufd and iommufd doesn't directly call ioasid_alloc() from ioasid.c? > > kvm's API to program the vPASID translation table should probably take > in a (iommufd,ioas_id,device_id) tuple and extract the IOMMU side > information using an in-kernel API. Userspace shouldn't have to > shuttle it around. the vPASID info is carried in VFIO_DEVICE_ATTACH_IOASID uAPI. when kvm calls iommufd with above tuple, vPASID->pPASID is returned to kvm. So we still need a generic blob to represent vPASID in the uAPI. > > I'm starting to feel like the struct approach for describing this uAPI > might not scale well, but lets see.. > > Jason