On Fri, 7 Oct 2016 19:10:27 +0200 Auger Eric <eric.auger@xxxxxxxxxx> wrote: > Hi Alex, > > On 06/10/2016 22:42, Alex Williamson wrote: > > On Thu, 6 Oct 2016 14:20:40 -0600 > > Alex Williamson <alex.williamson@xxxxxxxxxx> wrote: > > > >> On Thu, 6 Oct 2016 08:45:31 +0000 > >> Eric Auger <eric.auger@xxxxxxxxxx> wrote: > >> > >>> This patch allows the user-space to retrieve the MSI geometry. The > >>> implementation is based on capability chains, now also added to > >>> VFIO_IOMMU_GET_INFO. > >>> > >>> The returned info comprise: > >>> - whether the MSI IOVA are constrained to a reserved range (x86 case) and > >>> in the positive, the start/end of the aperture, > >>> - or whether the IOVA aperture need to be set by the userspace. In that > >>> case, the size and alignment of the IOVA window to be provided are > >>> returned. > >>> > >>> In case the userspace must provide the IOVA aperture, we currently report > >>> a size/alignment based on all the doorbells registered by the host kernel. > >>> This may exceed the actual needs. > >>> > >>> Signed-off-by: Eric Auger <eric.auger@xxxxxxxxxx> > >>> > >>> --- > >>> v11 -> v11: > >>> - msi_doorbell_pages was renamed msi_doorbell_calc_pages > >>> > >>> v9 -> v10: > >>> - move cap_offset after iova_pgsizes > >>> - replace __u64 alignment by __u32 order > >>> - introduce __u32 flags in vfio_iommu_type1_info_cap_msi_geometry and > >>> fix alignment > >>> - call msi-doorbell API to compute the size/alignment > >>> > >>> v8 -> v9: > >>> - use iommu_msi_supported flag instead of programmable > >>> - replace IOMMU_INFO_REQUIRE_MSI_MAP flag by a more sophisticated > >>> capability chain, reporting the MSI geometry > >>> > >>> v7 -> v8: > >>> - use iommu_domain_msi_geometry > >>> > >>> v6 -> v7: > >>> - remove the computation of the number of IOVA pages to be provisionned. > >>> This number depends on the domain/group/device topology which can > >>> dynamically change. Let's rely instead rely on an arbitrary max depending > >>> on the system > >>> > >>> v4 -> v5: > >>> - move msi_info and ret declaration within the conditional code > >>> > >>> v3 -> v4: > >>> - replace former vfio_domains_require_msi_mapping by > >>> more complex computation of MSI mapping requirements, especially the > >>> number of pages to be provided by the user-space. > >>> - reword patch title > >>> > >>> RFC v1 -> v1: > >>> - derived from > >>> [RFC PATCH 3/6] vfio: Extend iommu-info to return MSIs automap state > >>> - renamed allow_msi_reconfig into require_msi_mapping > >>> - fixed VFIO_IOMMU_GET_INFO > >>> --- > >>> drivers/vfio/vfio_iommu_type1.c | 78 ++++++++++++++++++++++++++++++++++++++++- > >>> include/uapi/linux/vfio.h | 32 ++++++++++++++++- > >>> 2 files changed, 108 insertions(+), 2 deletions(-) > >>> > >>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c > >>> index dc3ee5d..ce5e7eb 100644 > >>> --- a/drivers/vfio/vfio_iommu_type1.c > >>> +++ b/drivers/vfio/vfio_iommu_type1.c > >>> @@ -38,6 +38,8 @@ > >>> #include <linux/workqueue.h> > >>> #include <linux/dma-iommu.h> > >>> #include <linux/msi-doorbell.h> > >>> +#include <linux/irqdomain.h> > >>> +#include <linux/msi.h> > >>> > >>> #define DRIVER_VERSION "0.2" > >>> #define DRIVER_AUTHOR "Alex Williamson <alex.williamson@xxxxxxxxxx>" > >>> @@ -1101,6 +1103,55 @@ static int vfio_domains_have_iommu_cache(struct vfio_iommu *iommu) > >>> return ret; > >>> } > >>> > >>> +static int compute_msi_geometry_caps(struct vfio_iommu *iommu, > >>> + struct vfio_info_cap *caps) > >>> +{ > >>> + struct vfio_iommu_type1_info_cap_msi_geometry *vfio_msi_geometry; > >>> + unsigned long order = __ffs(vfio_pgsize_bitmap(iommu)); > >>> + struct iommu_domain_msi_geometry msi_geometry; > >>> + struct vfio_info_cap_header *header; > >>> + struct vfio_domain *d; > >>> + bool reserved; > >>> + size_t size; > >>> + > >>> + mutex_lock(&iommu->lock); > >>> + /* All domains have same require_msi_map property, pick first */ > >>> + d = list_first_entry(&iommu->domain_list, struct vfio_domain, next); > >>> + iommu_domain_get_attr(d->domain, DOMAIN_ATTR_MSI_GEOMETRY, > >>> + &msi_geometry); > >>> + reserved = !msi_geometry.iommu_msi_supported; > >>> + > >>> + mutex_unlock(&iommu->lock); > >>> + > >>> + size = sizeof(*vfio_msi_geometry); > >>> + header = vfio_info_cap_add(caps, size, > >>> + VFIO_IOMMU_TYPE1_INFO_CAP_MSI_GEOMETRY, 1); > >>> + > >>> + if (IS_ERR(header)) > >>> + return PTR_ERR(header); > >>> + > >>> + vfio_msi_geometry = container_of(header, > >>> + struct vfio_iommu_type1_info_cap_msi_geometry, > >>> + header); > >>> + > >>> + vfio_msi_geometry->flags = reserved; > >> > >> Use the bit flag VFIO_IOMMU_MSI_GEOMETRY_RESERVED > >> > >>> + if (reserved) { > >>> + vfio_msi_geometry->aperture_start = msi_geometry.aperture_start; > >>> + vfio_msi_geometry->aperture_end = msi_geometry.aperture_end; > >> > >> But maybe nobody has set these, did you intend to use > >> iommu_domain_msi_aperture_valid(), which you defined early on but never > >> used? > >> > >>> + return 0; > >>> + } > >>> + > >>> + vfio_msi_geometry->order = order; > >> > >> I'm tempted to suggest that a user could do the same math on their own > >> since we provide the supported bitmap already... could it ever not be > >> the same? > >> > >>> + /* > >>> + * we compute a system-wide requirement based on all the registered > >>> + * doorbells > >>> + */ > >>> + vfio_msi_geometry->size = > >>> + msi_doorbell_calc_pages(order) * ((uint64_t) 1 << order); > >>> + > >>> + return 0; > >>> +} > >>> + > >>> static long vfio_iommu_type1_ioctl(void *iommu_data, > >>> unsigned int cmd, unsigned long arg) > >>> { > >>> @@ -1122,8 +1173,10 @@ static long vfio_iommu_type1_ioctl(void *iommu_data, > >>> } > >>> } else if (cmd == VFIO_IOMMU_GET_INFO) { > >>> struct vfio_iommu_type1_info info; > >>> + struct vfio_info_cap caps = { .buf = NULL, .size = 0 }; > >>> + int ret; > >>> > >>> - minsz = offsetofend(struct vfio_iommu_type1_info, iova_pgsizes); > >>> + minsz = offsetofend(struct vfio_iommu_type1_info, cap_offset); > >>> > >>> if (copy_from_user(&info, (void __user *)arg, minsz)) > >>> return -EFAULT; > >>> @@ -1135,6 +1188,29 @@ static long vfio_iommu_type1_ioctl(void *iommu_data, > >>> > >>> info.iova_pgsizes = vfio_pgsize_bitmap(iommu); > >>> > >>> + ret = compute_msi_geometry_caps(iommu, &caps); > >>> + if (ret) > >>> + return ret; > >>> + > >>> + if (caps.size) { > >>> + info.flags |= VFIO_IOMMU_INFO_CAPS; > >>> + if (info.argsz < sizeof(info) + caps.size) { > >>> + info.argsz = sizeof(info) + caps.size; > >>> + info.cap_offset = 0; > >>> + } else { > >>> + vfio_info_cap_shift(&caps, sizeof(info)); > >>> + if (copy_to_user((void __user *)arg + > >>> + sizeof(info), caps.buf, > >>> + caps.size)) { > >>> + kfree(caps.buf); > >>> + return -EFAULT; > >>> + } > >>> + info.cap_offset = sizeof(info); > >>> + } > >>> + > >>> + kfree(caps.buf); > >>> + } > >>> + > >>> return copy_to_user((void __user *)arg, &info, minsz) ? > >>> -EFAULT : 0; > >>> > >>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h > >>> index 4a9dbc2..8dae013 100644 > >>> --- a/include/uapi/linux/vfio.h > >>> +++ b/include/uapi/linux/vfio.h > >>> @@ -488,7 +488,35 @@ struct vfio_iommu_type1_info { > >>> __u32 argsz; > >>> __u32 flags; > >>> #define VFIO_IOMMU_INFO_PGSIZES (1 << 0) /* supported page sizes info */ > >>> - __u64 iova_pgsizes; /* Bitmap of supported page sizes */ > >>> +#define VFIO_IOMMU_INFO_CAPS (1 << 1) /* Info supports caps */ > >>> + __u64 iova_pgsizes; /* Bitmap of supported page sizes */ > >>> + __u32 __resv; > >>> + __u32 cap_offset; /* Offset within info struct of first cap */ > >>> +}; > >> > >> I understand the padding, but not the ordering. Why not end with > >> padding? > >> > >>> + > >>> +#define VFIO_IOMMU_TYPE1_INFO_CAP_MSI_GEOMETRY 1 > >>> + > >>> +/* > >>> + * The MSI geometry capability allows to report the MSI IOVA geometry: > >>> + * - either the MSI IOVAs are constrained within a reserved IOVA aperture > >>> + * whose boundaries are given by [@aperture_start, @aperture_end]. > >>> + * this is typically the case on x86 host. The userspace is not allowed > >>> + * to map userspace memory at IOVAs intersecting this range using > >>> + * VFIO_IOMMU_MAP_DMA. > >>> + * - or the MSI IOVAs are not requested to belong to any reserved range; > >>> + * in that case the userspace must provide an IOVA window characterized by > >>> + * @size and @alignment using VFIO_IOMMU_MAP_DMA with RESERVED_MSI_IOVA flag. > >>> + */ > >>> +struct vfio_iommu_type1_info_cap_msi_geometry { > >>> + struct vfio_info_cap_header header; > >>> + __u32 flags; > >>> +#define VFIO_IOMMU_MSI_GEOMETRY_RESERVED (1 << 0) /* reserved geometry */ > >>> + /* not reserved */ > >>> + __u32 order; /* iommu page order used for aperture alignment*/ > >>> + __u64 size; /* IOVA aperture size (bytes) the userspace must provide */ > >>> + /* reserved */ > >>> + __u64 aperture_start; > >>> + __u64 aperture_end; > >> > >> Should these be a union? We never set them both. Should the !reserved > >> case have a flag as well, so the user can positively identify what's > >> being provided? > > > > Actually, is there really any need to fit both of these within the same > > structure? Part of the idea of the capability chains is we can create > > a capability for each new thing we want to describe. So, we could > > simply define a generic reserved IOVA range capability with a 'start' > > and 'end' and then another capability to define MSI mapping > > requirements. Thanks, > Yes your suggested approach makes sense to me. > > One reason why I proceeded that way is we are mixing things at iommu.h > level too. Personally I would have preferred to separate things: > 1) add a new IOMMU_CAP_TRANSLATE_MSI capability in iommu_cap > 2) rename iommu_msi_supported into "programmable" bool: reporting > whether the aperture is reserved or programmable. > > In the early releases I think it was as above but slightly we moved to a > mixed description. > > What do you think? The API certainly doesn't seem like it has a cohesive feel to me. It's not entirely clear to me how we know when we need to register a DMA MSI cookie, or how we know that the MSI doorbell API is actually initialized and in use by the MSI/IOMMU layer, or exactly what is the MSI geometry telling me. Perhaps this is why the code doesn't seem to have a good rejection mechanism for architectures that need it versus those that don't, it's too hard to tell. Maybe we can look at what we think the user API should be and work backwards. For x86 we simply have a reserved range of IOVA. I'm not entirely sure it adds to the user API to know that it's for MSI, it's just a range of IOVAs that we cannot allocate for regular DMA. In fact, we currently lack a mechanism for describing the IOVA space of the IOMMU at all, so rather than focusing on a mechanism to describe a hole in the IOVA space, we might simply want to focus on a mechanism to describe the available IOVA space. Everybody needs that, not just x86. That sort of sounds like a VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE that perhaps looks like: struct vfio_iommu_type1_info_cap_iova_range { struct vfio_info_cap_header header; u64 start; u64 end; }; Clearly we need to allow multiple of these in the capability chain since the existing x86 MSI range bisects this address space. To support this, we basically need the same information from the IOMMU API. We already have DOMAIN_ATTR_GEOMETRY, which should give us the base IOVA range, but we don't have anything describing the gaps. We don't know how many sources of gaps we'll have in the future, but let's keep it simple and assume we can look for MSI gaps and add other possible sources of gaps in the future, it's an internal API after all. So we can use DOMAIN_ATTR_MSI_GEOMETRY to tell us about the (we assume one) MSI range of reserved IOVA within DOMAIN_ATTR_GEOMETRY. For x86 this is fixed, for SMMU this is a zero range until someone programs it. Now, what does a user need to know to add a reserved MSI IOVA range? They need to know a) that it needs to be done, and b) how big to make it (and maybe alignment requirements). Really all we need to describe then is b) since b) implies a). So maybe that gives us another capability chain entry: struct vfio_iommu_type1_info_cap_msi_resv { struct vfio_info_cap_header header; u64 size; u64 alignment; }; It doesn't seem like we need to waste a flag bit on vfio_iommu_type1_info.flags for this since the existence of this capability would imply that VFIO_IOMMU_MAP_DMA supports an MSI_RESV flag. So what do we need from the kernel infrastructure to make that happen? Well, we need a) and b) above, and again b) can imply a), so if the IOMMU API provided a DOMAIN_ATTR_MSI_RESV, providing the same size/alignment, then we're nearly there. Then we just need a way to set that range, which I'd probably try to plumb through the IOMMU API rather than pulling in separate doorbell APIs and DMA cookie APIs. If it's going to pull together all those different things, let's at least only do that in one place so we can expose a consistent API through the IOMMU API. Obviously once a range is set, DOMAIN_ATTR_MSI_RESV should report that range, so if the user were to look at the type1 info capability chain again, the available IOVA ranges would reflect the now reserved range. Maybe that's more than you're asking for, but that's the approach I would take to solidify the API. Thanks, Alex -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html