Hi Alex, On 07/10/2016 22:38, Alex Williamson wrote: > On Fri, 7 Oct 2016 19:10:27 +0200 > Auger Eric <eric.auger@xxxxxxxxxx> wrote: > >> Hi Alex, >> >> On 06/10/2016 22:42, Alex Williamson wrote: >>> On Thu, 6 Oct 2016 14:20:40 -0600 >>> Alex Williamson <alex.williamson@xxxxxxxxxx> wrote: >>> >>>> On Thu, 6 Oct 2016 08:45:31 +0000 >>>> Eric Auger <eric.auger@xxxxxxxxxx> wrote: >>>> >>>>> This patch allows the user-space to retrieve the MSI geometry. The >>>>> implementation is based on capability chains, now also added to >>>>> VFIO_IOMMU_GET_INFO. >>>>> >>>>> The returned info comprise: >>>>> - whether the MSI IOVA are constrained to a reserved range (x86 case) and >>>>> in the positive, the start/end of the aperture, >>>>> - or whether the IOVA aperture need to be set by the userspace. In that >>>>> case, the size and alignment of the IOVA window to be provided are >>>>> returned. >>>>> >>>>> In case the userspace must provide the IOVA aperture, we currently report >>>>> a size/alignment based on all the doorbells registered by the host kernel. >>>>> This may exceed the actual needs. >>>>> >>>>> Signed-off-by: Eric Auger <eric.auger@xxxxxxxxxx> >>>>> >>>>> --- >>>>> v11 -> v11: >>>>> - msi_doorbell_pages was renamed msi_doorbell_calc_pages >>>>> >>>>> v9 -> v10: >>>>> - move cap_offset after iova_pgsizes >>>>> - replace __u64 alignment by __u32 order >>>>> - introduce __u32 flags in vfio_iommu_type1_info_cap_msi_geometry and >>>>> fix alignment >>>>> - call msi-doorbell API to compute the size/alignment >>>>> >>>>> v8 -> v9: >>>>> - use iommu_msi_supported flag instead of programmable >>>>> - replace IOMMU_INFO_REQUIRE_MSI_MAP flag by a more sophisticated >>>>> capability chain, reporting the MSI geometry >>>>> >>>>> v7 -> v8: >>>>> - use iommu_domain_msi_geometry >>>>> >>>>> v6 -> v7: >>>>> - remove the computation of the number of IOVA pages to be provisionned. >>>>> This number depends on the domain/group/device topology which can >>>>> dynamically change. Let's rely instead rely on an arbitrary max depending >>>>> on the system >>>>> >>>>> v4 -> v5: >>>>> - move msi_info and ret declaration within the conditional code >>>>> >>>>> v3 -> v4: >>>>> - replace former vfio_domains_require_msi_mapping by >>>>> more complex computation of MSI mapping requirements, especially the >>>>> number of pages to be provided by the user-space. >>>>> - reword patch title >>>>> >>>>> RFC v1 -> v1: >>>>> - derived from >>>>> [RFC PATCH 3/6] vfio: Extend iommu-info to return MSIs automap state >>>>> - renamed allow_msi_reconfig into require_msi_mapping >>>>> - fixed VFIO_IOMMU_GET_INFO >>>>> --- >>>>> drivers/vfio/vfio_iommu_type1.c | 78 ++++++++++++++++++++++++++++++++++++++++- >>>>> include/uapi/linux/vfio.h | 32 ++++++++++++++++- >>>>> 2 files changed, 108 insertions(+), 2 deletions(-) >>>>> >>>>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c >>>>> index dc3ee5d..ce5e7eb 100644 >>>>> --- a/drivers/vfio/vfio_iommu_type1.c >>>>> +++ b/drivers/vfio/vfio_iommu_type1.c >>>>> @@ -38,6 +38,8 @@ >>>>> #include <linux/workqueue.h> >>>>> #include <linux/dma-iommu.h> >>>>> #include <linux/msi-doorbell.h> >>>>> +#include <linux/irqdomain.h> >>>>> +#include <linux/msi.h> >>>>> >>>>> #define DRIVER_VERSION "0.2" >>>>> #define DRIVER_AUTHOR "Alex Williamson <alex.williamson@xxxxxxxxxx>" >>>>> @@ -1101,6 +1103,55 @@ static int vfio_domains_have_iommu_cache(struct vfio_iommu *iommu) >>>>> return ret; >>>>> } >>>>> >>>>> +static int compute_msi_geometry_caps(struct vfio_iommu *iommu, >>>>> + struct vfio_info_cap *caps) >>>>> +{ >>>>> + struct vfio_iommu_type1_info_cap_msi_geometry *vfio_msi_geometry; >>>>> + unsigned long order = __ffs(vfio_pgsize_bitmap(iommu)); >>>>> + struct iommu_domain_msi_geometry msi_geometry; >>>>> + struct vfio_info_cap_header *header; >>>>> + struct vfio_domain *d; >>>>> + bool reserved; >>>>> + size_t size; >>>>> + >>>>> + mutex_lock(&iommu->lock); >>>>> + /* All domains have same require_msi_map property, pick first */ >>>>> + d = list_first_entry(&iommu->domain_list, struct vfio_domain, next); >>>>> + iommu_domain_get_attr(d->domain, DOMAIN_ATTR_MSI_GEOMETRY, >>>>> + &msi_geometry); >>>>> + reserved = !msi_geometry.iommu_msi_supported; >>>>> + >>>>> + mutex_unlock(&iommu->lock); >>>>> + >>>>> + size = sizeof(*vfio_msi_geometry); >>>>> + header = vfio_info_cap_add(caps, size, >>>>> + VFIO_IOMMU_TYPE1_INFO_CAP_MSI_GEOMETRY, 1); >>>>> + >>>>> + if (IS_ERR(header)) >>>>> + return PTR_ERR(header); >>>>> + >>>>> + vfio_msi_geometry = container_of(header, >>>>> + struct vfio_iommu_type1_info_cap_msi_geometry, >>>>> + header); >>>>> + >>>>> + vfio_msi_geometry->flags = reserved; >>>> >>>> Use the bit flag VFIO_IOMMU_MSI_GEOMETRY_RESERVED >>>> >>>>> + if (reserved) { >>>>> + vfio_msi_geometry->aperture_start = msi_geometry.aperture_start; >>>>> + vfio_msi_geometry->aperture_end = msi_geometry.aperture_end; >>>> >>>> But maybe nobody has set these, did you intend to use >>>> iommu_domain_msi_aperture_valid(), which you defined early on but never >>>> used? >>>> >>>>> + return 0; >>>>> + } >>>>> + >>>>> + vfio_msi_geometry->order = order; >>>> >>>> I'm tempted to suggest that a user could do the same math on their own >>>> since we provide the supported bitmap already... could it ever not be >>>> the same? >>>> >>>>> + /* >>>>> + * we compute a system-wide requirement based on all the registered >>>>> + * doorbells >>>>> + */ >>>>> + vfio_msi_geometry->size = >>>>> + msi_doorbell_calc_pages(order) * ((uint64_t) 1 << order); >>>>> + >>>>> + return 0; >>>>> +} >>>>> + >>>>> static long vfio_iommu_type1_ioctl(void *iommu_data, >>>>> unsigned int cmd, unsigned long arg) >>>>> { >>>>> @@ -1122,8 +1173,10 @@ static long vfio_iommu_type1_ioctl(void *iommu_data, >>>>> } >>>>> } else if (cmd == VFIO_IOMMU_GET_INFO) { >>>>> struct vfio_iommu_type1_info info; >>>>> + struct vfio_info_cap caps = { .buf = NULL, .size = 0 }; >>>>> + int ret; >>>>> >>>>> - minsz = offsetofend(struct vfio_iommu_type1_info, iova_pgsizes); >>>>> + minsz = offsetofend(struct vfio_iommu_type1_info, cap_offset); >>>>> >>>>> if (copy_from_user(&info, (void __user *)arg, minsz)) >>>>> return -EFAULT; >>>>> @@ -1135,6 +1188,29 @@ static long vfio_iommu_type1_ioctl(void *iommu_data, >>>>> >>>>> info.iova_pgsizes = vfio_pgsize_bitmap(iommu); >>>>> >>>>> + ret = compute_msi_geometry_caps(iommu, &caps); >>>>> + if (ret) >>>>> + return ret; >>>>> + >>>>> + if (caps.size) { >>>>> + info.flags |= VFIO_IOMMU_INFO_CAPS; >>>>> + if (info.argsz < sizeof(info) + caps.size) { >>>>> + info.argsz = sizeof(info) + caps.size; >>>>> + info.cap_offset = 0; >>>>> + } else { >>>>> + vfio_info_cap_shift(&caps, sizeof(info)); >>>>> + if (copy_to_user((void __user *)arg + >>>>> + sizeof(info), caps.buf, >>>>> + caps.size)) { >>>>> + kfree(caps.buf); >>>>> + return -EFAULT; >>>>> + } >>>>> + info.cap_offset = sizeof(info); >>>>> + } >>>>> + >>>>> + kfree(caps.buf); >>>>> + } >>>>> + >>>>> return copy_to_user((void __user *)arg, &info, minsz) ? >>>>> -EFAULT : 0; >>>>> >>>>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h >>>>> index 4a9dbc2..8dae013 100644 >>>>> --- a/include/uapi/linux/vfio.h >>>>> +++ b/include/uapi/linux/vfio.h >>>>> @@ -488,7 +488,35 @@ struct vfio_iommu_type1_info { >>>>> __u32 argsz; >>>>> __u32 flags; >>>>> #define VFIO_IOMMU_INFO_PGSIZES (1 << 0) /* supported page sizes info */ >>>>> - __u64 iova_pgsizes; /* Bitmap of supported page sizes */ >>>>> +#define VFIO_IOMMU_INFO_CAPS (1 << 1) /* Info supports caps */ >>>>> + __u64 iova_pgsizes; /* Bitmap of supported page sizes */ >>>>> + __u32 __resv; >>>>> + __u32 cap_offset; /* Offset within info struct of first cap */ >>>>> +}; >>>> >>>> I understand the padding, but not the ordering. Why not end with >>>> padding? >>>> >>>>> + >>>>> +#define VFIO_IOMMU_TYPE1_INFO_CAP_MSI_GEOMETRY 1 >>>>> + >>>>> +/* >>>>> + * The MSI geometry capability allows to report the MSI IOVA geometry: >>>>> + * - either the MSI IOVAs are constrained within a reserved IOVA aperture >>>>> + * whose boundaries are given by [@aperture_start, @aperture_end]. >>>>> + * this is typically the case on x86 host. The userspace is not allowed >>>>> + * to map userspace memory at IOVAs intersecting this range using >>>>> + * VFIO_IOMMU_MAP_DMA. >>>>> + * - or the MSI IOVAs are not requested to belong to any reserved range; >>>>> + * in that case the userspace must provide an IOVA window characterized by >>>>> + * @size and @alignment using VFIO_IOMMU_MAP_DMA with RESERVED_MSI_IOVA flag. >>>>> + */ >>>>> +struct vfio_iommu_type1_info_cap_msi_geometry { >>>>> + struct vfio_info_cap_header header; >>>>> + __u32 flags; >>>>> +#define VFIO_IOMMU_MSI_GEOMETRY_RESERVED (1 << 0) /* reserved geometry */ >>>>> + /* not reserved */ >>>>> + __u32 order; /* iommu page order used for aperture alignment*/ >>>>> + __u64 size; /* IOVA aperture size (bytes) the userspace must provide */ >>>>> + /* reserved */ >>>>> + __u64 aperture_start; >>>>> + __u64 aperture_end; >>>> >>>> Should these be a union? We never set them both. Should the !reserved >>>> case have a flag as well, so the user can positively identify what's >>>> being provided? >>> >>> Actually, is there really any need to fit both of these within the same >>> structure? Part of the idea of the capability chains is we can create >>> a capability for each new thing we want to describe. So, we could >>> simply define a generic reserved IOVA range capability with a 'start' >>> and 'end' and then another capability to define MSI mapping >>> requirements. Thanks, >> Yes your suggested approach makes sense to me. >> >> One reason why I proceeded that way is we are mixing things at iommu.h >> level too. Personally I would have preferred to separate things: >> 1) add a new IOMMU_CAP_TRANSLATE_MSI capability in iommu_cap >> 2) rename iommu_msi_supported into "programmable" bool: reporting >> whether the aperture is reserved or programmable. >> >> In the early releases I think it was as above but slightly we moved to a >> mixed description. >> >> What do you think? > > The API certainly doesn't seem like it has a cohesive feel to me. It's > not entirely clear to me how we know when we need to register a DMA MSI > cookie, or how we know that the MSI doorbell API is actually > initialized and in use by the MSI/IOMMU layer, or exactly what is the > MSI geometry telling me. Perhaps this is why the code doesn't seem to > have a good rejection mechanism for architectures that need it versus > those that don't, it's too hard to tell. > > Maybe we can look at what we think the user API should be and work > backwards. For x86 we simply have a reserved range of IOVA. I'm not > entirely sure it adds to the user API to know that it's for MSI, it's > just a range of IOVAs that we cannot allocate for regular DMA. In > fact, we currently lack a mechanism for describing the IOVA space of > the IOMMU at all, so rather than focusing on a mechanism to describe a > hole in the IOVA space, we might simply want to focus on a mechanism to > describe the available IOVA space. Everybody needs that, not just > x86. That sort of sounds like a VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE > that perhaps looks like: > > struct vfio_iommu_type1_info_cap_iova_range { > struct vfio_info_cap_header header; > u64 start; > u64 end; > }; > > Clearly we need to allow multiple of these in the capability chain > since the existing x86 MSI range bisects this address space. > > To support this, we basically need the same information from the IOMMU > API. We already have DOMAIN_ATTR_GEOMETRY, which should give us the > base IOVA range, but we don't have anything describing the gaps. We > don't know how many sources of gaps we'll have in the future, but let's > keep it simple and assume we can look for MSI gaps and add other > possible sources of gaps in the future, it's an internal API after all. > So we can use DOMAIN_ATTR_MSI_GEOMETRY to tell us about the (we assume > one) MSI range of reserved IOVA within DOMAIN_ATTR_GEOMETRY. For x86 > this is fixed, for SMMU this is a zero range until someone programs it. > > Now, what does a user need to know to add a reserved MSI IOVA range? > They need to know a) that it needs to be done, and b) how big to make > it (and maybe alignment requirements). Really all we need to describe > then is b) since b) implies a). So maybe that gives us another > capability chain entry: > > struct vfio_iommu_type1_info_cap_msi_resv { > struct vfio_info_cap_header header; > u64 size; > u64 alignment; > }; I like the approach and I like the idea to separate the 2 issues in separate structs, both at VFIO level and IOMMU level. It makes even more sense now we have the other requirement to handle host PCIe host bridge window. > > It doesn't seem like we need to waste a flag bit on > vfio_iommu_type1_info.flags for this since the existence of this > capability would imply that VFIO_IOMMU_MAP_DMA supports an MSI_RESV > flag. I agree. > > So what do we need from the kernel infrastructure to make that happen? > Well, we need a) and b) above, and again b) can imply a), so if the > IOMMU API provided a DOMAIN_ATTR_MSI_RESV, providing the same > size/alignment, then we're nearly there. Agreed Then we just need a way to > set that range, which I'd probably try to plumb through the IOMMU API > rather than pulling in separate doorbell APIs and DMA cookie APIs. If > it's going to pull together all those different things, let's at least > only do that in one place so we can expose a consistent API through the > IOMMU API. Obviously once a range is set, DOMAIN_ATTR_MSI_RESV should > report that range, so if the user were to look at the type1 info > capability chain again, the available IOVA ranges would reflect the now > reserved range. So my plan is to respin the passthrough series with vfio_iommu_type1_info_cap_msi_resv and associated iommu struct. I would prefer to send a separate series to report IOVA usable address space. Thanks Eric > > Maybe that's more than you're asking for, but that's the approach I > would take to solidify the API. Thanks, > > Alex > > _______________________________________________ > linux-arm-kernel mailing list > linux-arm-kernel@xxxxxxxxxxxxxxxxxxx > http://lists.infradead.org/mailman/listinfo/linux-arm-kernel > -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html