On Sun, Sep 19, 2021 at 02:38:42PM +0800, Liu Yi L wrote:
> An I/O address space takes effect in the iommu only after it's attached
> by a device. This patch provides iommufd_device_[de/at]tach_ioasid()
> helpers for this purpose. One device can be only attached to one ioasid
> at this point, but one ioasid can be attached by multiple devices.
>
> The caller specifies the iommufd_device (returned at binding time) and
> the target ioasid when calling the helper function. Upon request, iommufd
> installs the specified I/O page table to the correct place in the IOMMU,
> according to the routing information (struct device* which represents
> RID) recorded in iommufd_device. Future variants could allow the caller
> to specify additional routing information (e.g. pasid/ssid) when multiple
> I/O address spaces are supported per device.
>
> Open:
> Per Jason's comment in below link, bus-specific wrappers are recommended.
> This RFC implements one wrapper for pci device. But it looks that struct
> pci_device is not used at all since iommufd_device already carries all
> necessary info. So want to have another discussion on its necessity, e.g.
> whether making more sense to have bus-specific wrappers for binding, while
> leaving a common attaching helper per iommufd_device.
> https://lore.kernel.org/linux-iommu/20210528233649.GB3816344@xxxxxxxxxx/
>
> TODO:
> When multiple devices are attached to a same ioasid, the permitted iova
> ranges and supported pgsize bitmap on this ioasid should be a common
> subset of all attached devices. iommufd needs to track such info per
> ioasid and update it every time when a new device is attached to the
> ioasid. This has not been done in this version yet, due to the temporary
> hack adopted in patch 16-18. The hack reuses vfio type1 driver which
> already includes the necessary logic for iova ranges and pgsize bitmap.
> Once we get a clear direction for those patches, that logic will be moved
> to this patch.
>
> Signed-off-by: Liu Yi L <yi.l.liu@xxxxxxxxx>
>
>  drivers/iommu/iommufd/iommufd.c | 226 ++++++++++++++++++++++++++++++++
>  include/linux/iommufd.h         |  29 ++++
>  2 files changed, 255 insertions(+)
>
> diff --git a/drivers/iommu/iommufd/iommufd.c b/drivers/iommu/iommufd/iommufd.c
> index e45d76359e34..25373a0e037a 100644
> +++ b/drivers/iommu/iommufd/iommufd.c
> @@ -51,6 +51,19 @@ struct iommufd_ioas {
>          bool enforce_snoop;
>          struct iommufd_ctx *ictx;
>          refcount_t refs;
> +        struct mutex lock;
> +        struct list_head device_list;
> +        struct iommu_domain *domain;

This should just be another xarray indexed by the device id

> +/* Caller should hold ioas->lock */
> +static struct ioas_device_info *ioas_find_device(struct iommufd_ioas *ioas,
> +                                                 struct iommufd_device *idev)
> +{
> +        struct ioas_device_info *ioas_dev;
> +
> +        list_for_each_entry(ioas_dev, &ioas->device_list, next) {
> +                if (ioas_dev->idev == idev)
> +                        return ioas_dev;
> +        }

Which eliminates this search. xarray with tightly packed indexes is
generally more efficient than linked lists..
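Something like this completely untested sketch is what I mean -
'device_xa', 'dev_id' and the helper names are made up here, the index
would be whatever ID the device already got at bind time:

struct iommufd_ioas {
        /* ... existing members ... */
        struct xarray device_xa;  /* attached devices, indexed by dev_id */
};

static int ioas_add_device(struct iommufd_ioas *ioas,
                           struct iommufd_device *idev)
{
        void *old;

        /* Claims the slot atomically, so no separate duplicate check */
        old = xa_cmpxchg(&ioas->device_xa, idev->dev_id, NULL, idev,
                         GFP_KERNEL);
        if (xa_is_err(old))
                return xa_err(old);
        if (old)
                return -EBUSY;  /* already attached to this ioas */
        return 0;
}

static void ioas_del_device(struct iommufd_ioas *ioas,
                            struct iommufd_device *idev)
{
        xa_erase(&ioas->device_xa, idev->dev_id);
}

A plain xa_load() does the lookup where it is actually needed, without
walking a list.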
> +static int ioas_check_device_compatibility(struct iommufd_ioas *ioas,
> +                                           struct device *dev)
> +{
> +        bool snoop = false;
> +        u32 addr_width;
> +        int ret;
> +
> +        /*
> +         * currently we only support I/O page table with iommu enforce-snoop
> +         * format. Attaching a device which doesn't support this format in its
> +         * upstreaming iommu is rejected.
> +         */
> +        ret = iommu_device_get_info(dev, IOMMU_DEV_INFO_FORCE_SNOOP, &snoop);
> +        if (ret || !snoop)
> +                return -EINVAL;
> +
> +        ret = iommu_device_get_info(dev, IOMMU_DEV_INFO_ADDR_WIDTH, &addr_width);
> +        if (ret || addr_width < ioas->addr_width)
> +                return -EINVAL;
> +
> +        /* TODO: also need to check permitted iova ranges and pgsize bitmap */
> +
> +        return 0;
> +}

This seems kind of weird.. I expect the iommufd to hold a SW copy of the
IO page table and each time a new domain is to be created it should push
the SW copy into the domain. If the domain cannot support it then the
domain driver should naturally fail the request.

When the user changes the IO page table the SW copy is updated then all
of the domains are updated too - again if any domain cannot support the
change then it fails and the change is rolled back.

It seems like this is a side effect of roughly hacking in the vfio code?

> +
> +/**
> + * iommufd_device_attach_ioasid - attach device to an ioasid
> + * @idev:   [in] Pointer to struct iommufd_device.
> + * @ioasid: [in] ioasid points to an I/O address space.
> + *
> + * Returns 0 for successful attach, otherwise returns error.
> + *
> + */
> +int iommufd_device_attach_ioasid(struct iommufd_device *idev, int ioasid)

Types for the ioas_id again..

> +{
> +        struct iommufd_ioas *ioas;
> +        struct ioas_device_info *ioas_dev;
> +        struct iommu_domain *domain;
> +        int ret;
> +
> +        ioas = ioasid_get_ioas(idev->ictx, ioasid);
> +        if (!ioas) {
> +                pr_err_ratelimited("Trying to attach illegal or unkonwn IOASID %u\n", ioasid);
> +                return -EINVAL;

No prints triggered by bad userspace

> +        }
> +
> +        mutex_lock(&ioas->lock);
> +
> +        /* Check for duplicates */
> +        if (ioas_find_device(ioas, idev)) {
> +                ret = -EINVAL;
> +                goto out_unlock;
> +        }

just xa_cmpxchg NULL, XA_ZERO_ENTRY

> +        /*
> +         * Each ioas is backed by an iommu domain, which is allocated
> +         * when the ioas is attached for the first time and then shared
> +         * by following devices.
> +         */
> +        if (list_empty(&ioas->device_list)) {

Seems strange, what if the devices are forced to have different domains?
We don't want to model that in the SW layer..

> +                /* Install the I/O page table to the iommu for this device */
> +                ret = iommu_attach_device(domain, idev->dev);
> +                if (ret)
> +                        goto out_domain;

This is where things start to get confusing when you talk about PASID, as
the above call needs to be some PASID centric API.

> @@ -27,6 +28,16 @@ struct iommufd_device *
>  iommufd_bind_device(int fd, struct device *dev, u64 dev_cookie);
>  void iommufd_unbind_device(struct iommufd_device *idev);
>
> +int iommufd_device_attach_ioasid(struct iommufd_device *idev, int ioasid);
> +void iommufd_device_detach_ioasid(struct iommufd_device *idev, int ioasid);
> +
> +static inline int
> +__pci_iommufd_device_attach_ioasid(struct pci_dev *pdev,
> +                                   struct iommufd_device *idev, int ioasid)
> +{
> +        return iommufd_device_attach_ioasid(idev, ioasid);
> +}

I think since this is taking in the iommufd_device there isn't a logical
place to signal the PCIness.

But I think the API should at least have a kdoc saying that this is
capturing the entire device, and specify that for PCI this means all TLPs
with the RID.

Jason
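PS: to be a bit more concrete about the "SW copy" remark above, this is
roughly the flow I'd expect when a domain has to be created for an ioas.
Completely untested, and ioas_area/area_list and their fields are
invented names for illustration - the real thing would be whatever
structure iommufd uses to track userspace mappings:

static int ioas_populate_domain(struct iommufd_ioas *ioas,
                                struct iommu_domain *domain)
{
        struct ioas_area *area, *fail;
        int ret;

        /* Replay the SW copy of the IO page table into the new domain */
        list_for_each_entry(area, &ioas->area_list, next) {
                ret = iommu_map(domain, area->iova, area->paddr,
                                area->size, area->prot);
                if (ret)
                        goto out_unwind;
        }
        return 0;

out_unwind:
        /* Undo the partial replay if the driver rejected something */
        fail = area;
        list_for_each_entry(area, &ioas->area_list, next) {
                if (area == fail)
                        break;
                iommu_unmap(domain, area->iova, area->size);
        }
        return ret;
}

That replay is where a device whose iommu can't handle the existing
mappings would naturally fail, rather than in an explicit pre-check.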