On Fri, Sep 09, 2022 at 10:05:58AM +0100, Robin Murphy wrote: > On 2022-09-08 22:43, Jason Gunthorpe wrote: > > On Thu, Sep 08, 2022 at 10:27:06PM +0100, Robin Murphy wrote: > > > > > Oh, because s390 is using iommu_get_domain_for_dev() in its release_device > > > callback, which needs to dereference the group to work, and the current > > > domain may also be a non-default one which we can't prevent from > > > disappearing racily, that was why :( > > > > Hum, the issue there is the use of device->iommu_group - but that just > > means I didn't split properly. How about this incremental: > > That did cross my mind, but it's a bit grim. Actually, also in my morning, I think it may not even be necessary. Keep in mind the start of the series fixes VFIO. The bug that S390 is trying to fix is that VFIO didn't put back the group ownership, it just left its own iommu_domain attached and called release(). But now, at least for single device groups, VFIO will put owenership back and zdev->s390_domain == NULL when we get to release_device() > That then only leaves the issue that that domain may still become > invalid at any point after the group mutex has been dropped. So that is this race: CPU0 CPU1 iommu_release_device(a) __iommu_group_remove_device(a) iommu_device_use_default_domain(b) iommu_domain_free(domain) iommu_release_device(b) ops->release_device(b) ops->release_device(a) // Boom, a is still attached to domain :( I can't think of how to solve this other than holding the group mutex across release_device. See below. > > And to your other question, the reason I split the function is because > > I couldn't really say WTF iommu_group_remove_device() was supposed to > > do. The __ version make ssense as part of the remove_device, due to > > the sequencing with ops->release() > > > > But the other one doesn't have that. So I want to put in a: > > > > WARN_ON(group->blocking_domain || group->default_domain); > > > > Because calling it after those domains are allocated looks broken to > > me. > > I might be misunderstanding, but that sounds backwards - if a real device is > being hotplugged out, we absolutely expect that to happen *after* its > default domain has been set up. See below for what I mean iommu_group_remove_device() doesn't work as an API because it has no way to tell the device to stop using the domain we are about to free. So it should assert that there is no domain to worry about. For the vfio and power case there is no domain because they don't use iommu drivers For FSL it does not use default domains so it will also be OK. Also, I think FSL is broken and it should not be trying to remove the "PCI controller" from a group. Every PCI device behind an IOMMU should be linked to a group. The only reason I can think someone would wanted to do this is because they ran into trouble with the VFIO viability checks. But we have a robust solution to that now that doesn't require abusing the group like this. Jason diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c index 780fb70715770d..cb83576b1877d5 100644 --- a/drivers/iommu/iommu.c +++ b/drivers/iommu/iommu.c @@ -90,6 +90,10 @@ static int iommu_create_device_direct_mappings(struct iommu_group *group, static struct iommu_group *iommu_group_get_for_dev(struct device *dev); static ssize_t iommu_group_store_type(struct iommu_group *group, const char *buf, size_t count); +static struct group_device * +__iommu_group_remove_device(struct iommu_group *group, struct device *dev); +static void __iommu_group_remove_device_sysfs(struct iommu_group *group, + struct group_device *device); #define IOMMU_GROUP_ATTR(_name, _mode, _show, _store) \ struct iommu_group_attribute iommu_group_attr_##_name = \ @@ -330,18 +334,50 @@ int iommu_probe_device(struct device *dev) void iommu_release_device(struct device *dev) { + struct iommu_group *group = dev->iommu_group; const struct iommu_ops *ops; + struct group_device *device; if (!dev->iommu) return; iommu_device_unlink(dev->iommu->iommu_dev, dev); + mutex_lock(&group->mutex); + device = __iommu_group_remove_device(group, dev); ops = dev_iommu_ops(dev); + + /* + * If the group has become empty then ownership must have been released, + * and the current domain must be set back to NULL or the default + * domain. + */ + if (list_empty(&group->devices)) + WARN_ON(group->owner_cnt || + group->domain != group->default_domain); + + /* + * release_device() must stop using any attached domain on the device. + * If there are still other devices in the group they are not effected + * by this callback. + * + * The IOMMU driver must set the device to either an identity or + * blocking translation and stop using any domain pointer, as it is + * going to be freed. + */ if (ops->release_device) ops->release_device(dev); + mutex_unlock(&group->mutex); + + __iommu_group_remove_device_sysfs(group, device); + + /* + * This will eventually call iommu_group_release() which will free the + * iommu_domains. + */ + dev->iommu_group = NULL; + kobject_put(group->devices_kobj); - iommu_group_remove_device(dev); module_put(ops->owner); dev_iommu_free(dev); } @@ -939,44 +975,69 @@ int iommu_group_add_device(struct iommu_group *group, struct device *dev) } EXPORT_SYMBOL_GPL(iommu_group_add_device); -/** - * iommu_group_remove_device - remove a device from it's current group - * @dev: device to be removed - * - * This function is called by an iommu driver to remove the device from - * it's current group. This decrements the iommu group reference count. - */ -void iommu_group_remove_device(struct device *dev) +static struct group_device * +__iommu_group_remove_device(struct iommu_group *group, struct device *dev) { - struct iommu_group *group = dev->iommu_group; - struct group_device *tmp_device, *device = NULL; + struct group_device *device; + + lockdep_assert_held(&group->mutex); if (!group) - return; + return NULL; dev_info(dev, "Removing from iommu group %d\n", group->id); - mutex_lock(&group->mutex); - list_for_each_entry(tmp_device, &group->devices, list) { - if (tmp_device->dev == dev) { - device = tmp_device; + list_for_each_entry(device, &group->devices, list) { + if (device->dev == dev) { list_del(&device->list); - break; + return device; } } - mutex_unlock(&group->mutex); + return NULL; +} +static void __iommu_group_remove_device_sysfs(struct iommu_group *group, + struct group_device *device) +{ if (!device) return; sysfs_remove_link(group->devices_kobj, device->name); - sysfs_remove_link(&dev->kobj, "iommu_group"); + sysfs_remove_link(&device->dev->kobj, "iommu_group"); - trace_remove_device_from_group(group->id, dev); + trace_remove_device_from_group(group->id, device->dev); kfree(device->name); kfree(device); - dev->iommu_group = NULL; +} + +/** + * iommu_group_remove_device - remove a device from it's current group + * @dev: device to be removed + * + * This function is called by an iommu driver to remove the device from + * it's current group. This decrements the iommu group reference count. + */ +void iommu_group_remove_device(struct device *dev) +{ + struct iommu_group *group = dev->iommu_group; + struct group_device *device; + + /* + * Since we don't do ops->release_device() there is no way for the + * driver to stop using any attached domain before we free it. If a + * domain is attached while this function is called it will eventually + * UAF. + * + * Thus it is only useful for cases like VFIO/SPAPR that don't use an + * iommu driver, or for cases like FSL that don't use default domains. + */ + WARN_ON(group->domain); + + mutex_lock(&group->mutex); + device = __iommu_group_remove_device(group, dev); + mutex_unlock(&group->mutex); + __iommu_group_remove_device_sysfs(group, device); kobject_put(group->devices_kobj); } EXPORT_SYMBOL_GPL(iommu_group_remove_device);