On Fri, Sep 09, 2022 at 08:55:07PM +0100, Robin Murphy wrote:
> > Isn't this every driver though? Like every single driver implementing
> > an UNMANAGED/DMA/DMA_FQ domain has a hidden reference to the
> > iommu_domain - minimally to point the HW to the IOPTEs it stores.
>
> Um, no? Domain ops get the domain passed in as an argument, which is far
> from hidden, and if any driver implemented them to ignore that argument
> and operate on something else it would be stupid and broken. Note I said
> "per-device reference", meaning things like s390's zpci_dev->s390_domain
> and SMMUv3's dev->iommu->priv->domain. It's only those references that
> are reachable from release_device - outside the normal domain lifecycle -
> which are problematic.

If the plan is to make the domain refcounted and then allow a 'put' on it
before we reach release_device(), then it means every driver needs to
hold a 'get' on the domain while it is programmed into the HW, because
the HW will still be touching memory that could otherwise be freed by an
iommu_domain_free().

By "hidden" reference I mean the HW walkers are touching memory that
would be freed - ie KASAN won't see it.

> > Do you know a reason not to hold the group mutex across
> > release_device? I think that is the most straightforward and
> > future proof.
>
> Yes, the ones documented in the code and already discussed here. The
> current functional ones aren't particularly *good* reasons, but unless
> and until they can all be cleaned up they are what they are.

Uh, I feel like I missed part of the conversation - I don't know what
this list is.. I did look.

I'm looking for a call chain that goes from release_device() into a core
function that grabs the group->mutex.

There is a comment in iommu_group_store_type() that suggests there is a
recursion but doesn't say what it is. It was an Intel person who wrote
the comment, so I checked the Intel driver more carefully and didn't
find a call path, but it is big and complicated..
There is a comment in iommu_change_dev_def_domain() about recursion on
probe_finalize(), but that is not relevant here, AFAICT.

So, I took two approaches. First I checked quickly through every
release_device looking for something. Then I looked across the entire
exported driver-facing API, focused on call chains going back toward the
core from APIs that might be trouble, and audited them almost completely.

These APIs do not take the lock, so they are completely safe:
 iommu_group_alloc
 iommu_group_set_iommudata
 iommu_group_set_name
 iommu_group_get
 iommu_group_ref_get
 iommu_group_put
 iommu_get_domain_for_dev
 iommu_fwspec_free
 iommu_fwspec_init
 iommu_fwspec_add_ids
 iommu_put_resv_regions (called from release_device)

These do take the lock. I checked all of them in all the drivers and
didn't find an obvious path to release_device:
 iommu_group_remove_device
 iommu_group_for_each_dev
 iommu_attach_device
 iommu_detach_device
 iommu_attach_group
 iommu_detach_group
 bus_set_iommu

For these I can't tell if they take the lock due to driver callbacks,
but I couldn't find them in release paths, and it doesn't make sense
that they would be there:
 iommu_device_register
 iommu_device_unregister
 iommu_domain_alloc
 iommu_domain_free

The rest of the exported interface touching the drivers - I didn't
carefully check whether they use the lock, but by name it seems unlikely
they would be called from release_device():
 iommu_register_device_fault_handler
 iommu_unregister_device_fault_handler
 iommu_report_device_fault
 iommu_page_response
 report_iommu_fault
 iommu_iova_to_phys
 iommu_map
 iommu_unmap
 iommu_alloc_resv_region
 iommu_present
 iommu_capable
 iommu_default_passthrough

It is big and complicated, so I wouldn't stake my life on it, but it
seems worth investigating further. Could the recursion have been cleaned
up already?
> > > @@ -1022,6 +1030,14 @@ void iommu_group_remove_device(struct device *dev)
> > >   	dev_info(dev, "Removing from iommu group %d\n", group->id);
> > >   	mutex_lock(&group->mutex);
> > > +	if (WARN_ON(group->domain != group->default_domain &&
> > > +		    group->domain != group->blocking_domain)) {
> >
> > This will false trigger: if there are two VFIO devices then the group
> > will remain owned when we unplug just one of them, but the group's
> > domain will be a VFIO-owned domain.
>
> As opposed to currently, where most drivers' release_device will blindly
> detach/disable the RID in some fashion so the other device would
> suddenly blow up anyway?

Er, I think it is OK today, in the non-shared case. If the RID isn't
shared then each device in the group is independent, so most drivers,
most of the time, should only affect the RID that release_device() is
called on, while this warning will always trigger for any multi-device
group.

> (It *will* actually work on SMMUv2 because SMMUv2 comprehensively
> handles StreamID-level aliasing beyond what pci_device_group() covers,
> which I remain rather proud of)

This is why I preferred not to explicitly change the domain: at least if
someone did write a non-buggy driver it doesn't get wrecked - and making
a non-buggy driver is at least allowed by the API.

> > And it misses the logic to WARN_ON if a domain is set and an external
> > entity wrongly uses iommu_group_remove_device()..
>
> Huh? An external fake group couldn't have a default domain or blocking
> domain, thus any non-NULL domain will not compare equal to either, so
> if that could happen it will warn, and then most likely crash. I did
> think briefly about trying to make it not crash, but then I remembered
> that fake groups from external callers also aren't backed by IOMMU API
> drivers so have no way to allocate or attach domains either, so in fact
> it cannot happen at all under any circumstances that are worth
> reasoning about.
I mean specifically things like what FSL is doing, where it is a real
driver calling this API, and the test of 'group->domain == NULL' is the
more robust precondition.

So, IDK, I would prefer to understand where we hit a group mutex
recursion before rejecting that path... If you know specifics please
share; otherwise maybe we should stick in a lockdep check there and see
if anything hits?

But I'm off to LPC, so I probably won't write anything more thoughtful
on this for a while.

Thanks,
Jason