On Thu, Jul 07, 2022 at 11:55:52PM +1000, Alexey Kardashevskiy wrote: > Historically PPC64 managed to avoid using iommu_ops. The VFIO driver > uses a SPAPR TCE sub-driver and all iommu_ops uses were kept in > the Type1 VFIO driver. Recent development though has added a coherency > capability check to the generic part of VFIO and essentially disabled > VFIO on PPC64; the similar story about iommu_group_dma_owner_claimed(). > > This adds an iommu_ops stub which reports support for cache > coherency. Because bus_set_iommu() triggers IOMMU probing of PCI devices, > this provides minimum code for the probing to not crash. > > Because now we have to set iommu_ops to the system (bus_set_iommu() or > iommu_device_register()), this requires the POWERNV PCI setup to happen > after bus_register(&pci_bus_type) which is postcore_initcall > TODO: check if it still works, read sha1, for more details: > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5537fcb319d016ce387 > > Because setting the ops triggers probing, this does not work well with > iommu_group_add_device(), hence the move to iommu_probe_device(). > > Because iommu_probe_device() does not take the group (which is why > we had the helper in the first place), this adds > pci_controller_ops::device_group. > > So, basically there is one iommu_device per PHB and devices are added to > groups indirectly via series of calls inside the IOMMU code. > > pSeries is out of scope here (a minor fix needed for barely supported > platform in regard to VFIO). > > The previous discussion is here: > https://patchwork.ozlabs.org/project/kvm-ppc/patch/20220701061751.1955857-1-aik@xxxxxxxxx/ I think this is basically OK, for what it is. It looks like there is more some-day opportunity to make use of the core infrastructure though. > does it make sense to have this many callbacks, or > the generic IOMMU code can safely operate without some > (given I add some more checks for !NULL)? thanks, I wouldn't worry about it.. > @@ -1156,7 +1158,10 @@ int iommu_add_device(struct iommu_table_group *table_group, struct device *dev) > pr_debug("%s: Adding %s to iommu group %d\n", > __func__, dev_name(dev), iommu_group_id(table_group->group)); > > - return iommu_group_add_device(table_group->group, dev); > + ret = iommu_probe_device(dev); > + dev_info(dev, "probed with %d\n", ret); For another day, but it seems a bit strange to call iommu_probe_device() like this? Shouldn't one of the existing call sites cover this? The one in of_iommu.c perhaps? > +static bool spapr_tce_iommu_is_attach_deferred(struct device *dev) > +{ > + return false; > +} I think you can NULL this op: static bool iommu_is_attach_deferred(struct device *dev) { const struct iommu_ops *ops = dev_iommu_ops(dev); if (ops->is_attach_deferred) return ops->is_attach_deferred(dev); return false; } > +static struct iommu_group *spapr_tce_iommu_device_group(struct device *dev) > +{ > + struct pci_controller *hose; > + struct pci_dev *pdev; > + > + /* Weirdly iommu_device_register() assigns the same ops to all buses */ > + if (!dev_is_pci(dev)) > + return ERR_PTR(-EPERM); > + > + pdev = to_pci_dev(dev); > + hose = pdev->bus->sysdata; > + > + if (!hose->controller_ops.device_group) > + return ERR_PTR(-ENOENT); > + > + return hose->controller_ops.device_group(hose, pdev); > +} Is this missing a refcount get on the group? > + > +static int spapr_tce_iommu_attach_dev(struct iommu_domain *dom, > + struct device *dev) > +{ > + return 0; > +} It is important when this returns that the iommu translation is all emptied. There should be no left over translations from the DMA API at this point. I have no idea how power works in this regard, but it should be explained why this is safe in a comment at a minimum. It will turn into a security problem to allow kernel mappings to leak past this point. Jason