Re: [PATCH kernel] powerpc/iommu: Add iommu_ops to report capabilities and allow blocking domains

Alexey Kardashevskiy <aik@xxxxxxxxx> · Fri, 8 Jul 2022 16:34:55 +1000






On 7/8/22 15:00, Alexey Kardashevskiy wrote:


On 7/8/22 01:10, Jason Gunthorpe wrote:
On Thu, Jul 07, 2022 at 11:55:52PM +1000, Alexey Kardashevskiy wrote:
Historically PPC64 managed to avoid using iommu_ops. The VFIO driver
uses a SPAPR TCE sub-driver and all iommu_ops uses were kept in
the Type1 VFIO driver. Recent development though has added a coherency
capability check to the generic part of VFIO and essentially disabled
VFIO on PPC64; the similar story about iommu_group_dma_owner_claimed().

This adds an iommu_ops stub which reports support for cache
coherency. Because bus_set_iommu() triggers IOMMU probing of PCI 
devices,
this provides minimum code for the probing to not crash.

Because now we have to set iommu_ops to the system (bus_set_iommu() or
iommu_device_register()), this requires the POWERNV PCI setup to happen
after bus_register(&pci_bus_type) which is postcore_initcall
TODO: check if it still works, read sha1, for more details:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5537fcb319d016ce387

Because setting the ops triggers probing, this does not work well with
iommu_group_add_device(), hence the move to iommu_probe_device().

Because iommu_probe_device() does not take the group (which is why
we had the helper in the first place), this adds
pci_controller_ops::device_group.

So, basically there is one iommu_device per PHB and devices are added to
groups indirectly via series of calls inside the IOMMU code.

pSeries is out of scope here (a minor fix needed for barely supported
platform in regard to VFIO).

The previous discussion is here:
https://patchwork.ozlabs.org/project/kvm-ppc/patch/20220701061751.1955857-1-aik@xxxxxxxxx/

I think this is basically OK, for what it is. It looks like there is
more some-day opportunity to make use of the core infrastructure though.

does it make sense to have this many callbacks, or
the generic IOMMU code can safely operate without some
(given I add some more checks for !NULL)? thanks,

I wouldn't worry about it..

@@ -1156,7 +1158,10 @@ int iommu_add_device(struct iommu_table_group 
*table_group, struct device *dev)
      pr_debug("%s: Adding %s to iommu group %d\n",
           __func__, dev_name(dev),  
iommu_group_id(table_group->group));
-    return iommu_group_add_device(table_group->group, dev);
+    ret = iommu_probe_device(dev);
+    dev_info(dev, "probed with %d\n", ret);

For another day, but it seems a bit strange to call 
iommu_probe_device() like this?
Shouldn't one of the existing call sites cover this? The one in
of_iommu.c perhaps?


It looks to me that of_iommu.c expects the iommu setup to happen before 
linux starts as linux looks for #iommu-cells or iommu-map properties in 
the device tree. The powernv firmware (aka skiboot) does not do this and 
it is linux which manages iommu groups.


+static bool spapr_tce_iommu_is_attach_deferred(struct device *dev)
+{
+       return false;
+}

I think you can NULL this op:

static bool iommu_is_attach_deferred(struct device *dev)
{
    const struct iommu_ops *ops = dev_iommu_ops(dev);

    if (ops->is_attach_deferred)
        return ops->is_attach_deferred(dev);

    return false;
}

+static struct iommu_group *spapr_tce_iommu_device_group(struct 
device *dev)
+{
+    struct pci_controller *hose;
+    struct pci_dev *pdev;
+
+    /* Weirdly iommu_device_register() assigns the same ops to all 
buses */
+    if (!dev_is_pci(dev))
+        return ERR_PTR(-EPERM);
+
+    pdev = to_pci_dev(dev);
+    hose = pdev->bus->sysdata;
+
+    if (!hose->controller_ops.device_group)
+        return ERR_PTR(-ENOENT);
+
+    return hose->controller_ops.device_group(hose, pdev);
+}

Is this missing a refcount get on the group?

+
+static int spapr_tce_iommu_attach_dev(struct iommu_domain *dom,
+                      struct device *dev)
+{
+    return 0;
+}

It is important when this returns that the iommu translation is all
emptied. There should be no left over translations from the DMA API at
this point. I have no idea how power works in this regard, but it
should be explained why this is safe in a comment at a minimum.

 > It will turn into a security problem to allow kernel mappings to leak
 > past this point.
 >

I've added for v2 checking for no valid mappings for a device (or, more 
precisely, in the associated iommu_group), this domain does not need 
checking, right?


Uff, not that simple. Looks like once a device is in a group, its 
dma_ops is set to iommu_dma_ops and IOMMU code owns DMA. I guess then 
there is a way to set those to NULL or do something similar to let
dma_map_direct() from kernel/dma/mapping.c return "true", is not there?

For now I'll add a comment in spapr_tce_iommu_attach_dev() that it is 
fine to do nothing as tce_iommu_take_ownership() and 
tce_iommu_take_ownership_ddw() take care of not having active DMA 
mappings. Thanks,



In general, is "domain" something from hardware or it is a software 
concept? Thanks,


Jason


--
Alexey