On Fri, Feb 16, 2018 at 9:13 AM, Tomasz Figa <tfiga@xxxxxxxxxxxx> wrote: > On Fri, Feb 16, 2018 at 2:14 AM, Robin Murphy <robin.murphy@xxxxxxx> wrote: >> On 15/02/18 04:17, Tomasz Figa wrote: >> [...] >>>> >>>> Could you elaborate on what kind of locking you are concerned about? >>>> As I explained before, the normally happening fast path would lock >>>> dev->power_lock only for the brief moment of incrementing the runtime >>>> PM usage counter. >>> >>> >>> My bad, that's not even it. >>> >>> The atomic usage counter is incremented beforehands, without any >>> locking [1] and the spinlock is acquired only for the sake of >>> validating that device's runtime PM state remained valid indeed [2], >>> which would be the case in the fast path of the same driver doing two >>> mappings in parallel, with the master powered on (and so the SMMU, >>> through device links; if master was not powered on already, powering >>> on the SMMU is unavoidable anyway and it would add much more latency >>> than the spinlock itself). >> >> >> We now have no locking at all in the map path, and only a per-domain lock >> around TLB sync in unmap which is unfortunately necessary for correctness; >> the latter isn't too terrible, since in "serious" hardware it should only be >> serialising a few cpus serving the same device against each other (e.g. for >> multiple queues on a single NIC). >> >> Putting in a global lock which serialises *all* concurrent map and unmap >> calls for *all* unrelated devices makes things worse. Period. Even if the >> lock itself were held for the minimum possible time, i.e. trivially >> "spin_lock(&lock); spin_unlock(&lock)", the cost of repeatedly bouncing that >> one cache line around between 96 CPUs across two sockets is not negligible. > > Fair enough. Note that we're in a quite interesting situation now: > a) We need to have runtime PM enabled on Qualcomm SoC to have power > properly managed, > b) We need to have lock-free map/unmap on such distributed systems, > c) If runtime PM is enabled, we need to call into runtime PM from any > code that does hardware accesses, otherwise the IOMMU API (and so DMA > API and then any V4L2 driver) becomes unusable. > > I can see one more way that could potentially let us have all the > three. How about enabling runtime PM only on selected implementations > (e.g. qcom,smmu) and then having all the runtime PM calls surrounded > with if (pm_runtime_enabled()), which is lockless? > Sorry for pinging, but any opinion on this kind of approach? Best regards, Tomasz >> >>> [1] >>> http://elixir.free-electrons.com/linux/v4.16-rc1/source/drivers/base/power/runtime.c#L1028 >>> [2] >>> http://elixir.free-electrons.com/linux/v4.16-rc1/source/drivers/base/power/runtime.c#L613 >>> >>> In any case, I can't imagine this working with V4L2 or anything else >>> relying on any memory management more generic than calling IOMMU API >>> directly from the driver, with the IOMMU device having runtime PM >>> enabled, but without managing the runtime PM from the IOMMU driver's >>> callbacks that need access to the hardware. As I mentioned before, >>> only the IOMMU driver knows when exactly the real hardware access >>> needs to be done (e.g. Rockchip/Exynos don't need to do that for >>> map/unmap if the power is down, but some implementations of SMMU with >>> TLB powered separately might need to do so). >> >> >> It's worth noting that Exynos and Rockchip are relatively small >> self-contained IP blocks integrated closely with the interfaces of their >> relevant master devices; SMMU is an architecture, implementations of which >> may be large, distributed, and have complex and wildly differing internal >> topologies. As such, it's a lot harder to make hardware-specific assumptions >> and/or be correct for all possible cases. >> >> Don't get me wrong, I do ultimately agree that the IOMMU driver is the only >> agent who ultimately knows what calls are going to be necessary for whatever >> operation it's performing on its own hardware*; it's just that for SMMU it >> needs to be implemented in a way that has zero impact on the cases where it >> doesn't matter, because it's not viable to specialise that driver for any >> particular IP implementation/use-case. > > Still, exactly the same holds for the low power embedded use cases, > where we strive for the lowest possible power consumption, while > maintaining performance levels high as well. And so the SMMU code is > expected to also work with our use cases, such as V4L2 or DRM drivers. > Since these points don't hold for current SMMU code, I could say that > the it has been already specialized for large, distributed > implementations. > > Best regards, > Tomasz -- To unsubscribe from this list: send the line "unsubscribe linux-arm-msm" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html