On Thu, Feb 22, 2018 at 10:45 PM, Robin Murphy <robin.murphy@xxxxxxx> wrote:
> [sorry, I had intended to reply sooner but clearly forgot]
>
> On 16/02/18 00:13, Tomasz Figa wrote:
>> On Fri, Feb 16, 2018 at 2:14 AM, Robin Murphy <robin.murphy@xxxxxxx> wrote:
>>> On 15/02/18 04:17, Tomasz Figa wrote:
>>> [...]
>>>>> Could you elaborate on what kind of locking you are concerned about?
>>>>> As I explained before, the normally happening fast path would lock
>>>>> dev->power_lock only for the brief moment of incrementing the runtime
>>>>> PM usage counter.
>>>>
>>>> My bad, that's not even it.
>>>>
>>>> The atomic usage counter is incremented beforehand, without any
>>>> locking [1], and the spinlock is acquired only for the sake of
>>>> validating that the device's runtime PM state indeed remained valid [2].
>>>> That would be the case in the fast path of the same driver doing two
>>>> mappings in parallel, with the master powered on (and so the SMMU,
>>>> through device links; if the master was not powered on already, powering
>>>> on the SMMU is unavoidable anyway and would add much more latency
>>>> than the spinlock itself).
>>>
>>> We now have no locking at all in the map path, and only a per-domain lock
>>> around TLB sync in unmap, which is unfortunately necessary for correctness;
>>> the latter isn't too terrible, since in "serious" hardware it should only
>>> be serialising a few CPUs serving the same device against each other (e.g.
>>> for multiple queues on a single NIC).
>>>
>>> Putting in a global lock which serialises *all* concurrent map and unmap
>>> calls for *all* unrelated devices makes things worse. Period. Even if the
>>> lock itself were held for the minimum possible time, i.e. trivially
>>> "spin_lock(&lock); spin_unlock(&lock)", the cost of repeatedly bouncing
>>> that one cache line around between 96 CPUs across two sockets is not
>>> negligible.
>>
>> Fair enough. Note that we're in a quite interesting situation now:
>> a) We need to have runtime PM enabled on the Qualcomm SoC to have power
>> properly managed,
>> b) We need to have lock-free map/unmap on such distributed systems,
>> c) If runtime PM is enabled, we need to call into runtime PM from any
>> code that does hardware accesses, otherwise the IOMMU API (and so the DMA
>> API and then any V4L2 driver) becomes unusable.
>>
>> I can see one more way that could potentially let us have all three.
>> How about enabling runtime PM only on selected implementations
>> (e.g. qcom,smmu) and then having all the runtime PM calls surrounded
>> with if (pm_runtime_enabled()), which is lockless?
>
> Yes, that's the kind of thing I was gravitating towards - my vague thought
> was adding some flag to the smmu_domain, but pm_runtime_enabled() does look
> conceptually a lot cleaner.

Great, thanks. Looks like we're in agreement now. \o/

Vivek, does this sound reasonable to you?

Best regards,
Tomasz
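
P.S. For concreteness, here is a minimal sketch of the guard being proposed.
The helper names arm_smmu_rpm_get()/arm_smmu_rpm_put() and the smmu->dev
field are illustrative assumptions, not existing code; pm_runtime_enabled(),
pm_runtime_get_sync() and pm_runtime_put() are the standard kernel runtime
PM APIs.

#include <linux/pm_runtime.h>

/*
 * Sketch only: wrap every runtime PM call in the driver with helpers like
 * these. pm_runtime_enabled() just reads dev->power.disable_depth without
 * taking any lock, so on implementations that never enable runtime PM the
 * map/unmap paths stay completely lock-free, while on e.g. qcom,smmu the
 * usual runtime PM reference counting applies.
 */
static inline int arm_smmu_rpm_get(struct arm_smmu_device *smmu)
{
	if (pm_runtime_enabled(smmu->dev))
		return pm_runtime_get_sync(smmu->dev);

	return 0;
}

static inline void arm_smmu_rpm_put(struct arm_smmu_device *smmu)
{
	if (pm_runtime_enabled(smmu->dev))
		pm_runtime_put(smmu->dev);
}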