Re: [PATCH] iommu/io-pgtable-arm: Optimize partial walk flush for large scatter-gather list

Sai Prakash Ranjan <saiprakash.ranjan@xxxxxxxxxxxxxx> · Tue, 15 Jun 2021 17:21:05 +0530

Hi Krishna,

On 2021-06-14 23:18, Krishna Reddy wrote:
Right but we won't know until we profile the specific usecases or try 
them in
generic workload to see if they affect the performance. Sure, over 
invalidation is
a concern where multiple buffers can be mapped to same context and the 
cache
is not usable at the time for lookup and such but we don't do it for 
small buffers
and only for large buffers which means thousands of TLB entry mappings 
in
which case TLBIASID is preferred (note: I mentioned the HW team
recommendation to use it for anything greater than 128 TLB entries) in 
my
earlier reply. And also note that we do this only for partial walk 
flush, we are not
arbitrarily changing all the TLBIs to ASID based.

Most of the heavy bw use cases does involve processing larger buffers.
When the physical memory is allocated dis-contiguously at page_size
(let's use 4KB here)
granularity, each aligned 2MB chunks IOVA unmap would involve
performing a TLBIASID
as 2MB is not a leaf. Essentially, It happens all the time during
large buffer unmaps and
potentially impact active traffic on other large buffers. Depending on 
how much
latency HW engines can absorb, the overflow/underflow issues for ISO
engines can be
sporadic and vendor specific.
Performing TLBIASID as default for all SoCs is not a safe operation.

Ok so from what I gather from this is that its not easy to test for the
negative impact and you don't have data on such yet and the behaviour is
very vendor specific. To add on qcom impl, we have several performance
improvements for TLB cache invalidations in HW like wait-for-safe(for 
realtime
clients such as camera and display) and few others to allow for cache
lookups/updates when TLBI is in progress for the same context bank, so 
atleast
we are good here.

I am no camera expert but from what the camera team mentioned is that 
there
is a thread which frees memory(large unused memory buffers) 
periodically which
ends up taking around 100+ms and causing some camera test failures 
with
frame drops. Parallel efforts are already being made to optimize this 
usage of
thread but as I mentioned previously, this is *not a camera specific*, 
lets say
someone else invokes such large unmaps, it's going to face the same 
issue.

From the above, It doesn't look like the root cause of frame drops is
fully understood.
Why is 100+ms delay causing camera frame drop?  Is the same thread
submitting the buffers
to camera after unmap is complete? If not, how is the unmap latency
causing issue here?

Ok since you are interested in camera usecase, I have requested for more 
details
from the camera team and will give it once they comeback. However I 
don't think
its good to have unmap latency at all and that is being addressed by 
this patch.

> If unmap is queued and performed on a back ground thread, would it
> resolve the frame drops?

Not sure I understand what you mean by queuing on background thread 
but with
that or not, we still do the same number of TLBIs and hop through
iommu->io-pgtable->arm-smmu to perform the the unmap, so how will that
help?

I mean adding the unmap requests into a queue and processing them from
a different thread.
It is not to reduce the TLBIs. But, not to block subsequent buffer
allocation, IOVA map requests, if they
are being requested from same thread that is performing unmap. If
unmap is already performed from
a different thread, then the issue still need to be root caused to
understand it fully. Check for any
serialization issues.

This patch is to optimize unmap latency because of large number of mmio 
writes(TLBIVAs)
wasting CPU cycles and not to fix camera issue which can probably be 
solved by
parallelization. It seems to me like you are ok with the unmap latency 
in general
which we are not and want to avoid that latency.

Hi @Robin, from these discussions it seems they are not ok with the 
change
for all SoC vendor implementations and do not have any data on such 
impact.
As I mentioned above, on QCOM platforms we do have several optimizations 
in HW
for TLBIs and would like to make use of it and reduce the unmap latency.
What do you think, should this be made implementation specific?

Thanks,
Sai

--
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a 
member
of Code Aurora Forum, hosted by The Linux Foundation