Re: [PATCH 0/3] iommu/arm-smmu: Add support to use Last level cache

Robin Murphy <robin.murphy@xxxxxxx> · Mon, 21 Jan 2019 15:15:15 +0000

On 21/01/2019 14:24, Ard Biesheuvel wrote:
On Mon, 21 Jan 2019 at 14:56, Robin Murphy <robin.murphy@xxxxxxx> wrote:

On 21/01/2019 13:36, Ard Biesheuvel wrote:
On Mon, 21 Jan 2019 at 14:25, Robin Murphy <robin.murphy@xxxxxxx> wrote:

On 21/01/2019 10:50, Ard Biesheuvel wrote:
On Mon, 21 Jan 2019 at 11:17, Vivek Gautam <vivek.gautam@xxxxxxxxxxxxxx> wrote:

Hi,

On Mon, Jan 21, 2019 at 12:56 PM Ard Biesheuvel
<ard.biesheuvel@xxxxxxxxxx> wrote:

On Mon, 21 Jan 2019 at 06:54, Vivek Gautam <vivek.gautam@xxxxxxxxxxxxxx> wrote:

Qualcomm SoCs have an additional level of cache called the
system cache, also known as the last level cache (LLC). This cache sits
right before the DDR and is tightly coupled with the memory controller.
Clients using this cache request their slices from the system cache,
activate them, and can then start using them.
For clients behind an SMMU to start using the system cache for
buffers and related page tables [1], memory attributes need to be
set accordingly. This series adds the required support.

Does this actually improve performance on reads from a device? The
non-cache coherent DMA routines perform an unconditional D-cache
invalidate by VA to the PoC before reading from the buffers filled by
the device, and I would expect the PoC to be defined as lying beyond
the LLC to still guarantee the architected behavior.

We have seen performance improvements when running Manhattan
GFXBench benchmarks.

Ah ok, that makes sense, since in that case, the data flow is mostly
to the device, not from the device.

As for the PoC, to my knowledge the system cache (LLC) on sdm845
lies beyond the point of coherency. Non-cache-coherent buffers will
not be allocated in the system cache either, and no additional software
cache maintenance ops are required for the system cache.
Pratik can add more if I am missing something.

To handle the memory attributes on the DMA API side, we can add a
DMA_ATTR definition that covers any non-coherent DMA API calls.

So does the device use the correct inner non-cacheable, outer
writeback cacheable attributes if the SMMU is in pass-through?
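
For reference, the "inner non-cacheable, outer writeback cacheable" combination asked about here can be written out as an Armv8-A MAIR-style attribute byte, where the upper nibble holds the outer attribute and the lower nibble the inner one. This is only a sketch of the encoding: the macro and function names are made up for illustration and are not taken from the series or from any driver.

```c
#include <assert.h>
#include <stdint.h>

/* Normal-memory nibble encodings in the Armv8-A MAIR_ELx attribute format. */
#define SKETCH_MAIR_NC      0x4u /* Non-cacheable */
#define SKETCH_MAIR_WB_RWA  0xfu /* Write-back, non-transient, read+write allocate */

/* Compose one 8-bit attribute: outer in bits [7:4], inner in bits [3:0]. */
static inline uint8_t sketch_mair_attr(uint8_t outer, uint8_t inner)
{
	assert(outer <= 0xf && inner <= 0xf);
	return (uint8_t)((outer << 4) | inner);
}

/* Inner non-cacheable, outer write-back: the combination discussed above. */
static inline uint8_t sketch_mair_inc_owb(void)
{
	return sketch_mair_attr(SKETCH_MAIR_WB_RWA, SKETCH_MAIR_NC);
}
```

With these encodings, inner NC plus outer WB comes out as 0xf4, while fully non-cacheable (INC-ONC) is 0x44.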

We have been looking into another use case where the fact that the
SMMU overrides memory attributes is causing issues (WC mappings used
by the radeon and amdgpu driver). So if the SMMU would honour the
existing attributes, would you still need the SMMU changes?

Even if we could force a stage 2 mapping with the weakest pagetable
attributes (such that combining would work), there would still need to
be a way to set the TCR attributes appropriately if this behaviour is
wanted for the SMMU's own table walks as well.
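
To make the TCR point concrete, the cacheability and shareability of the table walks are selected by the IRGN0, ORGN0 and SH0 fields of the LPAE-format TCR. The helper below is a rough sketch of how those fields combine; the names are hypothetical, and which combination is actually right for walks on this hardware is exactly the open question in this thread, so none of the values should be read as a recommendation.

```c
#include <stdint.h>

/* Field positions in the LPAE-format TCR (TTBR0 half). */
#define SKETCH_TCR_IRGN0_SHIFT  8
#define SKETCH_TCR_ORGN0_SHIFT  10
#define SKETCH_TCR_SH0_SHIFT    12

/* Cacheability encodings shared by IRGN0 and ORGN0. */
enum sketch_tcr_rgn {
	SKETCH_RGN_NC   = 0, /* Non-cacheable */
	SKETCH_RGN_WBWA = 1, /* Write-back, write-allocate */
	SKETCH_RGN_WT   = 2, /* Write-through */
	SKETCH_RGN_WB   = 3, /* Write-back, no write-allocate */
};

/* Shareability encodings for SH0 (value 1 is reserved). */
enum sketch_tcr_sh {
	SKETCH_SH_NS = 0, /* Non-shareable */
	SKETCH_SH_OS = 2, /* Outer shareable */
	SKETCH_SH_IS = 3, /* Inner shareable */
};

/* Build only the walk-attribute bits of the TCR; other fields are left zero. */
static inline uint32_t sketch_tcr_walk_attrs(enum sketch_tcr_rgn inner,
					      enum sketch_tcr_rgn outer,
					      enum sketch_tcr_sh sh)
{
	return ((uint32_t)inner << SKETCH_TCR_IRGN0_SHIFT) |
	       ((uint32_t)outer << SKETCH_TCR_ORGN0_SHIFT) |
	       ((uint32_t)sh << SKETCH_TCR_SH0_SHIFT);
}
```

For example, an INC-ONC outer-shareable walk would be sketch_tcr_walk_attrs(SKETCH_RGN_NC, SKETCH_RGN_NC, SKETCH_SH_OS), and the only change needed to let walks allocate in an outer cache is the ORGN0 argument.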

Isn't that just a matter of implementing support for SMMUs that lack
the 'dma-coherent' attribute?

Not quite - in general they need INC-ONC attributes in case there
actually is something in the architectural outer-cacheable domain.

But is it a problem to use INC-ONC attributes for the SMMU PTW on this
chip? AIUI, the reason for the SMMU changes is to avoid the
performance hit of snooping, which is more expensive than cache
maintenance of SMMU page tables. So are you saying the by-VA cache
maintenance is not relayed to this system cache, leaving page
table updates invisible to masters using INC-ONC attributes?

I only have a relatively vague impression of how this Qcom interconnect 
actually behaves, but AIUI the outer attribute has no correctness impact 
(it's effectively mismatched between CPU and devices already), only some 
degree of latency improvement which is effectively the opposite of 
no-snoop, in allowing certain non-coherent device traffic to still 
allocate in the LLC. I'm assuming that if that latency matters for the 
device accesses themselves, it might also matter for the associated 
table walks depending on the TLB miss rate.

Robin.