Re: Updating arm-smmu.c to support NVIDIA's Xavier

Mikko Perttunen <cyndis@xxxxxxxx> · Tue, 25 Sep 2018 11:40:34 +0900

On 18/09/2018 18.51, Robin Murphy wrote:
Hi David,

On 2018-09-14 12:07 AM, David Gilhooley wrote:
History:

NVIDIA’s Xavier (Tegra194) SoC has multiple SMMU instances that must 
be coordinated together. Specifically, there are two instances of ARM’s
MMU-500 shared between coherent DMA devices and one instance of ARM
MMU-500 for non-coherent DMA devices. The two MMU-500s were created to 
double the memory bandwidth of a single MMU-500. This is the reason why
Tegra194 does not work with anything but our own Linux fork.

An IOMMU’able device will not know which SMMU it will use. For 
example, a device’s memory request could go to MMU 1 or MMU 2. The 
coherent memory request gets swizzled in hardware between the two 
depending on bandwidth.

Ugh, that sounds like a great way to double TLB and walk cache misses 
unless the interconnect is essentially prescient, and probably more than 
double the already painful TLB maintenance overhead, but hey ho, what's 
done is done.

For this reason we program the two coherent MMUs identically. They 
share page table memory so a device will get the same physical address 
if it queries MMU 1 or MMU 2.

Is the SMMU integration entirely identical other than the memory map - 
i.e. stream IDs, context interrupts, etc. - or are there other 
differences the OS needs to be aware of?

Our current implementation involves overriding the write_l and write_q 
functions. The new functions repeat each register write on all 3 SMMUs, 
reusing the register's offset from the first SMMU’s base address. We 
understand that this is not a permanent fix.
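
A minimal sketch of the write-mirroring idea described above, for illustration only. It is not the code from NVIDIA's tree or from arm-smmu.c: the struct, the function names (mirrored_write32, mirrored_read32) and the instance limit are all hypothetical, and plain arrays stand in for the ioremapped register windows.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_INSTANCES 3  /* hypothetical limit; the thread mentions 3 SMMUs */

/* Hypothetical stand-in for the mapped register windows of each instance. */
struct mirrored_smmu {
    volatile uint32_t *base[MAX_INSTANCES];
    size_t count;  /* number of instances actually mirrored */
};

/* Write one 32-bit register at the same byte offset in every instance. */
static void mirrored_write32(struct mirrored_smmu *smmu, size_t offset,
                             uint32_t val)
{
    assert(smmu->count <= MAX_INSTANCES);
    assert(offset % sizeof(uint32_t) == 0);

    for (size_t i = 0; i < smmu->count; i++)
        smmu->base[i][offset / sizeof(uint32_t)] = val;
}

/*
 * Reads consult only the first instance. This assumes all instances are
 * kept identical, which holds for configuration registers but not
 * necessarily for status registers.
 */
static uint32_t mirrored_read32(const struct mirrored_smmu *smmu,
                                size_t offset)
{
    return smmu->base[0][offset / sizeof(uint32_t)];
}
```

Keeping the fan-out inside the accessors means the rest of a driver can stay unaware of how many instances sit behind it, which is roughly the property the override approach relies on.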

Proposal - Updating arm-smmu.c driver:

Extend the SMMUv2 device tree binding to allow a single device tree 
node to represent N instances of an SMMUv2. The cleanest way to do 
this would be to allow subnodes for the N-1 instances (for example, 
with 3 SMMUs we would have the main node and 2 subnodes). Each 
subnode would have a “reg” property for its address range, along with 
that SMMU’s interrupt lines.

Well, my preference would be to program them identically into bypass 
then leave them that way, but I guess you do actually want to remap GPU 
buffers and such ;)

If it really is just a case of duplicating every register write, then I 
think the least-worst option would be to echo what rockchip-iommu does 
for their dual-master VOP, which is actually not far off what you have 
already. Essentially, add a Tegra-specific compatible and define that to 
have two regions in its "reg" property.

If we have to cope with (effectively) more than one interrupt line per 
context, though, that's probably going to need some rather more invasive 
rework of both binding and driver, and I'd really like to understand the 
full details and implications before making any suggestions.

My understanding is indeed that it is sufficient to just duplicate 
register writes. There is only one interrupt line per context, and 
stream IDs are shared. David or Krishna should correct me if 
something extra is needed.

We would also need a way for devices to specify which SMMUs they would 
like to attach to.
Wait, I thought the whole point was that a device "will not know which 
SMMU it will use"? :/

I'm not sure what this refers to either. There's the third non-coherent 
SMMU, but I thought that could be modeled as a fully independent SMMU 
instance.

Thanks,
Mikko

If we have the option of statically partitioning stream IDs between the 
parallel SMMUs, that would be a little less 'out there' and could be 
described by the existing bindings; it might just take a bit of work in 
the driver internals to make the SMMU association per-set-of-stream-IDs 
rather than per-device.

Robin.

We believe this proposal will have the best performance. Because we 
are working in the arm-smmu.c file, we can easily have the multiple 
SMMUs share page tables and we can do all of the TLB flushing in 
parallel.
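
One way to read "TLB flushing in parallel" is to start the sync on every instance first and only then wait for each one, so the instances drain concurrently rather than one after another. The sketch below illustrates that ordering under stated assumptions: the register indices, the busy bit and the spin limit are illustrative placeholders rather than real MMU-500 offsets, and tlb_sync_all is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative word indices and bit, not real MMU-500 register offsets. */
#define TLBSYNC_WORD    0
#define TLBSTATUS_WORD  1
#define STATUS_ACTIVE   0x1u
#define SYNC_SPIN_LIMIT 1000  /* hypothetical bound so the poll cannot hang */

/*
 * Issue a TLB sync on all instances, then wait for all of them.
 * Returns 0 if every instance went idle, -1 if any timed out.
 */
static int tlb_sync_all(volatile uint32_t *bases[], size_t count)
{
    int ret = 0;

    /* Phase 1: start the sync everywhere before waiting on anything. */
    for (size_t i = 0; i < count; i++)
        bases[i][TLBSYNC_WORD] = 0;

    /* Phase 2: poll each instance until its busy bit clears. */
    for (size_t i = 0; i < count; i++) {
        int spins = 0;

        while (bases[i][TLBSTATUS_WORD] & STATUS_ACTIVE) {
            if (++spins >= SYNC_SPIN_LIMIT) {
                ret = -1;  /* record the timeout, keep checking the rest */
                break;
            }
        }
    }
    return ret;
}
```

With this ordering the total wait is bounded by the slowest instance instead of the sum over all instances, which is the performance argument made above.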

Please let us know if this proposal will work, or if there is a better
solution. We will create and test patches once we know that we will
have the support of upstream.

Best,
David