Re: Updating arm-smmu.c to support NVIDIA's Xavier

Robin Murphy <robin.murphy@xxxxxxx> · Tue, 18 Sep 2018 10:51:20 +0100

Hi David,

On 2018-09-14 12:07 AM, David Gilhooley wrote:
> History:
>
> NVIDIA’s Xavier (Tegra194) SoC has multiple SMMU instances that must be
> coordinated together. Specifically, there are two instances of ARM’s
> MMU-500 shared between coherent DMA devices and one instance of ARM’s
> MMU-500 for non-coherent DMA devices. The two coherent MMU-500s were
> created to double the memory bandwidth of a single MMU-500. This is the
> reason why Tegra194 does not work with anything but our own Linux fork.
>
> An IOMMU’able device will not know which SMMU it will use. For example,
> a device’s memory request could go to MMU 1 or MMU 2. The coherent
> memory request gets swizzled in hardware between the two depending on
> bandwidth.

Ugh, that sounds like a great way to double TLB and walk cache misses 
unless the interconnect is essentially prescient, and probably more than 
double the already painful TLB maintenance overhead, but hey ho, what's 
done is done.

> For this reason we program the two coherent MMUs identically. They share
> page table memory so a device will get the same physical address if it
> queries MMU 1 or MMU 2.

Is the SMMU integration entirely identical other than the memory map - 
i.e. stream IDs, context interrupts, etc. - or are there other 
differences the OS needs to be aware of?

> Our current implementation involves overriding the write_l and write_q
> functions. The new functions write to the same register in each of the
> 3 SMMUs, computing the offset from the first SMMU’s base address. We
> understand that this is not a permanent fix.
>
> Proposal - Updating arm-smmu.c driver:
>
> Extend the SMMUv2 device tree binding to allow a single device tree node
> to represent N instances of an SMMUv2. The cleanest way to do this would
> be to allow subnodes for the N-1 instances (for example, with 3 SMMUs we
> would have the main node and 2 subnodes). The subnodes would have a
> “reg” parameter for the address range and the SMMU’s interrupt lines.

Well, my preference would be to program them identically into bypass 
then leave them that way, but I guess you do actually want to remap GPU 
buffers and such ;)

If it really is just a case of duplicating every register write, then I 
think the least-worst option would be to echo what rockchip-iommu does 
for their dual-master VOP, which is actually not far off what you have 
already. Essentially, add a Tegra-specific compatible and define that to 
have two regions in its "reg" property.
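
For illustration only, here is a minimal user-space sketch of what
"duplicating every register write" across several regions could look
like. It is not driver code: struct smmu_group, smmu_group_writel(),
MAX_INSTANCES and WINDOW_WORDS are hypothetical names and sizes, and plain
arrays stand in for the ioremap()ed regions that a real driver would
write with writel_relaxed().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_INSTANCES 3   /* the thread mentions up to three MMU-500s */
#define WINDOW_WORDS  16  /* hypothetical, tiny register window */

/* Hypothetical grouping of identically programmed SMMU instances. */
struct smmu_group {
	volatile uint32_t *base[MAX_INSTANCES]; /* base[0]: first instance */
	size_t nr;                              /* instances in use */
};

/*
 * 'addr' points into the first instance's window. Derive the register
 * offset from that base and repeat the store in every instance.
 * Returns 0 on success, -1 if 'addr' is outside the first window.
 */
static int smmu_group_writel(const struct smmu_group *grp,
			     volatile uint32_t *addr, uint32_t val)
{
	uintptr_t first = (uintptr_t)grp->base[0];
	uintptr_t target = (uintptr_t)addr;
	size_t off, i;

	if (target < first)
		return -1;
	off = target - first;
	if (off >= WINDOW_WORDS * sizeof(uint32_t) || off % sizeof(uint32_t))
		return -1;

	for (i = 0; i < grp->nr; i++)
		grp->base[i][off / sizeof(uint32_t)] = val;
	return 0;
}

/* Usage: two fake windows, one write lands in both. */
static void smmu_group_demo(void)
{
	static uint32_t win[2][WINDOW_WORDS];
	struct smmu_group grp = { { win[0], win[1] }, 2 };

	assert(smmu_group_writel(&grp, &win[0][4], 0x1234) == 0);
	assert(win[0][4] == 0x1234 && win[1][4] == 0x1234);
}
```

Note that the sketch only covers writes; reads, and anything that differs
between instances, would need separate handling.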

If we have to cope with (effectively) more than one interrupt line per 
context, though, that's probably going to need some rather more invasive 
rework of both binding and driver, and I'd really like to understand the 
full details and implications before making any suggestions.

> We would also need a way for devices to specify which SMMUs they would
> like to attach to.

Wait, I thought the whole point was that a device "will not know which 
SMMU it will use"? :/

If we have the option of statically partitioning stream IDs between the 
parallel SMMUs, that would be a little less 'out there' and could be 
described by the existing bindings; it might just take a bit of work in 
the driver internals to make the SMMU association per-set-of-stream-IDs 
rather than per-device.
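
As a rough sketch of the per-set-of-stream-IDs idea, and assuming the
hardware really can be partitioned statically, the association could be
a simple range lookup. The table contents and the names sid_range,
sid_map and smmu_for_sid() are hypothetical; real values would have to
come from the firmware description of the system.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One contiguous block of stream IDs owned by a single SMMU instance. */
struct sid_range {
	uint16_t first; /* first stream ID, inclusive */
	uint16_t last;  /* last stream ID, inclusive */
	int smmu;       /* index of the owning SMMU instance */
};

/* Hypothetical static partition, for illustration only. */
static const struct sid_range sid_map[] = {
	{ 0x00, 0x3f, 0 },
	{ 0x40, 0x7f, 1 },
};

/* Return the SMMU index for 'sid', or -1 if no range covers it. */
static int smmu_for_sid(uint16_t sid)
{
	size_t i;

	for (i = 0; i < sizeof(sid_map) / sizeof(sid_map[0]); i++)
		if (sid >= sid_map[i].first && sid <= sid_map[i].last)
			return sid_map[i].smmu;
	return -1;
}

/* Usage: IDs on either side of the boundary map to different SMMUs. */
static void sid_map_demo(void)
{
	assert(smmu_for_sid(0x3f) == 0);
	assert(smmu_for_sid(0x40) == 1);
	assert(smmu_for_sid(0x80) == -1);
}
```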

Robin.

> We believe this proposal will have the best performance. Because we are
> working in the arm-smmu.c file, we can easily have the multiple SMMUs
> share page tables and we can do all of the TLB flushing in parallel.
>
> Please let us know if this proposal will work, or if there is a better
> solution. We will create and test patches once we know that we will
> have the support of upstream.
>
> Best,
> David