Hi David,
On 2018-09-14 12:07 AM, David Gilhooley wrote:
History:
NVIDIA’s Xavier (Tegra194) SoC has multiple SMMU instances that must be
coordinated together. Specifically, there are two instances of ARM’s
MMU-500 shared between coherent DMA devices and one instance of ARM
MMU-500 for non-coherent DMA devices. The two MMU-500s were created to
double the memory bandwidth of a single MMU-500. This is the reason why
Tegra194 does not work with anything but our own Linux fork.
An IOMMU’able device will not know which SMMU it will use. For example,
a device’s memory request could go to MMU 1 or MMU 2. The coherent
memory request gets swizzled in hardware between the two depending on
bandwidth.
Ugh, that sounds like a great way to double TLB and walk cache misses
unless the interconnect is essentially prescient, and probably more than
double the already painful TLB maintenance overhead, but hey ho, what's
done is done.
For this reason we program the two coherent MMUs identically. They share
page table memory so a device will get the same physical address if it
queries MMU 1 or MMU 2.
Is the SMMU integration entirely identical other than the memory map -
i.e. stream IDs, context interrupts, etc. - or are there other
differences the OS needs to be aware of?
Our current implementation involves overriding the write_l and write_q
functions. The new functions write to the registers of each of the 3
SMMUs by computing the offset from the first SMMU’s base address. We
understand that this is not a permanent fix.
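For illustration only, the mirrored-write approach described above might look roughly like the following user-space sketch. It is not the actual Tegra code: the names smmu_group, smmu_group_init and smmu_group_write32 are hypothetical, and plain arrays stand in for the MMIO windows so the example can run outside a kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_SMMUS 3   /* assumption: three instances, as described for Tegra194 */
#define SMMU_REGS 16  /* assumption: tiny fake register file per instance */

/* Plain arrays standing in for each instance's MMIO window. */
uint32_t fake_mmio[NUM_SMMUS][SMMU_REGS];

/* Hypothetical container for the base of every instance. */
struct smmu_group {
    uint32_t *base[NUM_SMMUS];
};

void smmu_group_init(struct smmu_group *g)
{
    for (int i = 0; i < NUM_SMMUS; i++)
        g->base[i] = fake_mmio[i];
}

/*
 * Mirror a 32-bit register write: 'addr' is an address inside the FIRST
 * instance; its offset from the first base is replayed on every instance.
 */
void smmu_group_write32(struct smmu_group *g, uint32_t *addr, uint32_t val)
{
    ptrdiff_t idx = addr - g->base[0];

    assert(idx >= 0 && idx < SMMU_REGS);
    for (int i = 0; i < NUM_SMMUS; i++)
        g->base[i][idx] = val;
}
```

A caller would keep addressing registers through the first instance only, e.g. smmu_group_write32(&g, g.base[0] + 5, val), and the same value lands at the same offset in all three instances.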
Proposal - Updating arm-smmu.c driver:
Extend the SMMUv2 device tree binding to allow a single device tree node
to represent N instances of an SMMUv2. The cleanest way to do this would
be to allow subnodes for the N-1 instances (for example, with 3 SMMUs we
would have the main node and 2 subnodes). The subnodes would have a
“reg” property for the address range and the SMMU’s interrupt lines.
Well, my preference would be to program them identically into bypass
then leave them that way, but I guess you do actually want to remap GPU
buffers and such ;)
If it really is just a case of duplicating every register write, then I
think the least-worst option would be to echo what rockchip-iommu does
for their dual-master VOP, which is actually not far off what you have
already. Essentially, add a Tegra-specific compatible and define that to
have two regions in its "reg" property.
If we have to cope with (effectively) more than one interrupt line per
context, though, that's probably going to need some rather more invasive
rework of both binding and driver, and I'd really like to understand the
full details and implications before making any suggestions.
We would also need a way for devices to specify which SMMUs they would
like to attach to.
Wait, I thought the whole point was that a device "will not know which
SMMU it will use"? :/
If we have the option of statically partitioning stream IDs between the
parallel SMMUs, that would be a little less 'out there' and could be
described by the existing bindings; it might just take a bit of work in
the driver internals to make the SMMU association per-set-of-stream-IDs
rather than per-device.
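As a rough sketch of what a per-set-of-stream-IDs association could mean in the driver internals, a lookup along these lines would pick the instance from the stream ID rather than from the device. The table contents, the sid_range type and the smmu_for_sid name are all made up for illustration; nothing here reflects the real Tegra194 stream ID assignment.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: one contiguous range of stream IDs owned by one SMMU instance. */
struct sid_range {
    uint16_t first;
    uint16_t last;
    int smmu;
};

/* Made-up static partition; real values would come from the platform description. */
static const struct sid_range sid_map[] = {
    { 0x00, 0x3f, 0 },
    { 0x40, 0x7f, 1 },
};

/* Return the index of the SMMU instance that owns 'sid', or -1 if none does. */
int smmu_for_sid(uint16_t sid)
{
    for (size_t i = 0; i < sizeof(sid_map) / sizeof(sid_map[0]); i++) {
        if (sid >= sid_map[i].first && sid <= sid_map[i].last)
            return sid_map[i].smmu;
    }
    return -1;
}
```

With such a table, a device that owns stream IDs in both ranges would simply end up associated with both instances, one per range.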
Robin.
We believe this proposal will have the best performance. Because we are
working in the arm-smmu.c file, we can easily have the multiple SMMUs
share page tables and we can do all of the TLB flushing in parallel.
Please let us know if this proposal will work, or if there is a better
solution. We will create and test patches once we know that we will
have the support of upstream.
Best,
David