On 18/09/2018 18.51, Robin Murphy wrote:
Hi David,
On 2018-09-14 12:07 AM, David Gilhooley wrote:
History:
NVIDIA’s Xavier (Tegra194) SoC has multiple SMMU instances that must
be coordinated together. Specifically, there are two instances of ARM’s
MMU-500 shared between coherent DMA devices and one instance of ARM
MMU-500 for non-coherent DMA devices. The two MMU-500s were created to
double the memory bandwidth of a single MMU-500. This is the reason why
Tegra194 does not work with anything but our own Linux fork.
An IOMMU’able device will not know which SMMU it will use. For
example, a device’s memory request could go to MMU 1 or MMU 2. The
coherent memory request gets swizzled in hardware between the two
depending on bandwidth.
Ugh, that sounds like a great way to double TLB and walk cache misses
unless the interconnect is essentially prescient, and probably more than
double the already painful TLB maintenance overhead, but hey ho, what's
done is done.
For this reason we program the two coherent MMUs identically. They
share page table memory so a device will get the same physical address
if it queries MMU 1 or MMU 2.
Is the SMMU integration entirely identical other than the memory map -
i.e. stream IDs, context interrupts, etc. - or are there other
differences the OS needs to be aware of?
Our current implementation involves overriding the write_l and write_q
functions. The new functions write to the corresponding register in each of
the 3 SMMUs, computing its address from the offset relative to the first
SMMU’s base address. We understand that this is not a permanent fix.
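For readers following along, the mirrored-write approach described above can be sketched roughly as below. This is only an illustration under stated assumptions, not the code from the NVIDIA tree: the names smmu_group, smmu_group_init, smmu_group_writel and smmu_group_readl are hypothetical, and a plain array (fake_regs) stands in for the ioremap()ed register regions so the sketch builds in userspace.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_SMMUS  3   /* assumption: three instances, as in the description */
#define SMMU_WORDS 16  /* assumption: tiny fake register file per instance */

/* Hypothetical stand-in for the memory-mapped register regions. */
static uint32_t fake_regs[NUM_SMMUS][SMMU_WORDS];

/* Hypothetical container: one base pointer per SMMU instance. */
struct smmu_group {
    volatile uint32_t *base[NUM_SMMUS];
    size_t num;
};

void smmu_group_init(struct smmu_group *g)
{
    g->num = NUM_SMMUS;
    for (size_t i = 0; i < g->num; i++)
        g->base[i] = fake_regs[i];
}

/* Mirror one 32-bit write at the same byte offset into every instance. */
void smmu_group_writel(struct smmu_group *g, uint32_t val, size_t offset)
{
    assert(offset % sizeof(uint32_t) == 0);
    assert(offset / sizeof(uint32_t) < SMMU_WORDS);
    for (size_t i = 0; i < g->num; i++)
        g->base[i][offset / sizeof(uint32_t)] = val;
}

/* Configuration reads can use the first instance, since all instances
 * are programmed identically. */
uint32_t smmu_group_readl(const struct smmu_group *g, size_t offset)
{
    assert(offset % sizeof(uint32_t) == 0);
    assert(offset / sizeof(uint32_t) < SMMU_WORDS);
    return g->base[0][offset / sizeof(uint32_t)];
}
```

A caller would run smmu_group_init() once and then route every register write through smmu_group_writel(). Status registers that can legitimately differ between instances (fault status, for example) would presumably still need per-instance reads, which this sketch does not cover.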
Proposal - Updating arm-smmu.c driver:
Extend the SMMUv2 device tree binding to allow a single device tree
node to represent N instances of an SMMUv2. The cleanest way to do
this would be to allow subnodes for the N-1 instances (for example,
with 3 SMMUs we would have the main node and 2 subnodes). Each
subnode would have a “reg” property for its address range, plus that
SMMU’s interrupt lines.
Well, my preference would be to program them identically into bypass
then leave them that way, but I guess you do actually want to remap GPU
buffers and such ;)
If it really is just a case of duplicating every register write, then I
think the least-worst option would be to echo what rockchip-iommu does
for their dual-master VOP, which is actually not far off what you have
already. Essentially, add a Tegra-specific compatible and define that to
have two regions in its "reg" property.
If we have to cope with (effectively) more than one interrupt line per
context, though, that's probably going to need some rather more invasive
rework of both binding and driver, and I'd really like to understand the
full details and implications before making any suggestions.
My understanding is indeed that it is sufficient to just duplicate
register writes. There is only one interrupt line per context, and
stream IDs are shared. David or Krishna should correct if there's
something extra needed.
We would also need a way for devices to specify which SMMUs they would
like to attach to.
Wait, I thought the whole point was that a device "will not know which
SMMU it will use"? :/
I'm not sure what this refers to either. There's the third non-coherent
SMMU, but I thought that could be modeled as a fully independent SMMU
instance.
Thanks,
Mikko
If we have the option of statically partitioning stream IDs between the
parallel SMMUs, that would be a little less 'out there' and could be
described by the existing bindings; it might just take a bit of work in
the driver internals to make the SMMU association per-set-of-stream-IDs
rather than per-device.
Robin.
We believe this proposal will have the best performance. Because we
are working in the arm-smmu.c file, we can easily have the multiple
SMMUs share page tables and we can do all of the TLB flushing in
parallel.
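One way to read "TLB flushing in parallel" is to post the invalidate and sync to every instance first and only then wait for each to finish, so the hardware latencies overlap instead of adding up. The sketch below models that ordering only. It assumes a made-up software model (fake_smmu, fake_tlb_inval_and_sync, fake_sync_active) in place of the real invalidate, sync and status registers, and smmu_group_tlb_flush is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical software model of one SMMU instance; not real hardware state. */
struct fake_smmu {
    unsigned int inval_count; /* invalidate commands received */
    unsigned int busy_polls;  /* polls left before the sync reports idle */
};

/* Model of posting an invalidate-all plus sync request; returns at once. */
void fake_tlb_inval_and_sync(struct fake_smmu *s)
{
    s->inval_count++;
    s->busy_polls = 3; /* assumption: arbitrary fake latency */
}

/* Model of reading the sync status: true while the flush is in flight. */
bool fake_sync_active(struct fake_smmu *s)
{
    if (s->busy_polls == 0)
        return false;
    s->busy_polls--;
    return true;
}

/* Flush all instances: issue everywhere first, then wait everywhere. */
void smmu_group_tlb_flush(struct fake_smmu *smmus, size_t n)
{
    assert(smmus != NULL);

    /* Phase 1: kick off the flush on every instance without waiting. */
    for (size_t i = 0; i < n; i++)
        fake_tlb_inval_and_sync(&smmus[i]);

    /* Phase 2: wait for each instance to report completion. */
    for (size_t i = 0; i < n; i++)
        while (fake_sync_active(&smmus[i]))
            ; /* a real driver would bound this loop with a timeout */
}
```

The alternative, issuing and waiting on one instance before touching the next, would serialize the flushes, which is the overhead the issue-then-wait ordering tries to avoid.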
Please let us know if this proposal will work, or if there is a better
solution. We will create and test patches once we know that we will
have the support of upstream.
Best,
David