Re: Updating arm-smmu.c to support NVIDIA's Xavier

Hi David,

On 2018-09-14 12:07 AM, David Gilhooley wrote:
History:

NVIDIA’s Xavier (Tegra194) SoC has multiple SMMU instances that must be coordinated together. Specifically, there are two instances of ARM’s MMU-500 shared between coherent DMA devices and one instance of ARM’s MMU-500 for non-coherent DMA devices. The two coherent MMU-500s exist to double the memory bandwidth available from a single MMU-500. This is the reason why Tegra194 does not work with anything but our own Linux fork.

An IOMMU’able device will not know which SMMU it will use. For example, a device’s memory request could go to MMU 1 or MMU 2. Coherent memory requests are swizzled in hardware between the two depending on bandwidth.

Ugh, that sounds like a great way to double TLB and walk cache misses unless the interconnect is essentially prescient, and probably more than double the already painful TLB maintenance overhead, but hey ho, what's done is done.

For this reason we program the two coherent MMUs identically. They share page table memory so a device will get the same physical address if it queries MMU 1 or MMU 2.

Is the SMMU integration entirely identical other than the memory map - i.e. stream IDs, context interrupts, etc. - or are there other differences the OS needs to be aware of?

Our current implementation involves overriding the write_l and write_q functions. The new functions write to the corresponding register in each of the 3 SMMUs, computing each address from the register's offset relative to the first SMMU’s base address. We understand that this is not a permanent fix.
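
[Editorial sketch] The write override described above could look roughly like the following user-space model. Everything here is hypothetical and for illustration only: `struct mirrored_smmu`, `mirrored_writel`, `mirrored_readl`, and `MAX_INSTANCES` are made-up names, not symbols from arm-smmu.c or from NVIDIA's fork, and a real driver would use the kernel's MMIO accessors rather than plain pointer stores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_INSTANCES 3 /* hypothetical: three MMU-500 instances, as on Xavier */

/* Hypothetical wrapper: one logical SMMU backed by several register regions. */
struct mirrored_smmu {
    volatile uint32_t *base[MAX_INSTANCES]; /* one mapped region per instance */
    size_t num_instances;
};

/* Write the same 32-bit value at the same byte offset in every instance. */
static void mirrored_writel(struct mirrored_smmu *smmu, size_t offset,
                            uint32_t val)
{
    assert(smmu->num_instances <= MAX_INSTANCES);
    assert(offset % sizeof(uint32_t) == 0);

    for (size_t i = 0; i < smmu->num_instances; i++)
        smmu->base[i][offset / sizeof(uint32_t)] = val;
}

/*
 * Assumption: configuration registers are programmed identically, so reading
 * back instance 0 is enough. Status registers would need per-instance reads.
 */
static uint32_t mirrored_readl(const struct mirrored_smmu *smmu, size_t offset)
{
    return smmu->base[0][offset / sizeof(uint32_t)];
}
```

The point of the sketch is only that callers keep using a single offset, while the accessor fans the write out to every instance.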

Proposal - Updating arm-smmu.c driver:

Extend the SMMUv2 device tree binding to allow a single device tree node to represent N instances of an SMMUv2. The cleanest way to do this would be to allow subnodes for the N-1 additional instances (for example, with 3 SMMUs we would have the main node and 2 subnodes). Each subnode would have a “reg” property for its address range, plus that SMMU’s interrupt lines.

Well, my preference would be to program them identically into bypass then leave them that way, but I guess you do actually want to remap GPU buffers and such ;)

If it really is just a case of duplicating every register write, then I think the least-worst option would be to echo what rockchip-iommu does for their dual-master VOP, which is actually not far off what you have already. Essentially, add a Tegra-specific compatible and define that to have two regions in its "reg" property.

If we have to cope with (effectively) more than one interrupt line per context, though, that's probably going to need some rather more invasive rework of both binding and driver, and I'd really like to understand the full details and implications before making any suggestions.

We would also need a way for devices to specify which SMMUs they would like to attach to.

Wait, I thought the whole point was that a device "will not know which SMMU it will use"? :/

If we have the option of statically partitioning stream IDs between the parallel SMMUs, that would be a little less 'out there' and could be described by the existing bindings; it might just take a bit of work in the driver internals to make the SMMU association per-set-of-stream-IDs rather than per-device.
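
[Editorial sketch] A static partitioning of stream IDs between parallel SMMUs could be modelled as a simple range lookup, with the association keyed on the stream ID rather than on the device. The table contents, the `sid_range` type, and `smmu_for_stream_id` are invented for illustration; the actual Tegra194 stream ID assignments are not given in this thread.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mapping entry: an inclusive stream ID range owned by one SMMU. */
struct sid_range {
    uint16_t first;
    uint16_t last;
    int smmu_index;
};

/* Made-up example partition; real values would come from the platform. */
static const struct sid_range sid_map[] = {
    { 0x00, 0x3f, 0 },
    { 0x40, 0x7f, 1 },
};

/* Return the index of the SMMU that owns this stream ID, or -1 if none. */
static int smmu_for_stream_id(uint16_t sid)
{
    for (size_t i = 0; i < sizeof(sid_map) / sizeof(sid_map[0]); i++) {
        if (sid >= sid_map[i].first && sid <= sid_map[i].last)
            return sid_map[i].smmu_index;
    }
    return -1;
}
```

With this shape, a device that masters several stream IDs can end up associated with more than one SMMU, which is the per-set-of-stream-IDs association mentioned above.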

Robin.

We believe this proposal will have the best performance. Because we are working in the arm-smmu.c file, we can easily have the multiple SMMUs share page tables and we can do all of the TLB flushing in parallel.
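
[Editorial sketch] One way to read "TLB flushing in parallel" is to kick off the sync on every instance first and only then wait for each to finish, so the waits overlap instead of running back to back. The model below is a user-space approximation under that assumption: `struct tlb_regs`, its field names, and `tlb_sync_all` are hypothetical, loosely modelled on a "write to start a sync, poll an active bit" register pair, and are not taken from arm-smmu.c.

```c
#include <stddef.h>
#include <stdint.h>

#define SYNC_ACTIVE 0x1u /* hypothetical "sync still in progress" status bit */

/* Hypothetical per-instance register pair used for TLB synchronisation. */
struct tlb_regs {
    volatile uint32_t sync;   /* a write here starts a sync (assumption) */
    volatile uint32_t status; /* SYNC_ACTIVE is set while the sync is pending */
};

/*
 * Start a sync on all instances, then wait for each one in turn.
 * Returns 0 on success, -1 if any instance is still busy after max_spins polls.
 */
static int tlb_sync_all(struct tlb_regs *inst[], size_t n, unsigned max_spins)
{
    /* Phase 1: kick every instance before waiting on any of them. */
    for (size_t i = 0; i < n; i++)
        inst[i]->sync = 0;

    /* Phase 2: poll each instance; later ones have been running meanwhile. */
    for (size_t i = 0; i < n; i++) {
        unsigned spins = 0;

        while (inst[i]->status & SYNC_ACTIVE) {
            if (++spins > max_spins)
                return -1;
        }
    }
    return 0;
}
```

The total wait is then bounded by the slowest instance rather than by the sum of all instances, which is the benefit the paragraph above is claiming.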

Please let us know if this proposal will work, or if there is a better
solution. We will create and test patches once we know that we will
have the support of upstream.

Best,
David


