On 23/07/2019 13:09, Jon Hunter wrote:
On 23/07/2019 11:29, Robin Murphy wrote:
On 23/07/2019 11:07, Jose Abreu wrote:
From: Jon Hunter <jonathanh@xxxxxxxxxx>
Date: Jul/23/2019, 11:01:24 (UTC+00:00)
This appears to be a winner: disabling the SMMU for the ethernet
controller and reverting commit 954a03be033c7cef80ddc232e7cbdb17df735663
worked! So yes, this appears to be related to the SMMU being enabled. We
had to enable the SMMU for ethernet recently due to commit
954a03be033c7cef80ddc232e7cbdb17df735663.
Finally :)
However, from "git show 954a03be033c7cef80ddc232e7cbdb17df735663":
+ There are few reasons to allow unmatched stream bypass, and
+ even fewer good ones. If saying YES here breaks your board
+ you should work on fixing your board.
So, how can we fix this? Is your ethernet DT node marked as
"dma-coherent;"?
The first thing to try would be booting the failing setup with
"iommu.passthrough=1" (or using CONFIG_IOMMU_DEFAULT_PASSTHROUGH) - if
that makes things seem OK, then the problem is likely related to address
translation; if not, then it's probably time to start looking at nasties
like coherency and ordering, although in principle I wouldn't expect the
SMMU to have too much impact there.
Setting "iommu.passthrough=1" works for me. However, I am not sure where
to go from here, so any ideas you have would be great.
OK, so that really implies it's something to do with the addresses. From
a quick skim of the patch, I'm wondering if it's possible for buf->addr
and buf->page->dma_addr to get out-of-sync at any point. The nature of
the IOVA allocator makes it quite likely that a stale DMA address will
have been reused for a new mapping, so putting the wrong address in a
descriptor may well mean the DMA still ends up hitting a valid
translation, but which is now pointing to a different page.
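To illustrate the failure mode described above, here is a minimal userspace
sketch, not kernel code. Everything in it is hypothetical: the toy_* names, the
four-slot table, and the "lowest free slot" policy, which merely stands in for
the real IOVA allocator's tendency to hand back recently freed addresses. It
only shows why a stale address left in a descriptor need not fault.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model: the slot index plays the role of an IOVA, the stored value
 * plays the role of the physical page that IOVA currently translates to. */
#define TOY_SLOTS    4
#define TOY_UNMAPPED 0u

struct toy_iommu {
    uint32_t page[TOY_SLOTS];   /* TOY_UNMAPPED means "no translation" */
};

/* Map a page at the lowest free slot (assumed policy; makes reuse certain). */
int toy_map(struct toy_iommu *mmu, uint32_t phys_page)
{
    assert(phys_page != TOY_UNMAPPED);
    for (int iova = 0; iova < TOY_SLOTS; iova++) {
        if (mmu->page[iova] == TOY_UNMAPPED) {
            mmu->page[iova] = phys_page;
            return iova;
        }
    }
    return -1;                  /* table full */
}

/* Tear down the translation for one IOVA. */
void toy_unmap(struct toy_iommu *mmu, int iova)
{
    assert(iova >= 0 && iova < TOY_SLOTS);
    mmu->page[iova] = TOY_UNMAPPED;
}

/* What a device access would hit; TOY_UNMAPPED would be a fault. */
uint32_t toy_translate(const struct toy_iommu *mmu, int iova)
{
    assert(iova >= 0 && iova < TOY_SLOTS);
    return mmu->page[iova];
}

/* A descriptor keeps an IOVA across an unmap/remap cycle. */
uint32_t toy_stale_descriptor_demo(void)
{
    struct toy_iommu mmu = { {0} };

    int desc_addr = toy_map(&mmu, 0xA000); /* descriptor records this IOVA   */
    toy_unmap(&mmu, desc_addr);            /* buffer recycled, desc untouched */
    (void)toy_map(&mmu, 0xB000);           /* same IOVA is handed out again  */

    /* The stale IOVA still translates, but to the new page. */
    return toy_translate(&mmu, desc_addr);
}
```

In this model the stale access resolves to page 0xB000 rather than faulting,
which is the kind of silent corruption that would not show up as an SMMU
context fault.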
Do you know if the SMMU interrupts are working correctly? If not, it's
possible that an incorrect address or mapping direction could lead to
the DMA transaction just being silently terminated without any fault
indication, which generally presents as inexplicable weirdness (I've
certainly seen that on another platform with the mix of an unsupported
interrupt controller and an 'imperfect' ethernet driver).
If I simply remove the iommu node for the ethernet controller, then I
see lots of ...
[ 6.296121] arm-smmu 12000000.iommu: Unexpected global fault, this could be serious
[ 6.296125] arm-smmu 12000000.iommu: GFSR 0x00000002, GFSYNR0 0x00000000, GFSYNR1 0x00000014, GFSYNR2 0x00000000
So I assume that this is triggering the SMMU interrupt correctly.
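For reference, the logged register values can be decoded with a small helper
like the one below. This is a sketch under stated assumptions: the helper names
are made up, the USF bit position (bit 1 of the global fault status register)
follows the SMMUv2 register layout, and the stream ID is read here as the low
16 bits of GFSYNR1. If 0x14 is the ethernet controller's stream ID on this
board, the log would be consistent with the now-unmatched ethernet stream
being blocked.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit 1 of the global fault status register: Unidentified Stream Fault. */
#define GFSR_USF (1u << 1)

/* True if the global fault was raised for a stream with no matching entry. */
bool gfsr_is_unidentified_stream(uint32_t gfsr)
{
    return (gfsr & GFSR_USF) != 0;
}

/* Stream ID of the faulting transaction (assumed: low 16 bits of GFSYNR1). */
uint16_t gfsynr1_stream_id(uint32_t gfsynr1)
{
    return (uint16_t)(gfsynr1 & 0xffffu);
}

/* Sanity check against the values from the log above. */
void check_logged_fault(void)
{
    assert(gfsr_is_unidentified_stream(0x00000002u));
    assert(gfsynr1_stream_id(0x00000014u) == 0x14);
}
```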
According to tegra186.dtsi it appears you're using the MMU-500 combined
interrupt, so if global faults are being delivered then context faults
*should* also, but I'd be inclined to try a quick hack of the relevant
stmmac_desc_ops::set_addr callback to write some bogus unmapped address
just to make sure arm_smmu_context_fault() then screams as expected, and
we're not missing anything else.
Robin.