Re: [Regression 5.5-rc1] Extremely low GPU performance on NVIDIA Tegra20/30

Thierry Reding <thierry.reding@xxxxxxxxx> · Fri, 13 Dec 2019 16:10:45 +0100

On Fri, Dec 13, 2019 at 12:25:33AM +0300, Dmitry Osipenko wrote:
> Hello Thierry,
> 
> Commit [1] introduced a severe GPU performance regression on Tegra20 and
> Tegra30 using.
> 
> [1]
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v5.5-rc1&id=fa6661b7aa0b52073681b0d26742650c8cbd30f3
> 
> Interestingly the performance is okay on Tegra30 if
> CONFIG_TEGRA_HOST1X_FIREWALL=n, but that doesn't make difference for
> Tegra20.
> 
> I was telling you about this problem on the #tegra IRC sometime ago and
> you asked to report it in a trackable form, so finally here it is.
> 
> You could reproduce the problem by running [2] like this
> `grate/texture-filter -f -s` which should produce over 100 FPS for 720p
> display resolution and currently it's ~11 FPS.
> 
> [2]
> https://github.com/grate-driver/grate/blob/master/tests/grate/texture-filter.c
> 
> Previously I was seeing some memory errors coming from Host1x DMA, but
> don't see any errors at all right now.
> 
> I don't see anything done horribly wrong in the offending commit.
> 
> Unfortunately I couldn't dedicate enough time to sit down and debug the
> problem thoroughly yet. Please let me know if you'll find a solution,
> I'll be happy to test it. Thanks in advance!

I suspect that the problem here is that we're now using the DMA API,
which causes the 32-bit ARM DMA/IOMMU glue to be used. I vaguely recall
that that code doesn't coalesce entries in the SG table, so we may end
up calling iommu_map() a lot of times, and miss out on much of the
advantages that the ->iotlb_sync_map() gives us on Tegra20.

At the same time dma_map_sg() will flush caches, which we didn't do
before. This we should be able to improve by passing the attribute
DMA_ATTR_SKIP_CPU_SYNC to dma_map_sg() when we know that the cache
maintenance isn't needed.

And while thinking about it, one other difference is that with the DMA
API we actually map/unmap the buffers for every submission. This is
because the DMA API semantics require that buffers be mapped/unmapped
every time you use them. Previously we would basically only map each
buffer once (at allocation time) and only have to deal with cache
maintenance, so the overhead per submission was drastically lower.

If DMA_ATTR_SKIP_CPU_SYNC doesn't give us enough of an improvement, we
may want to restore explicit IOMMU usage, at least on anything prior to
Tegra124 where we're unlikely to ever use different IOMMU domains anyway
(because they are such a scarce resource).

Thierry
Attachment:
signature.asc

Description: PGP signature