Re: radeon ring 0 test failed on arm64

Robin Murphy <robin.murphy@xxxxxxx> · Thu, 17 Mar 2022 10:37:12 +0000

On 2022-03-17 00:14, Peter Geis wrote:
> Good Evening,
>
> I apologize for raising this email chain from the dead, but there have
> been some developments that have introduced even more questions.
> I've looped the Rockchip mailing list into this too, as this affects
> rk356x, and likely the upcoming rk3588 if [1] is to be believed.
>
> TLDR for those not familiar: It seems the rk356x series (and possibly
> the rk3588) were built without any outer coherent cache.
> This means (unless Rockchip wants to clarify here) devices such as the
> ITS and PCIe cannot utilize cache snooping.
> This is based on the results of the email chain [2].
>
> The new circumstances are as follows:
> The RPi CM4 Adventure Team, as I've taken to calling them, has been
> attempting to get a dGPU working with the very broken Broadcom
> controller in the RPi CM4.
> Recently they acquired a SoQuartz rk3566 module, which is pin
> compatible with the CM4, and have taken to trying it out as well.
>
> This is how I got involved.
> It seems they found a trivial way to force the Radeon R600 driver to
> use Non-Cached memory for everything.
> This single line change, combined with using memset_io instead of
> memset, allows the ring tests to pass and the card to probe successfully
> (minus the DMA limitations of the rk356x due to the 32 bit
> interconnect).
> Using this method, I discovered that we start having unaligned I/O
> memory access faults (bus errors) when running glmark2-drm (running
> glmark2 directly was impossible, as both X and Wayland crashed too
> early).
> I traced this to what I thought at the time was an unsafe memcpy
> in the mesa stack.
> Rewriting this function to force aligned writes solved the problem and
> allows glmark2-drm to run to completion.
> With some extensive debugging, I found about half a dozen memcpy
> functions in mesa that, if forced to be aligned, would allow Wayland to
> start, but with hilarious display corruption (see [3], [4]).
> The CM4 team is convinced this is an issue with memcpy in glibc, but
> I'm not convinced it's that simple.
>
> On my two hour drive in to work this morning, I got to thinking.
> If this was a memcpy fault, this would be universally broken on arm64,
> which is obviously not the case.
> So I started thinking, what is different here from systems known to work:
> 1. No IOMMU for the PCIe controller.
> 2. The Outer Cache Issue.
>
> Robin:
> My questions for you, since you're the smartest person I know about
> arm64 memory management:
> Could cache snooping permit unaligned accesses to IO to be safe?

No.

> Or
> Is it the lack of an IOMMU that's causing the alignment faults to become fatal?

No.

> Or
> Am I insane here?

No. (probably)

CPU access to PCIe has nothing to do with PCIe's access to memory. From 
what you've described, my guess is that a GPU BAR gets put in a 
non-prefetchable window, such that it ends up mapped as Device memory 
(whereas if it were prefetchable it would be Normal Non-Cacheable).

Robin.
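
To illustrate the "force aligned writes" workaround described in the quoted message, here is a minimal sketch in C. On arm64, an unaligned access to a Device mapping faults, while Normal Non-Cacheable memory tolerates unaligned accesses, which fits the bus errors reported above. The helper name copy_to_device_aligned is hypothetical and this is not the actual mesa change; it only shows one way to guarantee that every store to the destination is naturally aligned.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper (an assumption for illustration): copy n bytes to a
 * destination that may be mapped as Device memory, issuing only naturally
 * aligned stores to it. The source is ordinary memory and may be unaligned.
 */
static void copy_to_device_aligned(volatile void *dst, const void *src, size_t n)
{
    volatile uint8_t *d = dst;
    const uint8_t *s = src;

    /* Head: single-byte stores until the destination is 8-byte aligned. */
    while (n > 0 && ((uintptr_t)d & 7) != 0) {
        *d++ = *s++;
        n--;
    }

    /* Body: one aligned 64-bit store per iteration. */
    while (n >= 8) {
        uint64_t v;
        memcpy(&v, s, sizeof v);        /* safe load from unaligned source */
        *(volatile uint64_t *)d = v;    /* destination is 8-byte aligned here */
        d += 8;
        s += 8;
        n -= 8;
    }

    /* Tail: remaining bytes, again one byte at a time. */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
}
```

A generic memcpy is free to use overlapping or unaligned stores at the edges of a copy, which is harmless on Normal memory but not on a Device mapping; a loop like the one above trades speed for that guarantee.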