Re: radeon ring 0 test failed on arm64

Christian König <christian.koenig@xxxxxxx> · Thu, 17 Mar 2022 13:51:07 +0100

On 17.03.22 at 13:26, Peter Geis wrote:
On Thu, Mar 17, 2022 at 6:37 AM Robin Murphy <robin.murphy@xxxxxxx> wrote:
On 2022-03-17 00:14, Peter Geis wrote:
Good Evening,

I apologize for raising this email chain from the dead, but there have
been some developments that have introduced even more questions.
I've looped the Rockchip mailing list into this too, as this affects
rk356x, and likely the upcoming rk3588 if [1] is to be believed.

TLDR for those not familiar: It seems the rk356x series (and possibly
the rk3588) were built without any outer coherent cache.
This means (unless Rockchip wants to clarify here) devices such as the
ITS and PCIe cannot utilize cache snooping.
This is based on the results of the email chain [2].

The new circumstances are as follows:
The RPi CM4 Adventure Team, as I've taken to calling them, has been
attempting to get a dGPU working with the very broken Broadcom
controller in the RPi CM4.
Recently they acquired a SoQuartz rk3566 module which is pin
compatible with the CM4, and have taken to trying it out as well.

This is how I got involved.
It seems they found a trivial way to force the Radeon R600 driver to
use Non-Cached memory for everything.
This single-line change, combined with using memset_io instead of
memset, allows the ring tests to pass and the card to probe successfully
(minus the DMA limitations of the rk356x due to the 32 bit
interconnect).
Using this method, I discovered that we start hitting unaligned I/O
memory access faults (bus errors) when running glmark2-drm (running
glmark2 directly was impossible, as both X and Wayland crashed too
early).
I traced this to using what I thought at the time was an unsafe memcpy
in the mesa stack.
Rewriting this function to force aligned writes solved the problem and
allows glmark2-drm to run to completion.
With some extensive debugging, I found about half a dozen memcpy
functions in mesa that if forced to be aligned would allow Wayland to
start, but with hilarious display corruption (see [3], [4]).
The CM4 team is convinced this is an issue with memcpy in glibc, but
I'm not convinced it's that simple.
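
For illustration, "forcing aligned writes" as described above could look
roughly like the following. This is a minimal sketch, not the actual mesa
change: the name copy_to_io_aligned is hypothetical, and it assumes the
destination is an I/O mapping where only naturally aligned stores are safe
while the source is ordinary memory.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper (not from mesa or glibc): copy n bytes to dst using
 * only naturally aligned stores. Single-byte stores are always aligned, so
 * they cover the unaligned head and tail; the bulk uses 32-bit stores.
 */
static void copy_to_io_aligned(volatile void *dst, const void *src, size_t n)
{
    volatile uint8_t *d = dst;
    const uint8_t *s = src;

    /* Head: byte stores until the destination is 4-byte aligned. */
    while (n > 0 && ((uintptr_t)d & 3u) != 0) {
        *d++ = *s++;
        n--;
    }

    /* Bulk: aligned 32-bit stores. */
    while (n >= 4) {
        uint32_t w;

        /* The source is assumed to be normal memory and may be unaligned. */
        memcpy(&w, s, sizeof(w));
        *(volatile uint32_t *)d = w;
        d += 4;
        s += 4;
        n -= 4;
    }

    /* Tail: remaining bytes, one at a time. */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
}
```

The volatile qualifier keeps the compiler from merging the stores back into
wider or unaligned accesses, which a plain memcpy is free to use.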

On my two hour drive in to work this morning, I got to thinking.
If this were a memcpy fault, it would be universally broken on arm64,
which is obviously not the case.
So I started thinking, what is different here from systems known to work:
1. No IOMMU for the PCIe controller.
2. The Outer Cache Issue.

Robin:
My questions for you, since you're the smartest person I know about
arm64 memory management:
Could cache snooping permit unaligned accesses to IO to be safe?
No.

Or
Is it the lack of an IOMMU that's causing the alignment faults to become fatal?
No.

Or
Am I insane here?
No. (probably)

CPU access to PCIe has nothing to do with PCIe's access to memory. From
what you've described, my guess is that a GPU BAR gets put in a
non-prefetchable window, such that it ends up mapped as Device memory
(whereas if it were prefetchable it would be Normal Non-Cacheable).
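
Whether a BAR may be placed in a prefetchable window at all is advertised
by the device in the low bits of the BAR register. A minimal sketch of
decoding those bits, assuming a raw 32-bit BAR value as read from config
space (the macro and function names are illustrative, not kernel APIs):

```c
#include <stdbool.h>
#include <stdint.h>

/* Low bits of a PCI Base Address Register. */
#define BAR_IO_SPACE      0x1u  /* bit 0: 1 = I/O space, 0 = memory space */
#define BAR_MEM_TYPE_MASK 0x6u  /* bits 2:1: 00 = 32-bit, 10 = 64-bit */
#define BAR_MEM_TYPE_64   0x4u
#define BAR_MEM_PREFETCH  0x8u  /* bit 3: 1 = prefetchable */

/* True if the BAR decodes memory space rather than I/O space. */
static bool bar_is_memory(uint32_t bar)
{
    return (bar & BAR_IO_SPACE) == 0;
}

/* True if the BAR is a memory BAR that the device marks prefetchable. */
static bool bar_is_prefetchable(uint32_t bar)
{
    return bar_is_memory(bar) && (bar & BAR_MEM_PREFETCH) != 0;
}

/* True if the BAR is a 64-bit memory BAR (it then spans two registers). */
static bool bar_is_64bit(uint32_t bar)
{
    return bar_is_memory(bar) && (bar & BAR_MEM_TYPE_MASK) == BAR_MEM_TYPE_64;
}
```

The same information is shown per region by lspci -v, e.g. as
"(64-bit, prefetchable)" or "(32-bit, non-prefetchable)".
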
Okay, this is perfect and I think you just put me on the right track
for identifying the exact issue. Thanks!

I've sliced up the non-prefetchable window and given it a prefetchable window.
The 256MB BAR now resides in that window.
However I'm still getting bus errors, so it seems the prefetch isn't
actually happening.
The difference is that the GPU now realizes that an error has happened
and initiates recovery, whereas before it seemed to be clueless.
If I understand everything correctly, that's because before, the bus
error was raised by the CPU due to the memory flag, whereas now it's
actually the bus raising the alarm.

Mhm, that's really interesting.

The BIF (bus interface) should be able to handle all power-of-two access
sizes between 8 bits and 128 bits on this hardware generation, IIRC (but
the upper limit could also be 64 bits or 256 bits, I need to check the hw
docs as well).

So once the request ended up at the GPU it should be able to handle it. 
Maybe a mis-configured bridge in between?

My next question: is this something the driver should set and isn't,
or is it just because of the broken cache coherency?

As Robin noted as well, we have two different issues here:

1. Cache coherency of system memory.
2. Unaligned accesses on IO memory.

The latter can actually be avoided if we absolutely have to. E.g. for
bringup we test the ASICs alone without any DRAM attached. That is the
so-called ZFB (zero frame buffer) mode of the driver.

I don't think we ever made the necessary patches for that public, but in 
theory it is possible.

Only the first item is not cleanly solvable, as far as I understand it.

Regards,
Christian.

Robin.
Thanks again!
Peter