Re: radeon ring 0 test failed on arm64

Robin Murphy <robin.murphy@xxxxxxx> · Thu, 17 Mar 2022 13:17:39 +0000

On 2022-03-17 12:26, Peter Geis wrote:
On Thu, Mar 17, 2022 at 6:37 AM Robin Murphy <robin.murphy@xxxxxxx> wrote:

On 2022-03-17 00:14, Peter Geis wrote:
Good Evening,

I apologize for raising this email chain from the dead, but there have
been some developments that have introduced even more questions.
I've looped the Rockchip mailing list into this too, as this affects
rk356x, and likely the upcoming rk3588 if [1] is to be believed.

TLDR for those not familiar: It seems the rk356x series (and possibly
the rk3588) were built without any outer coherent cache.
This means (unless Rockchip wants to clarify here) devices such as the
ITS and PCIe cannot utilize cache snooping.
This is based on the results of the email chain [2].

The new circumstances are as follows:
The RPi CM4 Adventure Team, as I've taken to calling them, has been
attempting to get a dGPU working with the very broken Broadcom
controller in the RPi CM4.
Recently they acquired a SoQuartz rk3566 module which is pin
compatible with the CM4, and have taken to trying it out as well.

This is how I got involved.
It seems they found a trivial way to force the Radeon R600 driver to
use Non-Cached memory for everything.
This single line change, combined with using memset_io instead of
memset, allows the ring tests to pass and the card probes successfully
(minus the DMA limitations of the rk356x due to the 32 bit
interconnect).
I discovered using this method that we start having unaligned I/O
memory access faults (bus errors) when running glmark2-drm (running
glmark2 directly was impossible, as both X and Wayland crashed too
early).
I traced this to using what I thought at the time was an unsafe memcpy
in the mesa stack.
Rewriting this function to force aligned writes solved the problem and
allows glmark2-drm to run to completion.
With some extensive debugging, I found about half a dozen memcpy
functions in mesa that if forced to be aligned would allow Wayland to
start, but with hilarious display corruption (see [3], [4]).
The CM4 team is convinced this is an issue with memcpy in glibc, but
I'm not convinced it's that simple.
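
For illustration, a minimal sketch of what "forcing aligned writes" in a copy routine can look like. This is not the actual mesa change; the function name copy_to_io_aligned is hypothetical, and it assumes the destination only requires naturally aligned accesses of up to 32 bits while the source is ordinary RAM.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy n bytes from ordinary RAM (src) to a
 * destination that only tolerates naturally aligned stores (dst). */
static inline void copy_to_io_aligned(volatile void *dst, const void *src,
                                      size_t n)
{
    volatile uint8_t *d = dst;
    const uint8_t *s = src;

    /* Head: byte stores (always aligned) until dst is 4-byte aligned. */
    while (n > 0 && ((uintptr_t)d & 3u) != 0) {
        *d++ = *s++;
        n--;
    }

    /* Invariant: either nothing is left or dst is now 4-byte aligned. */
    assert(n == 0 || ((uintptr_t)d & 3u) == 0);

    /* Body: aligned 32-bit stores. The source may still be misaligned,
     * so each word is loaded through memcpy into a local first. */
    while (n >= 4) {
        uint32_t w;
        memcpy(&w, s, sizeof w);
        *(volatile uint32_t *)d = w;
        d += 4;
        s += 4;
        n -= 4;
    }

    /* Tail: remaining bytes, again one byte at a time. */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
}
```

The volatile qualifier keeps the compiler from merging or widening the stores. Whether byte-sized accesses are acceptable to a given PCIe controller is a separate question that this sketch does not answer.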

On my two hour drive in to work this morning, I got to thinking.
If this was a memcpy fault, this would be universally broken on arm64
which is obviously not the case.
So I started thinking, what is different here compared with systems known to work:
1. No IOMMU for the PCIe controller.
2. The Outer Cache Issue.

Robin:
My questions for you, since you're the smartest person I know about
arm64 memory management:
Could cache snooping permit unaligned accesses to IO to be safe?

No.

Or
Is it the lack of an IOMMU that's causing the alignment faults to become fatal?

No.

Or
Am I insane here?

No. (probably)

CPU access to PCIe has nothing to do with PCIe's access to memory. From
what you've described, my guess is that a GPU BAR gets put in a
non-prefetchable window, such that it ends up mapped as Device memory
(whereas if it were prefetchable it would be Normal Non-Cacheable).

Okay, this is perfect and I think you just put me on the right track
for identifying the exact issue. Thanks!

I've sliced up the non-prefetchable window and given it a prefetchable window.
The 256MB BAR now resides in that window.
However I'm still getting bus errors, so it seems the prefetch isn't
actually happening.

Note that "prefetchable" really just means "no side-effects on reads", 
i.e. we can map it with a Normal memory type that technically *allows* 
the CPU to make speculative accesses because they will not be harmful, 
but that's not to say the CPU will do so. Just that if it did, you 
wouldn't notice anyway.
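
As a rough illustration, the "prefetchable" property is advertised by bit 3 of a PCI memory BAR. The sketch below decodes the low flag bits of a raw 32-bit BAR value. The helper names are hypothetical, and bar_likely_arm64_mapping only mirrors the rule of thumb given earlier in this thread (non-prefetchable window -> Device, prefetchable -> Normal Non-Cacheable); the attributes actually used depend on how the kernel or driver maps the region.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BAR_SPACE_IO      0x1u /* bit 0: 1 = I/O space, 0 = memory space */
#define BAR_MEM_TYPE_MASK 0x6u /* bits 2:1: 00 = 32-bit, 10 = 64-bit */
#define BAR_MEM_TYPE_64   0x4u
#define BAR_MEM_PREFETCH  0x8u /* bit 3: reads have no side effects */

static inline bool bar_is_memory(uint32_t bar)
{
    return (bar & BAR_SPACE_IO) == 0;
}

static inline bool bar_is_prefetchable(uint32_t bar)
{
    return bar_is_memory(bar) && (bar & BAR_MEM_PREFETCH) != 0;
}

static inline bool bar_is_64bit(uint32_t bar)
{
    return bar_is_memory(bar) &&
           (bar & BAR_MEM_TYPE_MASK) == BAR_MEM_TYPE_64;
}

/* Hypothetical helper restating the rule of thumb from the thread;
 * it is not a description of what any particular driver does. */
static inline const char *bar_likely_arm64_mapping(uint32_t bar)
{
    assert(bar_is_memory(bar)); /* I/O BARs are out of scope here */
    return bar_is_prefetchable(bar) ? "Normal Non-Cacheable" : "Device";
}
```

Note that the bit only says speculative or merged reads are harmless; it makes no promise about unaligned accesses.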

It's entirely possible that the PCIe IP itself doesn't like unaligned 
accesses, so changing the memory type just moves you from an alignment 
fault to an external abort.

The difference is that the GPU now realizes that an error has happened
and initiates recovery, whereas before it seemed to be clueless.
If I understand everything correctly, that's because before, the bus
error was raised by the CPU due to the memory flag, whereas now
it's actually the bus raising the alarm.

My next question: is this something the driver should set and isn't,
or is it just because of the broken cache coherency?

The general rule for userspace mmap()ing PCIe-attached memory and 
handing it off to glibc or anyone else who might assume it's regular 
system RAM is "don't do that". If it's not access size or alignment that 
falls over, it could be atomic operations, MTE tags, or any other 
new-fangled memory innovation. For the ultimate dream of just plugging 
in a card full of RAM, you either need to look back to ISA or forward to 
CXL ;)

Robin.