Re: radeon ring 0 test failed on arm64

Christian König <christian.koenig@xxxxxxx> · Thu, 17 Mar 2022 10:14:59 +0100

Hi Peter,

On 17.03.22 at 01:14, Peter Geis wrote:
Good Evening,

I apologize for raising this email chain from the dead, but there have
been some developments that have introduced even more questions.
I've looped the Rockchip mailing list into this too, as this affects
rk356x, and likely the upcoming rk3588 if [1] is to be believed.

TLDR for those not familiar: It seems the rk356x series (and possibly
the rk3588) were built without any outer coherent cache.
This means (unless Rockchip wants to clarify here) devices such as the
ITS and PCIe cannot utilize cache snooping.

Well, as far as I know, that is a clear violation of the PCIe specification.

Coherent access to system memory is simply a must-have.

This is based on the results of the email chain [2].

The new circumstances are as follows:
The RPi CM4 Adventure Team, as I've taken to calling them, has been
attempting to get a dGPU working with the very broken Broadcom
controller in the RPi CM4.
Recently they acquired a SoQuartz rk3566 module which is pin
compatible with the CM4, and have taken to trying it out as well.

This is how I got involved.
It seems they found a trivial way to force the Radeon R600 driver to
use Non-Cached memory for everything.

Yeah, you basically just force it into AGP mode :)

There is just absolutely no guarantee that this works reliably.

This single line change, combined with using memset_io instead of
memset, allows the ring tests to pass and the card probes successfully
(minus the DMA limitations of the rk356x due to the 32 bit
interconnect).
I discovered using this method that we start having unaligned io
memory access faults (bus errors) when running glmark2-drm (running
glmark2 directly was impossible, as both X and Wayland crashed too
early).
I traced this to using what I thought at the time was an unsafe memcpy
in the mesa stack.
Rewriting this function to force aligned writes solved the problem and
allows glmark2-drm to run to completion.
With some extensive debugging, I found about half a dozen memcpy
functions in mesa that if forced to be aligned would allow Wayland to
start, but with hilarious display corruption (see [3], [4]).
The CM4 team is convinced this is an issue with memcpy in glibc, but
I'm not convinced it's that simple.

Yes exactly that.

Both OpenGL and Vulkan allow the application to mmap() device memory and 
do any memory access they want with that.

This means that changing memcpy is a futile effort: it's still
possible for the application to make an unaligned memory access, and that
is perfectly valid.
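
To make the aligned-write workaround discussed here concrete, below is a minimal user-space sketch in C. It assumes the destination tolerates only naturally aligned 32-bit accesses. The name copy_aligned32() is hypothetical; this is not the function that was changed in Mesa or glibc. A helper like this cannot cover accesses an application makes on its own through mmap().

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Copy n bytes from src (ordinary memory) to dst, touching dst only with
 * naturally aligned 32-bit loads and stores.
 *
 * Assumption: every aligned 32-bit word that overlaps [dst, dst + n) is
 * safe to read and write in full, as it would be for a page-granular
 * mapping. copy_aligned32() is a hypothetical helper for illustration.
 */
void copy_aligned32(void *dst, const void *src, size_t n)
{
    const unsigned char *s = src;
    uintptr_t addr = (uintptr_t)dst;

    while (n > 0) {
        /* Round the destination address down to its containing word. */
        volatile uint32_t *word = (volatile uint32_t *)(addr & ~(uintptr_t)3);
        size_t offset = (size_t)(addr & 3); /* byte position in the word */
        size_t chunk = 4 - offset;          /* bytes left in this word */
        uint32_t value = 0;

        if (chunk > n)
            chunk = n;

        /* Partial word: load it first so neighbouring bytes survive. */
        if (chunk < 4)
            value = *word;

        /* Patch the new bytes into the local copy, then store it whole. */
        memcpy((unsigned char *)&value + offset, s, chunk);
        *word = value;

        s += chunk;
        addr += chunk;
        n -= chunk;
    }
}
```

The read-modify-write on partial words is a design choice of this sketch: it avoids byte-sized stores entirely, at the cost of reading device memory at the edges of the copy.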

On my two hour drive in to work this morning, I got to thinking.
If this was a memcpy fault, this would be universally broken on arm64,
which is obviously not the case.
So I started thinking, what is different here compared with systems known to work:
1. No IOMMU for the PCIe controller.
2. The Outer Cache Issue.

Oh, very good point. I would be interested in the answer to that as well.

Regards,
Christian.

Robin:
My questions for you, since you're the smartest person I know about
arm64 memory management:
Could cache snooping permit unaligned accesses to IO to be safe?
Or
Is it the lack of an IOMMU that's causing the alignment faults to become fatal?
Or
Am I insane here?

Rockchip:
Please update on the status for the Outer Cache errata for ITS services.
Please provide an answer to the errata of the PCIe controller, in
regard to cache snooping and buffering, for both the rk356x and the
upcoming rk3588.

[1] https://github.com/JeffyCN/mirrors/commit/0b985f29304dcb9d644174edacb67298e8049d4f
[2] https://lore.kernel.org/lkml/871rbdt4tu.wl-maz@kernel.org/T/
[3] https://cdn.discordapp.com/attachments/926487797844541510/953414755970850816/unknown.png
[4] https://cdn.discordapp.com/attachments/926487797844541510/953424952042852422/unknown.png

Thank you everyone for your time.

Very Respectfully,
Peter Geis

On Wed, May 26, 2021 at 7:21 AM Christian König
<christian.koenig@xxxxxxx> wrote:
Hi Robin,

On 26.05.21 at 12:59, Robin Murphy wrote:
On 2021-05-26 10:42, Christian König wrote:
Hi Robin,

On 25.05.21 at 22:09, Robin Murphy wrote:
On 2021-05-25 14:05, Alex Deucher wrote:
On Tue, May 25, 2021 at 8:56 AM Peter Geis <pgwipeout@xxxxxxxxx>
wrote:
On Tue, May 25, 2021 at 8:47 AM Alex Deucher
<alexdeucher@xxxxxxxxx> wrote:
On Tue, May 25, 2021 at 8:42 AM Peter Geis <pgwipeout@xxxxxxxxx>
wrote:
Good Evening,

I am stress testing the pcie controller on the rk3566-quartz64
prototype SBC.
This device has 1GB available at <0x3 0x00000000> for the PCIe
controller, which makes a dGPU theoretically possible.
While attempting to light off an HD7570 card I manage to get a modeset
console, but the ring0 test fails and disables acceleration.

Note, we do not have UEFI, so all PCIe setup is from the Linux
kernel.
Any insight you can provide would be much appreciated.
Does your platform support PCIe cache coherency with the CPU?  I.e.,
does the CPU allow cache snoops from PCIe devices?  That is required
for the driver to operate.
Ah, most likely not.
This issue has come up already as the GIC isn't permitted to snoop on
the CPUs, so I doubt the PCIe controller can either.

Is there no way to work around this or is it dead in the water?
It's required by the pcie spec.  You could potentially work around it
if you can allocate uncached memory for DMA, but I don't think that is
possible currently.  Ideally we'd figure out some way to detect if a
particular platform supports cache snooping or not as well.
There's device_get_dma_attr(), although I don't think it will work
currently for PCI devices without an OF or ACPI node - we could
perhaps do with a PCI-specific wrapper which can walk up and defer
to the host bridge's firmware description as necessary.

The common DMA ops *do* correctly keep track of per-device coherency
internally, but drivers aren't supposed to be poking at that
information directly.
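
The "walk up and defer to the host bridge" idea can be pictured with a small user-space model in C. This is a toy sketch, not kernel code: struct toy_device, enum dma_attr and toy_get_dma_attr() are hypothetical stand-ins, with the three enum values loosely modelled on the outcomes device_get_dma_attr() can report. It assumes each node either carries a firmware description of its coherency or carries none.

```c
#include <stddef.h>

/* Toy counterpart of the coherency answers firmware can give for a node. */
enum dma_attr {
    DMA_NOT_DESCRIBED, /* no OF/ACPI description for this node */
    DMA_NON_COHERENT,
    DMA_COHERENT
};

/* Hypothetical stand-in for a device in the bus hierarchy. */
struct toy_device {
    const struct toy_device *parent; /* NULL at the root of the hierarchy */
    enum dma_attr fw_attr;           /* what firmware says about this node */
};

/*
 * Return the attribute of the nearest node, starting at dev itself and
 * moving towards the root, that firmware actually describes. A PCI
 * endpoint without its own description thereby inherits the answer
 * given for its host bridge.
 */
enum dma_attr toy_get_dma_attr(const struct toy_device *dev)
{
    for (; dev != NULL; dev = dev->parent) {
        if (dev->fw_attr != DMA_NOT_DESCRIBED)
            return dev->fw_attr;
    }
    return DMA_NOT_DESCRIBED;
}
```

With a host bridge marked DMA_NON_COHERENT and an undescribed root port and endpoint below it, toy_get_dma_attr() on the endpoint reports DMA_NON_COHERENT.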
That sounds like you underestimate the problem. ARM has unfortunately
made the coherency for PCI an optional IP.
Sorry to be that guy, but I'm involved a lot internally with our
system IP and interconnect, and I probably understand the situation
better than 99% of the community ;)
I need to apologize, I didn't realize who was answering :)

It just sounded to me that you wanted to suggest to the end user that
this is fixable in software and I really wanted to avoid even more
customers coming around asking how to do this.

For the record, the SBSA specification (the closest thing we have to a
"system architecture") does require that PCIe is integrated in an
I/O-coherent manner, but we don't have any control over what people do
in embedded applications (note that we don't make PCIe IP at all, and
there is plenty of 3rd-party interconnect IP).
So basically it is not the fault of the ARM IP core, but people are just
stitching together PCIe interconnect IP with a core it is not
supposed to be used with.

Do I get that correctly? That's an interesting puzzle piece in the picture.

So we are talking about a hardware limitation which potentially can't
be fixed without replacing the hardware.
You expressed interest in "some way to detect if a particular platform
supports cache snooping or not", by which I assumed you meant a
software method for the amdgpu/radeon drivers to call, rather than,
say, a website that driver maintainers can look up SoC names on. I'm
saying that that API already exists (just may need a bit more work).
Note that it is emphatically not a platform-level thing since
coherency can and does vary per device within a system.
Well, I think this is not something an individual driver should mess
with. What the driver should do is just express that it needs coherent
access to all of system memory and if that is not possible fail to load
with a warning why it is not possible.
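
A rough user-space sketch in C of that policy, as a toy model only: toy_gpu_probe() is a hypothetical function, the boolean stands in for whatever the platform reports about coherency, and the choice of -ENODEV as the error code is an assumption for illustration.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * Toy probe routine: the driver states its requirement (coherent access
 * to system memory) and refuses to load, with an explanation, when the
 * platform cannot provide it. Returns 0 on success or a negative errno.
 */
int toy_gpu_probe(bool dma_coherent)
{
    if (!dma_coherent) {
        fprintf(stderr,
                "toy_gpu: device cannot access system memory coherently; "
                "refusing to load\n");
        return -ENODEV;
    }

    /* Normal initialisation would continue from here. */
    return 0;
}
```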

I wasn't suggesting that Linux could somehow make coherency magically
work when the signals don't physically exist in the interconnect - I
was assuming you'd merely want to do something like throw a big
warning and taint the kernel to help triage bug reports. Some drivers
like ahci_qoriq and panfrost simply need to know so they can program
their device to emit the appropriate memory attributes either way, and
rely on the DMA API to hide the rest of the difference, but if you
want to treat non-coherent use as unsupported because it would require
too invasive changes that's fine by me.
Yes, exactly that please. I mean, I'm not sure how panfrost is doing it, but
at least the Vulkan userspace API specification requires devices to have
coherent access to system memory.

So even if I wanted to do this, it is simply not possible because the
application doesn't tell the driver which memory is accessed by the
device and which by the CPU.

Christian.

Robin.