Re: [REGRESSION] rx7600 stopped working after "1cfb4d612127 drm/amdgpu: put MQDs in VRAM"

Christian König <christian.koenig@xxxxxxx> · Tue, 31 Oct 2023 14:02:17 +0100

Hi Alexey,

trying to answer some of the questions since Alex is currently on vacation.

On 30.10.23 at 17:01, Alexey Klimov wrote:
Hi Alex,

On Thu, 26 Oct 2023 at 19:53, Alex Deucher <alexdeucher@xxxxxxxxx> wrote:
On Thu, Oct 26, 2023 at 1:33 PM Alexey Klimov <alexey.klimov@xxxxxxxxxx> wrote:
#regzbot introduced: 1cfb4d612127
#regzbot title: rx7600 stopped working after "1cfb4d612127 drm/amdgpu: put MQDs in VRAM"

Hi all,

I've been playing with RX7600 and it was observed that amdgpu stopped working between kernel 6.2 and 6.5.
Then I narrowed it down to 6.4 <-> 6.5-rc1 and finally bisect pointed at 1cfb4d6121276a829aa94d0e32a7f5e1830ebc21
And I manually checked if it boots/works on the previous commit and the mentioned one.

I guess the log also reveals warning in error path. Please see below.

I didn't check any further. This is simple debian testing system with the following cmdline options:
root@avadebian:~# cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-6.6-rc7+ ignore_loglevel root=/dev/nvme1n1p2 ro nr_cpus=32

So far a simple revert (patch is below) returns things back to normal-ish: there are huge graphics artifacts on Xorg/X11 under 6.1 to upstream kernel. Wayland-based sway works great without issues. Not sure where I should report this.

Please let me know if I can help debugging, testing or provide some other logs regarding 1cfb4d612127? Any cmdline options to collect more info?
Please make sure you have this patch as well:
e602157ec089240861cd641ee2c7c64eeaec09bf ("drm/amdgpu: fix S3 issue if
MQD in VRAM")
Please open a ticket here so we can track this:
https://gitlab.freedesktop.org/drm/amd/-/issues/
The patch was there during testing and I will open a ticket there.

I think I see the problem.  Please see if attached patch 1 fixes the
issue.  If this fixes it, that would also explain the issues you are
seeing with Xorg.  It would appear there are limitations around MMIO
access on your platform and unfortunately most graphics APIs require
unaligned access to MMIO space with the CPU.  We can fix the kernel
side pretty easily, but userspace will be a problem.
Does it mean that we don't have unaligned access to PCIe MMIO space on
this Adlink Ampere AVA arm64 platform?

Yes, that is perfectly possible and makes that platform unusable for 
most gfx applications.

We had tons of reports for different ARM boards and HW generations and 
even looped in some ARM engineers.

Essentially if you want to run high level GFX stacks like Vulkan and 
OpenGL on a platform with AMD or NVIDIA hardware your platform needs to 
fulfill certain requirements:

1. Correctly implement the PCIe spec!

    We actually have tons of boards where people attach a PCIe root 
complex to the ARM CPU and expect that to work. The problem is that this 
isn't PCIe compliant!
    You actually need the ARM IP for PCIe for this to work correctly; 
without that, the root complex can't do system memory coherent 
transactions, for example.

2. Be able to run all types of memory accesses on PCIe BARs. For example 
some platforms can't do large reads and writes (vector operations) to 
PCIe BARs, but can do them to system memory.

    This is actually not a hardware requirement, but a requirement of 
the Vulkan and OpenGL stacks and the applications based on them.
    You can work around this by disallowing CPU access to PCIe BARs, 
but that either cripples performance or even results in applications not 
working at all.
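
To illustrate point 2, here is a minimal userspace sketch of what "fixing the kernel side" can look like: replacing a plain memcpy() into a BAR with a copy that only issues naturally aligned stores. On arm64, accesses to memory mapped with a Device attribute must be naturally aligned, while a generic memcpy() is free to use unaligned or vector accesses. The helper name copy_to_bar_aligned() is hypothetical and this is not the attached patch; in the kernel, memcpy_toio() exists for this purpose.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Copy n bytes to dst using only naturally aligned stores:
 * byte stores until dst is 4-byte aligned, then 32-bit stores,
 * then byte stores for the tail.
 * Hypothetical helper for illustration only.
 */
void copy_to_bar_aligned(volatile void *dst, const void *src, size_t n)
{
    volatile uint8_t *d = dst;
    const uint8_t *s = src;

    /* Head: single-byte stores are always naturally aligned. */
    while (n > 0 && ((uintptr_t)d & 3u) != 0) {
        *d++ = *s++;
        n--;
    }

    /* Body: dst is now 4-byte aligned, or nothing is left to copy. */
    assert(n == 0 || ((uintptr_t)d & 3u) == 0);
    while (n >= 4) {
        uint32_t word;

        /* src is ordinary system memory and may be unaligned. */
        memcpy(&word, s, sizeof(word));
        *(volatile uint32_t *)d = word;
        d += 4;
        s += 4;
        n -= 4;
    }

    /* Tail: remaining 0..3 bytes. */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
}
```

Userspace has no equivalent single choke point: every application and driver that maps a BAR and writes to it with ordinary loads and stores would need the same care, which is why the userspace side is the hard part.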

Do you know if it is related to the thing that PCIe BARs are mapped as
a device memory and not a normal memory? (and they should be mapped as
normal memory)

Depends on what you mean by this. If changing the mapping type 
allows unaligned and larger accesses, then yes, that would help.

We will upstream the patches to make at least the kernel side work as 
expected, but that's fixing only half of the problem.

Regards,
Christian.

[..]

Just removing the addition of the AMDGPU_GEM_DOMAIN_VRAM domain here
will revert the behavior.  Since this is an important optimization and
we aren't seeing any issues on x86, I'd prefer to just limit your arch
to GTT if we can't resolve it some other way.

Try patch 1 and if that doesn't work we can fall back to some variant
of patch 2.
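
The GTT fallback described above could be sketched roughly as follows. This is a hypothetical illustration, not patch 2: the function name mqd_domain_mask() and the cpu_bar_access_ok parameter are assumptions, and the two domain flags are redefined locally (with the values I believe the amdgpu UAPI header uses) so the snippet builds on its own.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed to match the amdgpu UAPI header; redefined for a standalone build. */
#define AMDGPU_GEM_DOMAIN_GTT  0x2u
#define AMDGPU_GEM_DOMAIN_VRAM 0x4u

/*
 * Pick the allowed placement domains for an MQD buffer.
 * cpu_bar_access_ok is a hypothetical predicate: true where the CPU can
 * do arbitrary accesses to the VRAM BAR (e.g. x86), false otherwise.
 */
uint32_t mqd_domain_mask(bool cpu_bar_access_ok)
{
    uint32_t domain = AMDGPU_GEM_DOMAIN_GTT;

    /* Only allow VRAM placement where CPU access to the BAR is safe. */
    if (cpu_bar_access_ok)
        domain |= AMDGPU_GEM_DOMAIN_VRAM;

    /* GTT always stays available as a placement. */
    assert(domain & AMDGPU_GEM_DOMAIN_GTT);
    return domain;
}
```

In a real driver the predicate would more likely be a compile-time architecture check than a runtime flag; the runtime form is used here only to keep the sketch easy to exercise.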
Patch 1 alone doesn't fix the issue. Patches 1 & 2 together do work
and amdgpu initializes. There are still issues with Xorg, while Wayland works okay.

Apart from that I observed "amdgpu: [gfxhub] page fault" one time:

[   12.432567] amdgpu 000d:03:00.0: [drm:jpeg_v4_0_hw_init [amdgpu]]
JPEG decode initialized successfully.
[   12.442516] amdgpu 000d:03:00.0: amdgpu: [gfxhub] page fault
(src_id:0 ring:72 vmid:0 pasid:0, for process  pid 0 thread  pid 0)
[   12.454080] amdgpu 000d:03:00.0: amdgpu:   in page starting at
address 0x00000000044b0000 from client 10
[   12.457317] usb 1-4.4: new high-speed USB device number 4 using xhci_hcd
[   12.463548] amdgpu 000d:03:00.0: amdgpu:
GCVM_L2_PROTECTION_FAULT_STATUS:0x00000890
[   12.463551] amdgpu 000d:03:00.0: amdgpu: Faulty UTCL2 client ID: CPF (0x4)
[   12.484914] amdgpu 000d:03:00.0: amdgpu: MORE_FAULTS: 0x0
[   12.490474] amdgpu 000d:03:00.0: amdgpu: WALKER_ERROR: 0x0
[   12.496121] amdgpu 000d:03:00.0: amdgpu: PERMISSION_FAULTS: 0x9
[   12.502202] amdgpu 000d:03:00.0: amdgpu: MAPPING_ERROR: 0x0
[   12.507934] amdgpu 000d:03:00.0: amdgpu: RW: 0x0
[   12.512716] amdgpu 000d:03:00.0: amdgpu: [gfxhub] page fault
(src_id:0 ring:221 vmid:0 pasid:0, for process  pid 0 thread  pid 0)
[   12.524355] amdgpu 000d:03:00.0: amdgpu:   in page starting at
address 0x00000000044b1000 from client 10
[   12.533821] amdgpu 000d:03:00.0: amdgpu:
GCVM_L2_PROTECTION_FAULT_STATUS:0x000009BB
[   12.541464] amdgpu 000d:03:00.0: amdgpu: Faulty UTCL2 client ID: CPF (0x4)
[   12.548499] amdgpu 000d:03:00.0: amdgpu: MORE_FAULTS: 0x1
[   12.554059] amdgpu 000d:03:00.0: amdgpu: WALKER_ERROR: 0x5
[   12.558700] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0
[amdgpu]] *ERROR* MES failed to response msg=0
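
For reading the fault lines above, a small decoder for the GCVM_L2_PROTECTION_FAULT_STATUS value may help. The bit layout used here is an assumption (not taken from this thread); it was chosen because it reproduces the field values that the driver printed for the first fault (0x00000890) and the two fields printed for the second (0x000009BB). The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Decoded fields; bit positions are assumptions, see the note above. */
struct fault_status {
    uint32_t more_faults;       /* bit 0      */
    uint32_t walker_error;      /* bits 1-3   */
    uint32_t permission_faults; /* bits 4-7   */
    uint32_t mapping_error;     /* bit 8      */
    uint32_t cid;               /* bits 9-17, UTCL2 client ID */
    uint32_t rw;                /* bit 18     */
};

/* Extract `width` bits of v starting at bit `shift`. */
uint32_t field(uint32_t v, unsigned shift, unsigned width)
{
    return (v >> shift) & ((1u << width) - 1u);
}

struct fault_status decode_fault_status(uint32_t status)
{
    struct fault_status f;

    f.more_faults       = field(status, 0, 1);
    f.walker_error      = field(status, 1, 3);
    f.permission_faults = field(status, 4, 4);
    f.mapping_error     = field(status, 8, 1);
    f.cid               = field(status, 9, 9);
    f.rw                = field(status, 18, 1);
    return f;
}

/* Cross-check the assumed layout against the first fault in the log. */
void check_against_log(void)
{
    struct fault_status f = decode_fault_status(0x00000890u);

    assert(f.more_faults == 0x0);
    assert(f.walker_error == 0x0);
    assert(f.permission_faults == 0x9);
    assert(f.mapping_error == 0x0);
    assert(f.cid == 0x4); /* printed as "CPF (0x4)" */
    assert(f.rw == 0x0);
}
```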

I am not sure if it is related to the clean 6.6 kernel or to the additional patches 1 & 2.
I did around 20 boots of clean 6.6-rc7 version and didn't observe this
page fault.
During 20 reboots of  6.6-rc7 + your patches 1 and 2 -- this page
fault was observed one time only.
I can't say how reproducible this is. The log is attached.

Let me know if you want me to test/Ack patch 2 if you are going to send it.

Thanks,
Alexey