Re: [REGRESSION] rx7600 stopped working after "1cfb4d612127 drm/amdgpu: put MQDs in VRAM"

On Thu, Oct 26, 2023 at 1:33 PM Alexey Klimov <alexey.klimov@xxxxxxxxxx> wrote:
>
> #regzbot introduced: 1cfb4d612127
> #regzbot title: rx7600 stopped working after "1cfb4d612127 drm/amdgpu: put MQDs in VRAM"
>
> Hi all,
>
> I've been playing with an RX7600 and observed that amdgpu stopped working between kernels 6.2 and 6.5.
> Then I narrowed it down to 6.4 <-> 6.5-rc1 and finally a bisect pointed at 1cfb4d6121276a829aa94d0e32a7f5e1830ebc21.
> I also manually checked that it boots/works on the previous commit and fails on the mentioned one.
>
> I guess the log also reveals a warning in the error path. Please see below.
>
> I didn't check any further. This is a simple Debian testing system with the following cmdline options:
> root@avadebian:~# cat /proc/cmdline
> BOOT_IMAGE=/boot/vmlinuz-6.6-rc7+ ignore_loglevel root=/dev/nvme1n1p2 ro nr_cpus=32
>
> So far a simple revert (patch below) returns things back to normal-ish: there are huge graphics artifacts on Xorg/X11 from 6.1 through the upstream kernel. Wayland-based sway works great without issues. Not sure where I should report this.
>
> Please let me know if I can help with debugging or testing, or provide some other logs regarding 1cfb4d612127. Any cmdline options to collect more info?

Please make sure you have this patch as well:
e602157ec089240861cd641ee2c7c64eeaec09bf ("drm/amdgpu: fix S3 issue if
MQD in VRAM")

Please open a ticket here so we can track this:
https://gitlab.freedesktop.org/drm/amd/-/issues/

I think I see the problem.  Please see if the attached patch 1 fixes the
issue.  If it does, that would also explain the issues you are seeing
with Xorg.  It would appear there are limitations around MMIO access on
your platform, and unfortunately most graphics APIs require unaligned
CPU access to MMIO space.  We can fix the kernel side pretty easily, but
userspace will be a problem.
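
To illustrate what patch 1 changes (a rough sketch only; the __iomem
annotation, pointer names, and MQD struct used for the size here are
illustrative, not lifted from the driver): once the MQD is allocated in
VRAM, its CPU mapping is effectively IO memory behind the PCI BAR, so
the save/restore paths need the IO-safe copy helpers rather than plain
memcpy():

	/* MQD lives in VRAM (IO memory); the backup lives in system memory */
	void __iomem *mqd = ring->mqd_ptr;                /* device side */
	void *backup = adev->gfx.me.mqd_backup[mqd_idx];  /* system side */
	size_t mqd_size = sizeof(struct v11_gfx_mqd);

	/* save: device -> system memory */
	memcpy_fromio(backup, mqd, mqd_size);

	/* restore: system memory -> device */
	memcpy_toio(mqd, backup, mqd_size);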

More below.

>
> Thanks,
> Alexey
>
>
>
> From 214372d5cedcf8757dd80d5f4d058377a3d92c52 Mon Sep 17 00:00:00 2001
> From: Alexey Klimov <alexey.klimov@xxxxxxxxxx>
> Date: Thu, 26 Oct 2023 17:01:02 +0100
> Subject: [PATCH] drm/amdgpu: Revert "drm/amdgpu: put MQDs in VRAM"
>
> This reverts commit 1cfb4d6121276a829aa94d0e32a7f5e1830ebc21.
>
> The amdgpu driver fails during initialisation with an RX7600/gfx11 on the
> ADLINK Ampere Altra Developer Platform (AVA developer platform)
> with the mentioned commit:
>
> [   12.559893] [drm] Display Core v3.2.247 initialized on DCN 3.2.1
> [   12.565906] [drm] DP-HDMI FRL PCON supported
> [   12.572192] [drm] DMUB hardware initialized: version=0x07000C00
> [   12.582541] snd_hda_intel 000d:03:00.1: bound 000d:03:00.0 (ops amdgpu_dm_audio_component_bind_ops [amdgpu])
> [   12.625357] [drm] kiq ring mec 3 pipe 1 q 0
> [   12.857087] amdgpu 000d:03:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring comp_1.0.0 test failed (-110)
> [   12.867930] [drm:amdgpu_device_init [amdgpu]] *ERROR* hw_init of IP block <gfx_v11_0> failed -110
> [   12.877289] amdgpu 000d:03:00.0: amdgpu: amdgpu_device_ip_init failed
> [   12.883723] amdgpu 000d:03:00.0: amdgpu: Fatal error during GPU init
> [   12.890070] amdgpu 000d:03:00.0: amdgpu: amdgpu: finishing device.
> [   12.896586] [drm] DSC precompute is not needed.
> [   12.901142] ------------[ cut here ]------------
> [   12.905747] WARNING: CPU: 0 PID: 212 at drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c:615 amdgpu_irq_put+0xa8/0xc8 [amdgpu]
> [   12.916841] Modules linked in: hid_generic(E) usbhid(E) hid(E) qrtr(E) iptable_nat(E) amdgpu(E+) nf_nat(E) nf_conntrack(E) snd_hda_codec_hdmi(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) libcrc32c(E) iptable_mangle(E) iptable_filter(E) amdxcp(E) drm_exec(E) gpu_sched(E) snd_hda_intel(E) aes_ce_blk(E) snd_intel_dspcfg(E) drm_buddy(E) aes_ce_cipher(E) snd_hda_codec(E) xhci_pci(E) video(E) crct10dif_ce(E) polyval_ce(E) snd_hda_core(E) xhci_hcd(E) drm_suballoc_helper(E) snd_hwdep(E) polyval_generic(E) drm_ttm_helper(E) snd_pcm(E) ghash_ce(E) ast(E) ttm(E) gf128mul(E) snd_timer(E) ipmi_ssif(E) drm_display_helper(E) drm_shmem_helper(E) sha2_ce(E) sha256_arm64(E) ipmi_devintf(E) usbcore(E) snd(E) drm_kms_helper(E) igb(E) sha1_ce(E) sbsa_gwdt(E) ipmi_msghandler(E) arm_spe_pmu(E) soundcore(E) usb_common(E) i2c_algo_bit(E) cppc_cpufreq(E) i2c_designware_platform(E) arm_dsu_pmu(E) arm_cmn(E) xgene_hwmon(E) i2c_designware_core(E) evdev(E) binfmt_misc(E) loop(E) fuse(E) efi_pstore(E) drm(E) dm_mod(E) dax(E) configfs(E) efivarfs(E)
> [   12.916916]  ip_tables(E) x_tables(E) autofs4(E)
> [   13.011111] CPU: 0 PID: 212 Comm: kworker/0:2 Tainted: G            E      6.6.0-rc7+ #23
> [   13.019277] Hardware name: ADLINK Ampere Altra Developer Platform/Ampere Altra Developer Platform, BIOS TianoCore 2.04.100.10 (SYS: 2.06.20220308) 04/18/2
> [   13.033084] Workqueue: events work_for_cpu_fn
> [   13.037434] pstate: 20400009 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [   13.044384] pc : amdgpu_irq_put+0xa8/0xc8 [amdgpu]
> [   13.049652] lr : amdgpu_fence_driver_hw_fini+0x118/0x160 [amdgpu]
> [   13.056220] sp : ffff80008012bc10
> [   13.059522] x29: ffff80008012bc20 x28: 0000000000000000 x27: 0000000000000000
> [   13.066647] x26: 0000000000000000 x25: ffff07ff98580010 x24: ffff07ff98580000
> [   13.073772] x23: ffff07ff985a78f0 x22: ffff07ff98580010 x21: ffff07ff985904c8
> [   13.080896] x20: ffff07ff985900e8 x19: ffff07ff98598580 x18: 0000000000000006
> [   13.088020] x17: 0000000000000020 x16: ffffbb510d0d7140 x15: fffffffffffffefb
> [   13.095145] x14: 0000000000000000 x13: 2e64656465656e20 x12: ffff07ff8c7fd9e0
> [   13.102268] x11: 00000000000003e8 x10: ffff07ff8c7fd9e0 x9 : ffffbb50ac3345e0
> [   13.109392] x8 : ffffbb50abf18000 x7 : 0000000000000000 x6 : 000000007a456104
> [   13.116516] x5 : 0000000000000000 x4 : ffff07ff98580000 x3 : 0000000000000000
> [   13.123641] x2 : 0000000000000000 x1 : ffff07ff985a78f0 x0 : ffff07ffc5fd4000
> [   13.130765] Call trace:
> [   13.133200]  amdgpu_irq_put+0xa8/0xc8 [amdgpu]
> [   13.138121]  amdgpu_device_fini_hw+0xb8/0x380 [amdgpu]
> [   13.143732]  amdgpu_driver_unload_kms+0x54/0x80 [amdgpu]
> [   13.149517]  amdgpu_driver_load_kms+0x100/0x1c0 [amdgpu]
> [   13.155301]  amdgpu_pci_probe+0x134/0x428 [amdgpu]
> [   13.160564]  local_pci_probe+0x48/0xb8
> [   13.164305]  work_for_cpu_fn+0x24/0x40
> [   13.168043]  process_one_work+0x170/0x3d0
> [   13.172042]  worker_thread+0x2bc/0x3e0
> [   13.175781]  kthread+0x118/0x128
> [   13.178999]  ret_from_fork+0x10/0x20
> [   13.182564] ---[ end trace 0000000000000000 ]---
> ...
> [   16.984679] amdgpu: probe of 000d:03:00.0 failed with error -110
>
> Cc: Luben Tuikov <luben.tuikov@xxxxxxx>
> Cc: Alex Deucher <alexander.deucher@xxxxxxx>
> Fixes: 1cfb4d612127 ("drm/amdgpu: put MQDs in VRAM")
> Signed-off-by: Alexey Klimov <alexey.klimov@xxxxxxxxxx>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c | 9 ++-------
>  drivers/gpu/drm/amd/amdgpu/mes_v10_1.c  | 1 -
>  drivers/gpu/drm/amd/amdgpu/mes_v11_0.c  | 1 -
>  3 files changed, 2 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> index 2382921710ec..1f2d8be0fc44 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
> @@ -382,11 +382,6 @@ int amdgpu_gfx_mqd_sw_init(struct amdgpu_device *adev,
>         int r, i, j;
>         struct amdgpu_kiq *kiq = &adev->gfx.kiq[xcc_id];
>         struct amdgpu_ring *ring = &kiq->ring;
> -       u32 domain = AMDGPU_GEM_DOMAIN_GTT;
> -
> -       /* Only enable on gfx10 and 11 for now to avoid changing behavior on older chips */
> -       if (adev->ip_versions[GC_HWIP][0] >= IP_VERSION(10, 0, 0))
> -               domain |= AMDGPU_GEM_DOMAIN_VRAM;

Just removing the addition of the AMDGPU_GEM_DOMAIN_VRAM domain here
will revert the behavior.  Since this is an important optimization and
we aren't seeing any issues on x86, I'd prefer to just limit your arch
to GTT if we can't resolve it some other way.
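
For reference, the minimal in-place change I'm describing would look
roughly like this (a sketch only, hunk header omitted; it just drops the
VRAM preference in amdgpu_gfx_mqd_sw_init() and leaves the
amdgpu_bo_create_kernel() call sites alone):

--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
@@ int amdgpu_gfx_mqd_sw_init(struct amdgpu_device *adev,
 	u32 domain = AMDGPU_GEM_DOMAIN_GTT;
 
-	/* Only enable on gfx10 and 11 for now to avoid changing behavior on older chips */
-	if (adev->ip_versions[GC_HWIP][0] >= IP_VERSION(10, 0, 0))
-		domain |= AMDGPU_GEM_DOMAIN_VRAM;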

Try patch 1 and if that doesn't work we can fall back to some variant
of patch 2.

Alex

>
>         /* create MQD for KIQ */
>         if (!adev->enable_mes_kiq && !ring->mqd_obj) {
> @@ -421,7 +416,7 @@ int amdgpu_gfx_mqd_sw_init(struct amdgpu_device *adev,
>                         ring = &adev->gfx.gfx_ring[i];
>                         if (!ring->mqd_obj) {
>                                 r = amdgpu_bo_create_kernel(adev, mqd_size, PAGE_SIZE,
> -                                                           domain, &ring->mqd_obj,
> +                                                           AMDGPU_GEM_DOMAIN_GTT, &ring->mqd_obj,
>                                                             &ring->mqd_gpu_addr, &ring->mqd_ptr);
>                                 if (r) {
>                                         dev_warn(adev->dev, "failed to create ring mqd bo (%d)", r);
> @@ -445,7 +440,7 @@ int amdgpu_gfx_mqd_sw_init(struct amdgpu_device *adev,
>                 ring = &adev->gfx.compute_ring[j];
>                 if (!ring->mqd_obj) {
>                         r = amdgpu_bo_create_kernel(adev, mqd_size, PAGE_SIZE,
> -                                                   domain, &ring->mqd_obj,
> +                                                   AMDGPU_GEM_DOMAIN_GTT, &ring->mqd_obj,
>                                                     &ring->mqd_gpu_addr, &ring->mqd_ptr);
>                         if (r) {
>                                 dev_warn(adev->dev, "failed to create ring mqd bo (%d)", r);
> diff --git a/drivers/gpu/drm/amd/amdgpu/mes_v10_1.c b/drivers/gpu/drm/amd/amdgpu/mes_v10_1.c
> index eb06d749876f..080e7eb3f98d 100644
> --- a/drivers/gpu/drm/amd/amdgpu/mes_v10_1.c
> +++ b/drivers/gpu/drm/amd/amdgpu/mes_v10_1.c
> @@ -898,7 +898,6 @@ static int mes_v10_1_mqd_sw_init(struct amdgpu_device *adev,
>                 return 0;
>
>         r = amdgpu_bo_create_kernel(adev, mqd_size, PAGE_SIZE,
> -                                   AMDGPU_GEM_DOMAIN_VRAM |
>                                     AMDGPU_GEM_DOMAIN_GTT, &ring->mqd_obj,
>                                     &ring->mqd_gpu_addr, &ring->mqd_ptr);
>         if (r) {
> diff --git a/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c b/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
> index 6827d547042e..0608710306b8 100644
> --- a/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
> @@ -1004,7 +1004,6 @@ static int mes_v11_0_mqd_sw_init(struct amdgpu_device *adev,
>                 return 0;
>
>         r = amdgpu_bo_create_kernel(adev, mqd_size, PAGE_SIZE,
> -                                   AMDGPU_GEM_DOMAIN_VRAM |
>                                     AMDGPU_GEM_DOMAIN_GTT, &ring->mqd_obj,
>                                     &ring->mqd_gpu_addr, &ring->mqd_ptr);
>         if (r) {
> --
> 2.42.0
>
From 53219f8cbcf50521850c542163132678b2114265 Mon Sep 17 00:00:00 2001
From: Alex Deucher <alexander.deucher@xxxxxxx>
Date: Thu, 26 Oct 2023 14:47:57 -0400
Subject: [PATCH 1/2] drm/amdgpu/gfx10,11: use memcpy_to/fromio for MQDs

Since they were moved to VRAM, we need to use the IO
variants of memcpy.

Fixes: 1cfb4d612127 ("drm/amdgpu: put MQDs in VRAM")
Signed-off-by: Alex Deucher <alexander.deucher@xxxxxxx>
---
 drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 12 ++++++------
 drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c | 12 ++++++------
 2 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
index 9032d7a24d7c..306252cd67fd 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
@@ -6457,11 +6457,11 @@ static int gfx_v10_0_gfx_init_queue(struct amdgpu_ring *ring)
 		nv_grbm_select(adev, 0, 0, 0, 0);
 		mutex_unlock(&adev->srbm_mutex);
 		if (adev->gfx.me.mqd_backup[mqd_idx])
-			memcpy(adev->gfx.me.mqd_backup[mqd_idx], mqd, sizeof(*mqd));
+			memcpy_fromio(adev->gfx.me.mqd_backup[mqd_idx], mqd, sizeof(*mqd));
 	} else {
 		/* restore mqd with the backup copy */
 		if (adev->gfx.me.mqd_backup[mqd_idx])
-			memcpy(mqd, adev->gfx.me.mqd_backup[mqd_idx], sizeof(*mqd));
+			memcpy_toio(mqd, adev->gfx.me.mqd_backup[mqd_idx], sizeof(*mqd));
 		/* reset the ring */
 		ring->wptr = 0;
 		*ring->wptr_cpu_addr = 0;
@@ -6735,7 +6735,7 @@ static int gfx_v10_0_kiq_init_queue(struct amdgpu_ring *ring)
 	if (amdgpu_in_reset(adev)) { /* for GPU_RESET case */
 		/* reset MQD to a clean status */
 		if (adev->gfx.kiq[0].mqd_backup)
-			memcpy(mqd, adev->gfx.kiq[0].mqd_backup, sizeof(*mqd));
+			memcpy_toio(mqd, adev->gfx.kiq[0].mqd_backup, sizeof(*mqd));
 
 		/* reset ring buffer */
 		ring->wptr = 0;
@@ -6758,7 +6758,7 @@ static int gfx_v10_0_kiq_init_queue(struct amdgpu_ring *ring)
 		mutex_unlock(&adev->srbm_mutex);
 
 		if (adev->gfx.kiq[0].mqd_backup)
-			memcpy(adev->gfx.kiq[0].mqd_backup, mqd, sizeof(*mqd));
+			memcpy_fromio(adev->gfx.kiq[0].mqd_backup, mqd, sizeof(*mqd));
 	}
 
 	return 0;
@@ -6779,11 +6779,11 @@ static int gfx_v10_0_kcq_init_queue(struct amdgpu_ring *ring)
 		mutex_unlock(&adev->srbm_mutex);
 
 		if (adev->gfx.mec.mqd_backup[mqd_idx])
-			memcpy(adev->gfx.mec.mqd_backup[mqd_idx], mqd, sizeof(*mqd));
+			memcpy_fromio(adev->gfx.mec.mqd_backup[mqd_idx], mqd, sizeof(*mqd));
 	} else {
 		/* restore MQD to a clean status */
 		if (adev->gfx.mec.mqd_backup[mqd_idx])
-			memcpy(mqd, adev->gfx.mec.mqd_backup[mqd_idx], sizeof(*mqd));
+			memcpy_toio(mqd, adev->gfx.mec.mqd_backup[mqd_idx], sizeof(*mqd));
 		/* reset ring buffer */
 		ring->wptr = 0;
 		atomic64_set((atomic64_t *)ring->wptr_cpu_addr, 0);
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
index 762d7a19f1be..43d066bc5245 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
@@ -3684,11 +3684,11 @@ static int gfx_v11_0_gfx_init_queue(struct amdgpu_ring *ring)
 		soc21_grbm_select(adev, 0, 0, 0, 0);
 		mutex_unlock(&adev->srbm_mutex);
 		if (adev->gfx.me.mqd_backup[mqd_idx])
-			memcpy(adev->gfx.me.mqd_backup[mqd_idx], mqd, sizeof(*mqd));
+			memcpy_fromio(adev->gfx.me.mqd_backup[mqd_idx], mqd, sizeof(*mqd));
 	} else {
 		/* restore mqd with the backup copy */
 		if (adev->gfx.me.mqd_backup[mqd_idx])
-			memcpy(mqd, adev->gfx.me.mqd_backup[mqd_idx], sizeof(*mqd));
+			memcpy_toio(mqd, adev->gfx.me.mqd_backup[mqd_idx], sizeof(*mqd));
 		/* reset the ring */
 		ring->wptr = 0;
 		*ring->wptr_cpu_addr = 0;
@@ -3977,7 +3977,7 @@ static int gfx_v11_0_kiq_init_queue(struct amdgpu_ring *ring)
 	if (amdgpu_in_reset(adev)) { /* for GPU_RESET case */
 		/* reset MQD to a clean status */
 		if (adev->gfx.kiq[0].mqd_backup)
-			memcpy(mqd, adev->gfx.kiq[0].mqd_backup, sizeof(*mqd));
+			memcpy_toio(mqd, adev->gfx.kiq[0].mqd_backup, sizeof(*mqd));
 
 		/* reset ring buffer */
 		ring->wptr = 0;
@@ -4000,7 +4000,7 @@ static int gfx_v11_0_kiq_init_queue(struct amdgpu_ring *ring)
 		mutex_unlock(&adev->srbm_mutex);
 
 		if (adev->gfx.kiq[0].mqd_backup)
-			memcpy(adev->gfx.kiq[0].mqd_backup, mqd, sizeof(*mqd));
+			memcpy_fromio(adev->gfx.kiq[0].mqd_backup, mqd, sizeof(*mqd));
 	}
 
 	return 0;
@@ -4021,11 +4021,11 @@ static int gfx_v11_0_kcq_init_queue(struct amdgpu_ring *ring)
 		mutex_unlock(&adev->srbm_mutex);
 
 		if (adev->gfx.mec.mqd_backup[mqd_idx])
-			memcpy(adev->gfx.mec.mqd_backup[mqd_idx], mqd, sizeof(*mqd));
+			memcpy_fromio(adev->gfx.mec.mqd_backup[mqd_idx], mqd, sizeof(*mqd));
 	} else {
 		/* restore MQD to a clean status */
 		if (adev->gfx.mec.mqd_backup[mqd_idx])
-			memcpy(mqd, adev->gfx.mec.mqd_backup[mqd_idx], sizeof(*mqd));
+			memcpy_toio(mqd, adev->gfx.mec.mqd_backup[mqd_idx], sizeof(*mqd));
 		/* reset ring buffer */
 		ring->wptr = 0;
 		atomic64_set((atomic64_t *)ring->wptr_cpu_addr, 0);
-- 
2.41.0

From 5ae791c79de5cb055f381f9c27a1aec3b7f00162 Mon Sep 17 00:00:00 2001
From: Alex Deucher <alexander.deucher@xxxxxxx>
Date: Thu, 26 Oct 2023 14:37:31 -0400
Subject: [PATCH 2/2] drm/amdgpu: don't put MQDs in VRAM on ARM | ARM64

Seems to be some coherency issue?

Fixes: 1cfb4d612127 ("drm/amdgpu: put MQDs in VRAM")
Signed-off-by: Alex Deucher <alexander.deucher@xxxxxxx>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
index 2382921710ec..ef4cb921781d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
@@ -384,9 +384,11 @@ int amdgpu_gfx_mqd_sw_init(struct amdgpu_device *adev,
 	struct amdgpu_ring *ring = &kiq->ring;
 	u32 domain = AMDGPU_GEM_DOMAIN_GTT;
 
+#if !defined(CONFIG_ARM) && !defined(CONFIG_ARM64)
 	/* Only enable on gfx10 and 11 for now to avoid changing behavior on older chips */
 	if (adev->ip_versions[GC_HWIP][0] >= IP_VERSION(10, 0, 0))
 		domain |= AMDGPU_GEM_DOMAIN_VRAM;
+#endif
 
 	/* create MQD for KIQ */
 	if (!adev->enable_mes_kiq && !ring->mqd_obj) {
-- 
2.41.0

