Re: [PATCH] Revert "drm/radeon: use GEM references instead of TTMs"

Yes, that is a known issue with the driver at the moment.

It needs a three-line change to init the GEM functions earlier than before. I'm currently working on this fix.

Regards,
Christian.

On 01.10.24 at 15:50, Mingcong Bai wrote:
Hi Huacai,

On 2024-09-29 15:50, Huacai Chen wrote:
This reverts commit fd69ef05029f9beb7b031ef96e7a36970806a670.

The original patch causes NULL pointer dereferences:

[   21.620856] CPU 3 Unable to handle kernel paging request at virtual address 0000000000000000, era == 9000000004bf61d8, ra == 9000000004bf61d4
[   21.717958] Oops[#1]:
[   21.803205] CPU: 3 UID: 0 PID: 706 Comm: Xorg Not tainted 6.11.0+ #1708
[   21.894451] Hardware name: Loongson Loongson-3A5000-7A1000-1w-CRB/Loongson-LS3A5000-7A1000-1w-CRB, BIOS vUDK2018-LoongArch-V2.0.0-prebeta9 10/21/2022
[   21.996576] pc 9000000004bf61d8 ra 9000000004bf61d4 tp 9000000110560000 sp 9000000110563d40
[   22.094731] a0 000000000000002d a1 9000000000580788 a2 9000000000584d78 a3 9000000005678f40
[   22.193513] a4 9000000005678f38 a5 9000000110563b70 a6 0000000000000001 a7 0000000000000001
[   22.291993] t0 0000000000000000 t1 78315f0d31fceafb t2 0000000000000000 t3 00000000000003c4
[   22.389868] t4 9000000101d65840 t5 0000000000000003 t6 0000000000000003 t7 ffffffffffffffff
[   22.488326] t8 0000000000000001 u0 9000000120c31e20 s9 9000000110563ec0 s0 90000001107e0868
[   22.587345] s1 ffff80000230c000 s2 9000000120c31e48 s3 9000000120c31e00 s4 90000001063b0000
[   22.685908] s5 9000000120c31e20 s6 0000000000000122 s7 0000000000000100 s8 000055555c079570
[   22.785169]    ra: 9000000004bf61d4 drm_gem_object_free+0x24/0x70
[   22.881896]   ERA: 9000000004bf61d8 drm_gem_object_free+0x28/0x70
[   22.978212]  CRMD: 000000b0 (PLV0 -IE -DA +PG DACF=CC DACM=CC -WE)
[   23.076423]  PRMD: 00000004 (PPLV0 +PIE -PWE)
[   23.153679] [drm] amdgpu kernel modesetting enabled.
[   23.173074]  EUEN: 00000000 (-FPE -SXE -ASXE -BTE)
[   23.365633]  ECFG: 00071c1d (LIE=0,2-4,10-12 VS=7)
[   23.459680] ESTAT: 00010000 [PIL] (IS= ECode=1 EsubCode=0)
[   23.554473]  BADV: 0000000000000000
[   23.646222]  PRID: 0014c010 (Loongson-64bit, Loongson-3A5000)
[   23.740356] Modules linked in: amdgpu rfkill nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct drm_exec amdxcps
[   23.973584] Process Xorg (pid: 706, threadinfo=000000005fc343eb, task=000000007bdfdf49)
[   24.080528] Stack : 9000000120d86000 ffff8000021bb1c0 0000000000000000 ffff8000022a6bcc
[   24.188191]         0000000000000122 9000000120c31d08 900000010e04a400 9000000120c31e00
[   24.295420]         90000001063b0008 9000000120c31c00 90000001063b0000 ffff80000219c54c
[   24.402622]         00000000000000b4 90000001063b0170 90000001063b0008 9000000120c31c00
[   24.509242]         9000000120c31ce0 90000000043966f8 000055555c0922c0 000055555c082ac0
[   24.615887]         000055555597b000 0000000000000000 90000001034af840 90000001063f7928
[   24.723086]         90000001063b00d0 9000000120c31c00 90000001063b0008 9000000004396844
[   24.830582]         90000001017901a0 90000001017901a0 900000010e7e6718 00000000000a001b
[   24.937455]         90000001228b86c0 9000000003ad5904 000055555c082da0 0000000000000000
[   25.043806]         000055555c082ac0 90000001228b86c0 0000000000000000 9000000003acfb58
[   25.149701]         ...
[   25.248708] Call Trace:
[   25.248710] [<9000000004bf61d8>] drm_gem_object_free+0x28/0x70
[   25.447554] [<ffff8000021bb1bc>] radeon_bo_unref+0x3c/0x60 [radeon]
[   25.549201] [<ffff8000022a6bc8>] radeon_vm_fini+0x188/0x2c0 [radeon]
[   25.650751] [<ffff80000219c548>] radeon_driver_postclose_kms+0x188/0x1e0 [radeon]
[   25.753856] [<90000000043966f4>] drm_file_free+0x214/0x2a0
[   25.854893] [<9000000004396840>] drm_release+0xc0/0x160
[   25.954337] [<9000000003ad5900>] __fput+0x100/0x340
[   26.052437] [<9000000003acfb54>] sys_close+0x34/0xa0
[   26.148701] [<9000000004c04170>] do_syscall+0xb0/0x160


This does not appear to be LoongArch-specific: I was able to reproduce it on my Intel platform (H310 chipset, Pentium Gold G5620) with an AMD Radeon R7 240 (Oland) connected via HDMI.

I am happy to provide more test results if needed. Below is the log from the crash:

kernel: BUG: kernel NULL pointer dereference, address: 0000000000000000
kernel: #PF: supervisor read access in kernel mode
kernel: #PF: error_code(0x0000) - not-present page
kernel: PGD 0 P4D 0
kernel: Oops: Oops: 0000 [#1] PREEMPT SMP PTI
kernel: CPU: 3 UID: 0 PID: 952 Comm: ddcutil Not tainted 6.11.0-aosc-main-11993-g3efc57369a0c #1
kernel: Hardware name: System manufacturer System Product Name/PRIME H310M-F R2.0, BIOS 1401 03/31/2020
kernel: RIP: 0010:drm_gem_object_free+0x10/0x30
kernel: Code: 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 48 8b 87 40 01 00 00 <48> 8b 00 48 85 c0 74 06 ff e0 cc 66 90 cc 0f 0b 31 >
kernel: RSP: 0018:ffffb0f300b23de8 EFLAGS: 00010246
kernel: RAX: 0000000000000000 RBX: ffff918b0487a000 RCX: 000000000000000c
kernel: RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff918b1eee2468
kernel: RBP: ffff918b197d9000 R08: 0000000000000000 R09: 0000000000000000
kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff918b179cc000
kernel: R13: ffff918b03ee0800 R14: ffff918b197d9048 R15: ffff918b197d92e0
kernel: FS:  00007ffb58033b80(0000) GS:ffff918b32d80000(0000) knlGS:0000000000000000
kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 0000000000000000 CR3: 000000011eda4005 CR4: 00000000003706f0
kernel: Call Trace:
kernel:  <TASK>
kernel:  ? __die+0x23/0x80
kernel:  ? page_fault_oops+0x14f/0x560
kernel:  ? exc_page_fault+0x84/0x1c0
kernel:  ? asm_exc_page_fault+0x26/0x30
kernel:  ? drm_gem_object_free+0x10/0x30
kernel:  radeon_bo_unref+0x64/0x80 [radeon]
kernel:  radeon_vm_fini+0x1d0/0x260 [radeon]
kernel:  radeon_driver_postclose_kms+0x164/0x190 [radeon]
kernel:  drm_file_free+0x1f3/0x250
kernel:  drm_release+0xaa/0x120
kernel:  __fput+0xdc/0x2a0
kernel:  __x64_sys_close+0x3c/0x80
kernel:  do_syscall_64+0x64/0x150
kernel:  entry_SYSCALL_64_after_hwframe+0x76/0x7e
kernel: RIP: 0033:0x7ffb57ef9430
kernel: Code: 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 80 3d 39 8f 11 00 00 74 17 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 48 c3 0f 1f 80 00 00 00 00 48 >
kernel: RSP: 002b:00007ffd59048868 EFLAGS: 00000202 ORIG_RAX: 0000000000000003
kernel: RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007ffb57ef9430
kernel: RDX: 000000055c96b7fe RSI: 0000000000000001 RDI: 0000000000000003
kernel: RBP: 0000000000000001 R08: 0000000000000007 R09: 000055c96b7fe430
kernel: R10: a563eae46f2f347c R11: 0000000000000202 R12: 0000000000000000
kernel: R13: 000055c9634e44b8 R14: 0000000000000010 R15: 000055c96347e698
kernel:  </TASK>
kernel: Modules linked in: joydev mousedev input_leds snd_soc_avs snd_soc_hda_codec snd_hda_ext_core intel_rapl_msr iTCO_wdt intel_rapl_common intel_pmc_bxt intel_uncore_frequency snd_soc_core >
kernel:  drm_ttm_helper ttm video wmi hid_logitech_dj hid_generic sunrpc coretemp i2c_dev
kernel: CR2: 0000000000000000
kernel: ---[ end trace 0000000000000000 ]---
kernel: RIP: 0010:drm_gem_object_free+0x10/0x30
kernel: Code: 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 48 8b 87 40 01 00 00 <48> 8b 00 48 85 c0 74 06 ff e0 cc 66 90 cc 0f 0b 31 >
kernel: RSP: 0018:ffffb0f300b23de8 EFLAGS: 00010246
kernel: RAX: 0000000000000000 RBX: ffff918b0487a000 RCX: 000000000000000c
kernel: RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff918b1eee2468
kernel: RBP: ffff918b197d9000 R08: 0000000000000000 R09: 0000000000000000
kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff918b179cc000
kernel: R13: ffff918b03ee0800 R14: ffff918b197d9048 R15: ffff918b197d92e0
kernel: FS:  00007ffb58033b80(0000) GS:ffff918b32d80000(0000) knlGS:0000000000000000
kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 0000000000000000 CR3: 000000011eda4005 CR4: 00000000003706f0

The root cause is that obj->funcs is NULL in drm_gem_object_free(). Only
the fbdev bo is created by radeon_gem_object_create() and has valid 'funcs'.
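
To make the failure mode concrete, here is a minimal userspace sketch. It is not kernel code: the model_* types and functions are hypothetical stand-ins that only mimic the relevant shape, namely a free callback reached through a 'funcs' pointer. In the x86 log above, the faulting instruction is a load through RAX while RAX is 0, which matches reading the callback through a NULL 'funcs'. The model checks the pointer and returns an error where the real path faults.

```c
#include <stddef.h>

/* Hypothetical userspace stand-ins; these are NOT the kernel's structures. */
struct model_gem_object;

struct model_gem_object_funcs {
    void (*free)(struct model_gem_object *obj);
};

struct model_gem_object {
    const struct model_gem_object_funcs *funcs; /* NULL until initialised */
    int freed;                                  /* set by the free callback */
};

/* Example free callback: just records that it ran. */
static void model_free_cb(struct model_gem_object *obj)
{
    obj->freed = 1;
}

static const struct model_gem_object_funcs model_funcs = {
    .free = model_free_cb,
};

/*
 * The callback is looked up via obj->funcs, so a NULL funcs pointer would
 * be dereferenced before any callback runs. This model checks first and
 * returns -1 instead of crashing; it returns 0 after calling the callback.
 */
static int model_gem_object_free(struct model_gem_object *obj)
{
    if (obj->funcs == NULL || obj->funcs->free == NULL)
        return -1;
    obj->funcs->free(obj);
    return 0;
}
```

An object whose funcs points at model_funcs is freed through the callback, while an object whose funcs was never set takes the error path. The check is only there to make the model safe to run; it is not a proposed fix for the driver.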

Maybe there is a better way to fix this bug, but since the amdgpu driver
also uses ttm helpers in amdgpu_bo_ref()/amdgpu_bo_unref() now, I think
it is also reasonable to just revert the original commit.
---
 drivers/gpu/drm/radeon/radeon_gem.c    | 2 +-
 drivers/gpu/drm/radeon/radeon_object.c | 7 +++++--
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/radeon/radeon_gem.c b/drivers/gpu/drm/radeon/radeon_gem.c
index 9735f4968b86..210e8d43bb23 100644
--- a/drivers/gpu/drm/radeon/radeon_gem.c
+++ b/drivers/gpu/drm/radeon/radeon_gem.c
@@ -88,7 +88,7 @@ static void radeon_gem_object_free(struct drm_gem_object *gobj)

     if (robj) {
         radeon_mn_unregister(robj);
-        ttm_bo_put(&robj->tbo);
+        radeon_bo_unref(&robj);
     }
 }

diff --git a/drivers/gpu/drm/radeon/radeon_object.c b/drivers/gpu/drm/radeon/radeon_object.c
index d0e4b43d155c..450ff7daa46c 100644
--- a/drivers/gpu/drm/radeon/radeon_object.c
+++ b/drivers/gpu/drm/radeon/radeon_object.c
@@ -256,15 +256,18 @@ struct radeon_bo *radeon_bo_ref(struct radeon_bo *bo)
     if (bo == NULL)
         return NULL;

-    drm_gem_object_get(&bo->tbo.base);
+    ttm_bo_get(&bo->tbo);
     return bo;
 }

 void radeon_bo_unref(struct radeon_bo **bo)
 {
+    struct ttm_buffer_object *tbo;
+
     if ((*bo) == NULL)
         return;
-    drm_gem_object_put(&(*bo)->tbo.base);
+    tbo = &((*bo)->tbo);
+    ttm_bo_put(tbo);
     *bo = NULL;
 }

Best Regards,
Mingcong Bai



