> On 25 March 2020 at 17:23, Pan, Xinhui <Xinhui.Pan@xxxxxxx> wrote:
>
>> On 25 March 2020 at 15:48, Koenig, Christian <Christian.Koenig@xxxxxxx> wrote:
>>
>> On 25.03.20 at 06:47, xinhui pan wrote:
>>> Hit panic during GPU recovery test. drm_sched_entity_select_rq might
>>> set rq to NULL. So add a check like drm_sched_job_init does.
>>
>> NAK, the rq should never be set to NULL in the first place.
>>
>> How did that happen?
>
> Well, I have not checked the details.

So recovery will disable the sdma ring, and sched->ready will then be false. Any job submitted during suspend and resume will hit this issue.

[ 99.011614] amdgpu 0000:03:00.0: GPU reset begin!
[ 99.265504] CPU: 5 PID: 163 Comm: kworker/5:1 Tainted: G W 5.4.0-rc7+ #1
[ 99.273659] Hardware name: System manufacturer System Product Name/Z170-A, BIOS 1702 01/28/2016
[ 99.282522] Workqueue: events drm_sched_job_timedout [gpu_sched]
[ 99.288682] Call Trace:
[ 99.291193]  dump_stack+0x98/0xd5
[ 99.294629]  sdma_v5_0_enable+0x1ab/0x1d0 [amdgpu]
[ 99.299563]  sdma_v5_0_suspend+0x2a/0x30 [amdgpu]
[ 99.304360]  amdgpu_device_ip_suspend_phase2+0xa3/0x110 [amdgpu]
[ 99.310504]  ? amdgpu_device_ip_suspend_phase1+0x5b/0xe0 [amdgpu]
[ 99.316727]  amdgpu_device_ip_suspend+0x37/0x60 [amdgpu]
[ 99.322159]  amdgpu_device_pre_asic_reset+0x81/0x1f0 [amdgpu]
[ 99.328054]  amdgpu_device_gpu_recover+0x27f/0xc60 [amdgpu]
[ 99.333767]  amdgpu_job_timedout+0x123/0x140 [amdgpu]
[ 99.338898]  drm_sched_job_timedout+0x85/0xe0 [gpu_sched]
[ 99.344445]  ? amdgpu_cgs_destroy_device+0x10/0x10 [amdgpu]
[ 99.350145]  ? drm_sched_job_timedout+0x85/0xe0 [gpu_sched]
[ 99.355834]  process_one_work+0x231/0x5c0
[ 99.359927]  worker_thread+0x3f/0x3b0
[ 99.363641]  ? __kthread_parkme+0x61/0x90
[ 99.367701]  kthread+0x12c/0x150
[ 99.371010]  ? process_one_work+0x5c0/0x5c0
[ 99.375318]  ? kthread_park+0x90/0x90
[ 99.379042]  ret_from_fork+0x3a/0x50

> but just got the call trace below.
> Looks like the sched is not ready, and drm_sched_entity_select_rq set entity->rq to NULL.
> In the next amdgpu_vm_sdma_commit, we hit a panic when we dereference entity->rq.
>
> [ 44.667677] amdgpu 0000:03:00.0: GPU reset begin!
> [ 44.929047] [drm] scheduler sdma0 is not ready, skipping
> [ 44.929048] [drm] scheduler sdma1 is not ready, skipping
> [ 44.934608] [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't update BO_VA (-2)
> [ 44.947941] BUG: kernel NULL pointer dereference, address: 0000000000000038
> [ 44.955132] #PF: supervisor read access in kernel mode
> [ 44.960451] #PF: error_code(0x0000) - not-present page
> [ 44.965714] PGD 0 P4D 0
> [ 44.968331] Oops: 0000 [#1] SMP PTI
> [ 44.971911] CPU: 7 PID: 2496 Comm: gnome-shell Tainted: G W 5.4.0-rc7+ #1
> [ 44.980221] Hardware name: System manufacturer System Product Name/Z170-A, BIOS 1702 01/28/2016
> [ 44.989177] RIP: 0010:amdgpu_vm_sdma_commit+0x55/0x190 [amdgpu]
> [ 44.995242] Code: 47 20 80 7f 10 00 4c 8b a0 88 01 00 00 48 8b 47 08 4c 8d a8 70 01 00 00 75 07 4c 8d a8 88 02 00 00 49 8b 45 10 41 8b 54 24 08 <48> 8b 40 38 85 d2 48 8d b8 30 ff ff ff 0f 84 06 01 00 00 48 8b 80
> [ 45.014931] RSP: 0018:ffffb66e008839d0 EFLAGS: 00010246
> [ 45.020504] RAX: 0000000000000000 RBX: ffffb66e00883a30 RCX: 0000000000100400
> [ 45.028062] RDX: 000000000000003c RSI: ffff8df123662138 RDI: ffffb66e00883a30
> [ 45.035662] RBP: ffffb66e00883a00 R08: ffffb66e0088395c R09: ffffb66e00883960
> [ 45.043298] R10: 0000000000100240 R11: 0000000000000035 R12: ffff8df1425385e8
> [ 45.050916] R13: ffff8df13cfd1288 R14: ffff8df123662138 R15: ffff8df13cfd1000
> [ 45.058524] FS: 00007fcc8f6b2100(0000) GS:ffff8df15e380000(0000) knlGS:0000000000000000
> [ 45.067114] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 45.073206] CR2: 0000000000000038 CR3: 0000000641fb6006 CR4: 00000000003606e0
> [ 45.080791] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 45.088277] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 45.095773] Call Trace:
> [ 45.098354]  amdgpu_vm_bo_update_mapping+0x1c1/0x1f0 [amdgpu]
> [ 45.104427]  ? mark_held_locks+0x4d/0x80
> [ 45.108682]  amdgpu_vm_bo_update+0x3b7/0x960 [amdgpu]
> [ 45.114049]  ? rcu_read_lock_sched_held+0x4f/0x80
> [ 45.119111]  amdgpu_gem_va_ioctl+0x4f3/0x510 [amdgpu]
> [ 45.124495]  ? amdgpu_gem_va_map_flags+0x70/0x70 [amdgpu]
> [ 45.130250]  drm_ioctl_kernel+0xb0/0x100 [drm]
> [ 45.134988]  ? amdgpu_gem_va_map_flags+0x70/0x70 [amdgpu]
> [ 45.140742]  ? drm_ioctl_kernel+0xb0/0x100 [drm]
> [ 45.145622]  drm_ioctl+0x389/0x450 [drm]
> [ 45.149804]  ? amdgpu_gem_va_map_flags+0x70/0x70 [amdgpu]
> [ 45.155551]  ? trace_hardirqs_on+0x3b/0xf0
> [ 45.159892]  amdgpu_drm_ioctl+0x4f/0x80 [amdgpu]
> [ 45.172104]  do_vfs_ioctl+0xa9/0x6f0
> [ 45.175909]  ? tomoyo_file_ioctl+0x19/0x20
> [ 45.180241]  ksys_ioctl+0x75/0x80
> [ 45.183760]  ? do_syscall_64+0x17/0x230
> [ 45.187833]  __x64_sys_ioctl+0x1a/0x20
> [ 45.191846]  do_syscall_64+0x5f/0x230
> [ 45.195764]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> [ 45.201126] RIP: 0033:0x7fcc8c7725d7
>
>>
>> Regards,
>> Christian.
>>
>>>
>>> Cc: Christian König <christian.koenig@xxxxxxx>
>>> Cc: Alex Deucher <alexander.deucher@xxxxxxx>
>>> Cc: Felix Kuehling <Felix.Kuehling@xxxxxxx>
>>> Signed-off-by: xinhui pan <xinhui.pan@xxxxxxx>
>>> ---
>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c | 2 ++
>>>  1 file changed, 2 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
>>> index cf96c335b258..d30d103e48a2 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
>>> @@ -95,6 +95,8 @@ static int amdgpu_vm_sdma_commit(struct amdgpu_vm_update_params *p,
>>>  	int r;
>>>
>>>  	entity = p->direct ? &p->vm->direct : &p->vm->delayed;
>>> +	if (!entity->rq)
>>> +		return -ENOENT;
>>>  	ring = container_of(entity->rq->sched, struct amdgpu_ring, sched);
>>>
>>>  	WARN_ON(ib->length_dw == 0);
>>
>
_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
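
For anyone following the thread, here is a minimal userspace sketch of the failure mode and of the guard proposed in the patch. It is not the amdgpu code: every mock_* type, macro and function is a hypothetical stand-in, and it assumes only what the oops shows, namely that entity->rq can be NULL by the time the commit path resolves the ring through rq->sched.

```c
#include <errno.h>
#include <stddef.h>

/* Userspace stand-ins for the kernel structures; all mock_* names are hypothetical. */
struct mock_sched { int ready; };
struct mock_rq { struct mock_sched *sched; };
struct mock_entity { struct mock_rq *rq; };
struct mock_ring { int idx; struct mock_sched sched; };

/* Simplified container_of: recover the enclosing struct from a member pointer. */
#define mock_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/*
 * Models the check proposed in the patch: return -ENOENT before
 * entity->rq->sched is read, instead of dereferencing a NULL rq.
 */
static int mock_commit(struct mock_entity *entity, struct mock_ring **ring_out)
{
	if (!entity->rq)
		return -ENOENT;

	*ring_out = mock_container_of(entity->rq->sched, struct mock_ring, sched);
	return 0;
}
```

Calling mock_commit() on an entity whose rq is NULL returns -ENOENT, while an entity with a valid rq yields the ring that embeds its scheduler. This only models the symptom-side guard; per the NAK above, whether rq should ever become NULL in the first place is a separate question.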