NAK, the fundamental problem was that we disabled the SDMA paging queue
during reset:
[ 885.694682] [drm] schedpage0 is not ready, skipping
[ 885.694682] [drm] schedpage1 is not ready, skipping
This is fixed by now, so the problem should not happen any more.
Regards,
Christian.
On 06.05.20 at 11:36, Tiecheng Zhou wrote:
WHY:
For V320 passthrough with "modprobe amdgpu lockup_timeout=500", a kernel NULL
pointer dereference occurs when running quark with BACO reset, for instance:
hang_vm_compute0_bad_cs_dispatch.lua
hang_vm_dma0_corrupted_header.lua
etc.
-----------------------------
[ 884.792885] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring comp_1.0.0 timeout, signaled seq=3, emitted seq=4
[ 884.793772] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process quark pid 16939 thread quark pid 16940
[ 884.859979] amdgpu: [powerplay] set virtualization GFX DPM policy success
[ 884.861003] amdgpu: [powerplay] activate virtualization GFX DPM policy success
[ 884.861065] amdgpu: [powerplay] set virtualization VCE DPM policy success
[ 885.693554] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
[ 885.694682] [drm] schedpage0 is not ready, skipping
[ 885.694682] [drm] schedpage1 is not ready, skipping
[ 885.694720] [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't update BO_VA (-2)
[ 885.695328] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[ 885.695909] PGD 0 P4D 0
[ 885.696104] Oops: 0000 [#1] SMP PTI
[ 885.696368] CPU: 2 PID: 16940 Comm: quark Tainted: G OE 4.19.52+ #6
[ 885.696945] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
[ 885.697593] RIP: 0010:amdgpu_vm_sdma_commit+0x59/0x130 [amdgpu]
...
[ 885.705042] Call Trace:
[ 885.705251] ? amdgpu_vm_bo_update_mapping+0xdf/0xf0 [amdgpu]
[ 885.705696] ? amdgpu_vm_clear_freed+0xcc/0x1b0 [amdgpu]
[ 885.706112] ? amdgpu_gem_va_ioctl+0x4a1/0x510 [amdgpu]
[ 885.706493] ? __radix_tree_delete+0x7e/0xa0
[ 885.706822] ? amdgpu_gem_va_map_flags+0x70/0x70 [amdgpu]
[ 885.707220] ? drm_ioctl_kernel+0xaa/0xf0 [drm]
[ 885.707568] ? amdgpu_gem_va_map_flags+0x70/0x70 [amdgpu]
[ 885.707962] ? drm_ioctl_kernel+0xaa/0xf0 [drm]
[ 885.708294] ? drm_ioctl+0x3a7/0x3f0 [drm]
[ 885.708632] ? amdgpu_gem_va_map_flags+0x70/0x70 [amdgpu]
[ 885.709032] ? unmap_region+0xd9/0x120
[ 885.709328] ? amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
[ 885.709684] ? do_vfs_ioctl+0xa1/0x620
[ 885.709971] ? do_munmap+0x32e/0x430
[ 885.710232] ? ksys_ioctl+0x66/0x70
[ 885.710513] ? __x64_sys_ioctl+0x16/0x20
[ 885.710806] ? do_syscall_64+0x55/0x100
[ 885.711092] ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
...
[ 885.719408] ---[ end trace 7ee3180f42e9f572 ]---
[ 885.719766] RIP: 0010:amdgpu_vm_sdma_commit+0x59/0x130 [amdgpu]
...
-----------------------------
The NULL pointer dereference (entity->rq == NULL in amdgpu_vm_sdma_commit()) arises as follows:
1. quark sends a bad job that triggers a job timeout;
2. the guest KMD detects the job timeout and starts GPU recovery, which runs
ip_suspend for SDMA and sets sdma[].sched.ready to false;
3. quark sends an UNMAP operation through amdgpu_gem_va_ioctl; the guest KMD goes
through amdgpu_gem_va_update_vm to amdgpu_vm_sdma_commit, which calls
amdgpu_job_submit and then drm_sched_job_init;
4. drm_sched_job_init fails at drm_sched_pick_best() because
sdma[].sched.ready is false; in the meantime entity->rq becomes NULL;
5. quark sends further UNMAP operations through amdgpu_gem_va_ioctl, and this time
the NULL pointer dereference occurs because entity->rq is NULL.
The above sequence occurs only with "modprobe amdgpu lockup_timeout=500".
It does not occur with lockup_timeout=10000 (default) because in step 2 the KMD
detects the job timeout some time after quark has sent the UNMAP operations; i.e. the
quark UNMAP operations are finished before the SDMA IP suspend.
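The sequence above can be sketched as a small userspace model. This is only an illustration under stated assumptions: every model_* name and the MODEL_WOULD_OOPS marker are hypothetical stand-ins, not the real amdgpu or drm scheduler structures, and the NULL check stands in for the point where the real code dereferences entity->rq.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_ENOENT     (-2)   /* chosen to match the "(-2)" in the log above */
#define MODEL_WOULD_OOPS (-999) /* hypothetical marker for the NULL dereference */

/* Hypothetical stand-ins for run queue, scheduler and entity. */
struct model_rq { int id; };
struct model_sched { bool ready; struct model_rq rq; };
struct model_entity { struct model_rq *rq; };

/* Step 4: with no ready scheduler, job init fails and leaves rq == NULL. */
static int model_job_init(struct model_entity *e, struct model_sched *s)
{
	e->rq = s->ready ? &s->rq : NULL;
	return e->rq ? 0 : MODEL_ENOENT;
}

/* Steps 3 and 5: commit uses entity->rq before submitting the job. */
static int model_commit(struct model_entity *e, struct model_sched *s)
{
	if (e->rq == NULL)
		return MODEL_WOULD_OOPS; /* the real code would oops here */
	return model_job_init(e, s);
}

/* Returns the result of the second UNMAP after an optional reset start. */
int model_second_unmap(bool reset_started)
{
	struct model_sched sdma = { .ready = true, .rq = { .id = 0 } };
	struct model_entity entity = { .rq = &sdma.rq };

	if (reset_started)
		sdma.ready = false;              /* step 2: ip_suspend */
	(void)model_commit(&entity, &sdma);      /* steps 3-4: first UNMAP */
	return model_commit(&entity, &sdma);     /* step 5: second UNMAP */
}
```

In this model, model_second_unmap(true) reaches the dereference marker, while model_second_unmap(false) succeeds, which corresponds to the lockup_timeout=500 and lockup_timeout=10000 cases respectively.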
HOW:
Take adev->lock_reset with mutex_lock() in amdgpu_vm_clear_freed() so that it waits
for the GPU reset to finish instead of using SDMA during the reset.
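As a rough illustration of the locking shape, here is a userspace sketch of holding a lock across a loop with an early error exit. It is a single-exit variant using a goto label, which is not what the patch below does (the patch unlocks separately on each return path). The model_lock counter, model_clear_freed() and the callbacks are hypothetical names introduced only for this sketch; the counter merely lets the balance of acquire and release be checked.

```c
#include <stddef.h>

/* Hypothetical lock model: depth must be back at 0 after every call. */
struct model_lock { int depth; };

static void model_lock_acquire(struct model_lock *l) { l->depth++; }
static void model_lock_release(struct model_lock *l) { l->depth--; }

/* Hypothetical per-item update: 0 on success, negative on error. */
typedef int (*model_update_fn)(int item);

/* Process all items under the lock; release it on every exit path. */
int model_clear_freed(struct model_lock *reset_lock, const int *items,
		      size_t n, model_update_fn update)
{
	int r = 0;
	size_t i;

	model_lock_acquire(reset_lock);
	for (i = 0; i < n; i++) {
		r = update(items[i]);
		if (r)
			goto out_unlock; /* error: skip the remaining items */
	}
out_unlock:
	model_lock_release(reset_lock);
	return r;
}

/* Example callbacks for exercising both paths. */
int model_update_ok(int item) { (void)item; return 0; }
int model_update_fail_on_two(int item) { return item == 2 ? -22 : 0; }
```

With a single unlock site, adding another early exit later cannot leave the lock held, which is the usual reason for preferring this shape.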
Signed-off-by: Tiecheng Zhou <Tiecheng.Zhou@xxxxxxx>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index e205ecc75a21..018b88f3b6da 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -2047,6 +2047,8 @@ int amdgpu_vm_clear_freed(struct amdgpu_device *adev,
struct dma_fence *f = NULL;
int r;
+ mutex_lock(&adev->lock_reset);
+
while (!list_empty(&vm->freed)) {
mapping = list_first_entry(&vm->freed,
struct amdgpu_bo_va_mapping, list);
@@ -2062,6 +2064,7 @@ int amdgpu_vm_clear_freed(struct amdgpu_device *adev,
amdgpu_vm_free_mapping(adev, vm, mapping, f);
if (r) {
dma_fence_put(f);
+ mutex_unlock(&adev->lock_reset);
return r;
}
}
@@ -2073,6 +2076,7 @@ int amdgpu_vm_clear_freed(struct amdgpu_device *adev,
dma_fence_put(f);
}
+ mutex_unlock(&adev->lock_reset);
return 0;
}