[AMD Official Use Only - Internal Distribution Only]

Thanks, Christian,

Is this the fix that you are mentioning:

commit 1675c3a24d075d484377003789245f48c2114a0b
Author: Christian König <christian.koenig@xxxxxxx>
Date:   Fri Feb 21 15:10:31 2020 +0100

    drm/amdgpu: stop disable the scheduler during HW fini

    When we stop the HW for example for GPU reset we should not stop the
    front-end scheduler. Otherwise we run into intermediate failures during
    command submission.

    The scheduler should only be stopped in very few cases:
    1. We can't get the hardware working in ring or IB test after a GPU reset.
    2. The KIQ scheduler is not used in the front-end and should be disabled during GPU reset.
    3. In amdgpu_ring_fini() when the driver unloads.

    Signed-off-by: Christian König <christian.koenig@xxxxxxx>
    Reviewed-by: Alex Deucher <alexander.deucher@xxxxxxx>
    Acked-by: Nirmoy Das <nirmoy.das@xxxxxxx>
    Test-by: Dennis Li <dennis.li@xxxxxxx>
    Signed-off-by: Alex Deucher <alexander.deucher@xxxxxxx>

Thanks
Tiecheng

-----Original Message-----
From: Christian König <ckoenig.leichtzumerken@xxxxxxxxx>
Sent: Wednesday, May 6, 2020 5:44 PM
To: Zhou, Tiecheng <Tiecheng.Zhou@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
Subject: Re: [PATCH] drm/amdgpu: avoid clearing freed bo with sdma in gpu reset

NAK, the fundamental problem was that we disabled the SDMA paging queue during reset:

> [ 885.694682] [drm] schedpage0 is not ready, skipping
> [ 885.694682] [drm] schedpage1 is not ready, skipping

This is fixed by now, so the problem should not happen any more.

Regards,
Christian.

On 06.05.20 at 11:36, Tiecheng Zhou wrote:
> WHY:
> For V320 passthrough and "modprobe amdgpu lockup_timeout=500", there
> will be a kernel NULL pointer dereference when using quark ~ BACO reset, for instance:
> hang_vm_compute0_bad_cs_dispatch.lua
> hang_vm_dma0_corrupted_header.lua
> etc.
> -----------------------------
> [ 884.792885] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring comp_1.0.0 timeout, signaled seq=3, emitted seq=4
> [ 884.793772] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process quark pid 16939 thread quark pid 16940
> [ 884.859979] amdgpu: [powerplay] set virtualization GFX DPM policy success
> [ 884.861003] amdgpu: [powerplay] activate virtualization GFX DPM policy success
> [ 884.861065] amdgpu: [powerplay] set virtualization VCE DPM policy success
> [ 885.693554] [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser -125!
> [ 885.694682] [drm] schedpage0 is not ready, skipping
> [ 885.694682] [drm] schedpage1 is not ready, skipping
> [ 885.694720] [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't update BO_VA (-2)
> [ 885.695328] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
> [ 885.695909] PGD 0 P4D 0
> [ 885.696104] Oops: 0000 [#1] SMP PTI
> [ 885.696368] CPU: 2 PID: 16940 Comm: quark Tainted: G OE 4.19.52+ #6
> [ 885.696945] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
> [ 885.697593] RIP: 0010:amdgpu_vm_sdma_commit+0x59/0x130 [amdgpu]
> ...
> [ 885.705042] Call Trace:
> [ 885.705251] ? amdgpu_vm_bo_update_mapping+0xdf/0xf0 [amdgpu]
> [ 885.705696] ? amdgpu_vm_clear_freed+0xcc/0x1b0 [amdgpu]
> [ 885.706112] ? amdgpu_gem_va_ioctl+0x4a1/0x510 [amdgpu]
> [ 885.706493] ? __radix_tree_delete+0x7e/0xa0
> [ 885.706822] ? amdgpu_gem_va_map_flags+0x70/0x70 [amdgpu]
> [ 885.707220] ? drm_ioctl_kernel+0xaa/0xf0 [drm]
> [ 885.707568] ? amdgpu_gem_va_map_flags+0x70/0x70 [amdgpu]
> [ 885.707962] ? drm_ioctl_kernel+0xaa/0xf0 [drm]
> [ 885.708294] ? drm_ioctl+0x3a7/0x3f0 [drm]
> [ 885.708632] ? amdgpu_gem_va_map_flags+0x70/0x70 [amdgpu]
> [ 885.709032] ? unmap_region+0xd9/0x120
> [ 885.709328] ? amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
> [ 885.709684] ? do_vfs_ioctl+0xa1/0x620
> [ 885.709971] ? do_munmap+0x32e/0x430
> [ 885.710232] ? ksys_ioctl+0x66/0x70
> [ 885.710513] ? __x64_sys_ioctl+0x16/0x20
> [ 885.710806] ? do_syscall_64+0x55/0x100
> [ 885.711092] ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
> ...
> [ 885.719408] ---[ end trace 7ee3180f42e9f572 ]---
> [ 885.719766] RIP: 0010:amdgpu_vm_sdma_commit+0x59/0x130 [amdgpu]
> ...
> -----------------------------
>
> The NULL pointer dereference (entity->rq == NULL in amdgpu_vm_sdma_commit())
> occurs as follows:
> 1. quark sends a bad job that triggers a job timeout;
> 2. the guest KMD detects the job timeout and goes to GPU recovery, which
>    goes to ip_suspend for SDMA and sets sdma[].sched.ready to false;
> 3. quark sends an UNMAP operation through amdgpu_gem_va_ioctl; the guest KMD
>    goes through amdgpu_gem_va_update_vm and finally reaches
>    amdgpu_vm_sdma_commit, which goes through amdgpu_job_submit to
>    drm_sched_job_init;
> 4. drm_sched_job_init fails at drm_sched_pick_best() since
>    sdma[].sched.ready is set to false; in the meantime entity->rq
>    becomes NULL;
> 5. quark sends further UNMAP operations through amdgpu_gem_va_ioctl, and this
>    time there is a NULL pointer dereference because entity->rq is NULL.
>
> The above sequence occurs only with "modprobe amdgpu lockup_timeout=500".
> It does not occur with lockup_timeout=10000 (default), because step 2
> (the KMD detecting the job timeout) then happens some time after quark
> sends the UNMAP operations; i.e. the quark UNMAP operations are finished
> before the SDMA IP suspend.
>
> HOW:
> Add mutex_lock to wait, so that SDMA is not used during GPU reset.
>
> Signed-off-by: Tiecheng Zhou <Tiecheng.Zhou@xxxxxxx>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> index e205ecc75a21..018b88f3b6da 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> @@ -2047,6 +2047,8 @@ int amdgpu_vm_clear_freed(struct amdgpu_device *adev,
>          struct dma_fence *f = NULL;
>          int r;
>
> +        mutex_lock(&adev->lock_reset);
> +
>          while (!list_empty(&vm->freed)) {
>                  mapping = list_first_entry(&vm->freed,
>                          struct amdgpu_bo_va_mapping, list);
> @@ -2062,6 +2064,7 @@ int amdgpu_vm_clear_freed(struct amdgpu_device *adev,
>                  amdgpu_vm_free_mapping(adev, vm, mapping, f);
>                  if (r) {
>                          dma_fence_put(f);
> +                        mutex_unlock(&adev->lock_reset);
>                          return r;
>                  }
>          }
> @@ -2073,6 +2076,7 @@ int amdgpu_vm_clear_freed(struct amdgpu_device *adev,
>                  dma_fence_put(f);
>          }
>
> +        mutex_unlock(&adev->lock_reset);
>          return 0;
>
>  }
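
For readers following the thread, steps 1 to 5 of the quoted description can be modelled in plain userspace C. This is only a rough sketch under stated assumptions, not the kernel code: every `toy_` name is hypothetical, the structs are stand-ins for the scheduler and entity, and the `-EFAULT` return is an arbitrary marker for the point where the real driver dereferenced `entity->rq` and oopsed. `-ENOENT` is used for the failed job init because its value on Linux (2) matches the `Couldn't update BO_VA (-2)` line in the log.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are NOT the kernel's scheduler structs. */
struct toy_sched {
    bool ready;              /* models sdma[].sched.ready */
};

struct toy_entity {
    struct toy_sched *rq;    /* models entity->rq */
};

/* Return the first ready scheduler, or NULL when none is ready. */
static struct toy_sched *toy_pick_best(struct toy_sched *scheds, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (scheds[i].ready)
            return &scheds[i];
    return NULL;
}

/* Step 4: re-select the run queue; on failure rq is left NULL. */
static int toy_job_init(struct toy_entity *e, struct toy_sched *scheds, size_t n)
{
    assert(n > 0);
    e->rq = toy_pick_best(scheds, n);
    return e->rq ? 0 : -ENOENT;
}

/*
 * Steps 3 and 5: the commit path uses e->rq before submitting the job.
 * The toy returns -EFAULT where the real driver hit the NULL pointer.
 */
static int toy_commit(struct toy_entity *e, struct toy_sched *scheds, size_t n)
{
    if (!e->rq)
        return -EFAULT;
    return toy_job_init(e, scheds, n);
}
```

With two ready schedulers and `rq` pointing at the first, `toy_commit()` returns 0. Once both `ready` flags are cleared (the suspend in step 2), the next call returns `-ENOENT` and leaves `rq` NULL, and the call after that reaches the NULL check. The model says nothing about which fix is right; it only shows why the second UNMAP is the one that crashes.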