On Tue, Aug 11, 2020 at 4:12 AM Dennis Li <Dennis.Li@xxxxxxx> wrote: > > [ 584.110304] ============================================ > [ 584.110590] WARNING: possible recursive locking detected > [ 584.110876] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted: G OE > [ 584.111164] -------------------------------------------- > [ 584.111456] kworker/38:1/553 is trying to acquire lock: > [ 584.111721] ffff9b15ff0a47a0 (&adev->reset_sem){++++}, at: amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] > [ 584.112112] > but task is already holding lock: > [ 584.112673] ffff9b1603d247a0 (&adev->reset_sem){++++}, at: amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] > [ 584.113068] > other info that might help us debug this: > [ 584.113689] Possible unsafe locking scenario: > > [ 584.114350] CPU0 > [ 584.114685] ---- > [ 584.115014] lock(&adev->reset_sem); > [ 584.115349] lock(&adev->reset_sem); > [ 584.115678] > *** DEADLOCK *** > > [ 584.116624] May be due to missing lock nesting notation > > [ 584.117284] 4 locks held by kworker/38:1/553: > [ 584.117616] #0: ffff9ad635c1d348 ((wq_completion)events){+.+.}, at: process_one_work+0x21f/0x630 > [ 584.117967] #1: ffffac708e1c3e58 ((work_completion)(&con->recovery_work)){+.+.}, at: process_one_work+0x21f/0x630 > [ 584.118358] #2: ffffffffc1c2a5d0 (&tmp->hive_lock){+.+.}, at: amdgpu_device_gpu_recover+0xae/0x1030 [amdgpu] > [ 584.118786] #3: ffff9b1603d247a0 (&adev->reset_sem){++++}, at: amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] > [ 584.119222] > stack backtrace: > [ 584.119990] CPU: 38 PID: 553 Comm: kworker/38:1 Kdump: loaded Tainted: G OE 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 > [ 584.120782] Hardware name: Supermicro SYS-7049GP-TRT/X11DPG-QT, BIOS 3.1 05/23/2019 > [ 584.121223] Workqueue: events amdgpu_ras_do_recovery [amdgpu] > [ 584.121638] Call Trace: > [ 584.122050] dump_stack+0x98/0xd5 > [ 584.122499] __lock_acquire+0x1139/0x16e0 > [ 584.122931] ? trace_hardirqs_on+0x3b/0xf0 > [ 584.123358] ? cancel_delayed_work+0xa6/0xc0 > [ 584.123771] lock_acquire+0xb8/0x1c0 > [ 584.124197] ? amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] > [ 584.124599] down_write+0x49/0x120 > [ 584.125032] ? amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] > [ 584.125472] amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] > [ 584.125910] ? amdgpu_ras_error_query+0x1b8/0x2a0 [amdgpu] > [ 584.126367] amdgpu_ras_do_recovery+0x159/0x190 [amdgpu] > [ 584.126789] process_one_work+0x29e/0x630 > [ 584.127208] worker_thread+0x3c/0x3f0 > [ 584.127621] ? __kthread_parkme+0x61/0x90 > [ 584.128014] kthread+0x12f/0x150 > [ 584.128402] ? process_one_work+0x630/0x630 > [ 584.128790] ? kthread_park+0x90/0x90 > [ 584.129174] ret_from_fork+0x3a/0x50 > > Each adev has owned lock_class_key to avoid false positive > recursive locking. > > v2: > 1. register adev->lock_key into lockdep, otherwise lockdep will > report the below warning > > [ 1216.705820] BUG: key ffff890183b647d0 has not been registered! > [ 1216.705924] ------------[ cut here ]------------ > [ 1216.705972] DEBUG_LOCKS_WARN_ON(1) > [ 1216.705997] WARNING: CPU: 20 PID: 541 at kernel/locking/lockdep.c:3743 lockdep_init_map+0x150/0x210 > > v3: > change to use down_write_nest_lock to annotate the false dead-lock > warning. > > Signed-off-by: Dennis Li <Dennis.Li@xxxxxxx> I'd explain a bit more in writing what the actual deadlock possibility is, but up to you folks. Reviewed-by: Daniel Vetter <daniel.vetter@xxxxxxxx> Cheers, Daniel > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > index 62ecac97fbd2..8a55b0bc044a 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > @@ -4145,12 +4145,15 @@ static int amdgpu_do_asic_reset(struct amdgpu_hive_info *hive, > return r; > } > > -static bool amdgpu_device_lock_adev(struct amdgpu_device *adev) > +static bool amdgpu_device_lock_adev(struct amdgpu_device *adev, struct amdgpu_hive_info *hive) > { > if (atomic_cmpxchg(&adev->in_gpu_reset, 0, 1) != 0) > return false; > > - down_write(&adev->reset_sem); > + if (hive) { > + down_write_nest_lock(&adev->reset_sem, &hive->hive_lock); > + } else > + down_write(&adev->reset_sem); > > atomic_inc(&adev->gpu_reset_counter); > switch (amdgpu_asic_reset_method(adev)) { > @@ -4312,7 +4315,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev, > > /* block all schedulers and reset given job's ring */ > list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head) { > - if (!amdgpu_device_lock_adev(tmp_adev)) { > + if (!amdgpu_device_lock_adev(tmp_adev, hive)) { > DRM_INFO("Bailing on TDR for s_job:%llx, as another already in progress", > job ? job->base.id : -1); > r = 0; > -- > 2.17.1 > -- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch _______________________________________________ amd-gfx mailing list amd-gfx@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/amd-gfx