[AMD Public Use] Hi Christian, Since it's out of amdgpu driver scope, so I just attached my patch in this email thread instead of sending it by git sent-email. Regards, Guchun -----Original Message----- From: Koenig, Christian <Christian.Koenig@xxxxxxx> Sent: Tuesday, August 11, 2020 3:52 PM To: Chen, Guchun <Guchun.Chen@xxxxxxx>; Li, Dennis <Dennis.Li@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Kuehling, Felix <Felix.Kuehling@xxxxxxx>; Zhang, Hawking <Hawking.Zhang@xxxxxxx>; daniel@xxxxxxxx Subject: Re: [PATCH v3] drm/amdgpu: annotate a false positive recursive locking Nice catch, it's not often that we stumble over something like this :) Can you provide a patch to fix this? Problem is that we probably won't be able to push it to the AMD servers, but I can probably merge it through drm-misc-next. Until this is fixed feel free to commit the patch with the {} in place. Thanks, Christian. Am 11.08.20 um 09:44 schrieb Chen, Guchun: > [AMD Public Use] > > # define down_write_nest_lock(sem, nest_lock) \ > do { \ > typecheck(struct lockdep_map *, &(nest_lock)->dep_map); \ > _down_write_nest_lock(sem, &(nest_lock)->dep_map); \ > } while (0); > > Looks the ';' after while (0) is the error point. It should be dropped. > > Regards, > Guchun > > -----Original Message----- > From: Koenig, Christian <Christian.Koenig@xxxxxxx> > Sent: Tuesday, August 11, 2020 3:41 PM > To: Chen, Guchun <Guchun.Chen@xxxxxxx>; Li, Dennis > <Dennis.Li@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Deucher, Alexander > <Alexander.Deucher@xxxxxxx>; Kuehling, Felix <Felix.Kuehling@xxxxxxx>; > Zhang, Hawking <Hawking.Zhang@xxxxxxx>; daniel@xxxxxxxx > Subject: Re: [PATCH v3] drm/amdgpu: annotate a false positive > recursive locking > > Am 11.08.20 um 09:36 schrieb Chen, Guchun: >> [AMD Public Use] >> >>> - down_write(&adev->reset_sem); >>> + if (hive) { >>> + down_write_nest_lock(&adev->reset_sem, &hive->hive_lock); >>> + } else >>> + down_write(&adev->reset_sem); >> Coding style nit pick: You should drop the {} here. >> >> {} could not be dropped here, as down_write_nest_lock Is one macro with multiple lines, otherwise, build error would say 'else' missed one previous 'if'. >> Instead of dropping, another {} should be added to else to include down_write(&adev->reset_sem), which makes the braces been balanced. > Interesting bug, that is something which should not happen with macros in the Linux kernel. They protection by "do { ... } while(0)" is mandatory here. > > Let me take a look, > Christian. > >> Regards, >> Guchun >> >> -----Original Message----- >> From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf Of >> Christian König >> Sent: Tuesday, August 11, 2020 2:53 PM >> To: Li, Dennis <Dennis.Li@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx; >> Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Kuehling, Felix >> <Felix.Kuehling@xxxxxxx>; Zhang, Hawking <Hawking.Zhang@xxxxxxx>; >> daniel@xxxxxxxx >> Subject: Re: [PATCH v3] drm/amdgpu: annotate a false positive >> recursive locking >> >> Am 11.08.20 um 04:12 schrieb Dennis Li: >>> [ 584.110304] ============================================ >>> [ 584.110590] WARNING: possible recursive locking detected >>> [ 584.110876] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted: G OE >>> [ 584.111164] -------------------------------------------- >>> [ 584.111456] kworker/38:1/553 is trying to acquire lock: >>> [ 584.111721] ffff9b15ff0a47a0 (&adev->reset_sem){++++}, at: >>> amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ 584.112112] >>> but task is already holding lock: >>> [ 584.112673] ffff9b1603d247a0 (&adev->reset_sem){++++}, at: >>> amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ 584.113068] >>> other info that might help us debug this: >>> [ 584.113689] Possible unsafe locking scenario: >>> >>> [ 584.114350] CPU0 >>> [ 584.114685] ---- >>> [ 584.115014] lock(&adev->reset_sem); >>> [ 584.115349] lock(&adev->reset_sem); >>> [ 584.115678] >>> *** DEADLOCK *** >>> >>> [ 584.116624] May be due to missing lock nesting notation >>> >>> [ 584.117284] 4 locks held by kworker/38:1/553: >>> [ 584.117616] #0: ffff9ad635c1d348 ((wq_completion)events){+.+.}, >>> at: process_one_work+0x21f/0x630 [ 584.117967] #1: >>> ffffac708e1c3e58 ((work_completion)(&con->recovery_work)){+.+.}, at: >>> process_one_work+0x21f/0x630 [ 584.118358] #2: ffffffffc1c2a5d0 >>> (&tmp->hive_lock){+.+.}, at: amdgpu_device_gpu_recover+0xae/0x1030 [amdgpu] [ 584.118786] #3: ffff9b1603d247a0 (&adev->reset_sem){++++}, at: amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ 584.119222] >>> stack backtrace: >>> [ 584.119990] CPU: 38 PID: 553 Comm: kworker/38:1 Kdump: loaded Tainted: G OE 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 >>> [ 584.120782] Hardware name: Supermicro SYS-7049GP-TRT/X11DPG-QT, >>> BIOS 3.1 05/23/2019 [ 584.121223] Workqueue: events >>> amdgpu_ras_do_recovery [amdgpu] [ 584.121638] Call Trace: >>> [ 584.122050] dump_stack+0x98/0xd5 [ 584.122499] >>> __lock_acquire+0x1139/0x16e0 [ 584.122931] ? >>> trace_hardirqs_on+0x3b/0xf0 [ 584.123358] ? >>> cancel_delayed_work+0xa6/0xc0 [ 584.123771] >>> lock_acquire+0xb8/0x1c0 [ 584.124197] ? >>> amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ 584.124599] down_write+0x49/0x120 [ 584.125032] ? >>> amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ 584.125472] >>> amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu] [ 584.125910] ? >>> amdgpu_ras_error_query+0x1b8/0x2a0 [amdgpu] [ 584.126367] >>> amdgpu_ras_do_recovery+0x159/0x190 [amdgpu] [ 584.126789] >>> process_one_work+0x29e/0x630 [ 584.127208] >>> worker_thread+0x3c/0x3f0 [ 584.127621] ? >>> __kthread_parkme+0x61/0x90 [ 584.128014] >>> kthread+0x12f/0x150 [ 584.128402] ? process_one_work+0x630/0x630 [ >>> 584.128790] ? kthread_park+0x90/0x90 [ 584.129174] >>> ret_from_fork+0x3a/0x50 >>> >>> Each adev has owned lock_class_key to avoid false positive recursive >>> locking. >>> >>> v2: >>> 1. register adev->lock_key into lockdep, otherwise lockdep will >>> report the below warning >>> >>> [ 1216.705820] BUG: key ffff890183b647d0 has not been registered! >>> [ 1216.705924] ------------[ cut here ]------------ [ 1216.705972] >>> DEBUG_LOCKS_WARN_ON(1) [ 1216.705997] WARNING: CPU: 20 PID: 541 at >>> kernel/locking/lockdep.c:3743 lockdep_init_map+0x150/0x210 >>> >>> v3: >>> change to use down_write_nest_lock to annotate the false dead-lock >>> warning. >>> >>> Signed-off-by: Dennis Li <Dennis.Li@xxxxxxx> >>> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>> index 62ecac97fbd2..8a55b0bc044a 100644 >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>> @@ -4145,12 +4145,15 @@ static int amdgpu_do_asic_reset(struct amdgpu_hive_info *hive, >>> return r; >>> } >>> >>> -static bool amdgpu_device_lock_adev(struct amdgpu_device *adev) >>> +static bool amdgpu_device_lock_adev(struct amdgpu_device *adev, >>> +struct amdgpu_hive_info *hive) >>> { >>> if (atomic_cmpxchg(&adev->in_gpu_reset, 0, 1) != 0) >>> return false; >>> >>> - down_write(&adev->reset_sem); >>> + if (hive) { >>> + down_write_nest_lock(&adev->reset_sem, &hive->hive_lock); >>> + } else >>> + down_write(&adev->reset_sem); >> Coding style nit pick: You should drop the {} here. >> >> Apart from that the patch is Reviewed-by: Christian König >> <christian.koenig@xxxxxxx> >> >>> >>> atomic_inc(&adev->gpu_reset_counter); >>> switch (amdgpu_asic_reset_method(adev)) { @@ -4312,7 +4315,7 @@ >>> int amdgpu_device_gpu_recover(struct amdgpu_device *adev, >>> >>> /* block all schedulers and reset given job's ring */ >>> list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head) { >>> - if (!amdgpu_device_lock_adev(tmp_adev)) { >>> + if (!amdgpu_device_lock_adev(tmp_adev, hive)) { >>> DRM_INFO("Bailing on TDR for s_job:%llx, as another already in progress", >>> job ? job->base.id : -1); >>> r = 0; >> _______________________________________________ >> amd-gfx mailing list >> amd-gfx@xxxxxxxxxxxxxxxxxxxxx >> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flis >> t >> s.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&data=02%7C01%7Cg >> u >> chun.chen%40amd.com%7C0a3fd06c0e774002a0e908d83dc337a0%7C3dd8961fe488 >> 4 >> e608e11a82d994e183d%7C0%7C0%7C637327255952065510&sdata=rJ%2BbWsWJ >> 6 >> NfPUinwUHxnRfVxYi%2Ft4JQCTtamMATrLws%3D&reserved=0
Attachment:
0001-locking-rwsem-remove-redundant-semicolon-of-down_wri.patch
Description: 0001-locking-rwsem-remove-redundant-semicolon-of-down_wri.patch
_______________________________________________ amd-gfx mailing list amd-gfx@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/amd-gfx