On 2020-08-12 at 4:53 a.m., Christian König wrote:
> On 2020-08-12 at 03:19, Li, Dennis wrote:
>> [AMD Official Use Only - Internal Distribution Only]
>>
>> Hi, Felix,
>>
>> Re: It may be better to fix it the other way around in
>> amdgpu_amdkfd_alloc_gtt_mem. Always take the reset_sem inside the
>> reservation. Otherwise you will never be able to take the reset_sem
>> while any BOs are reserved. That's probably going to cause you other
>> problems later.
>> [Dennis Li] Thanks for finding this potential issue, I will change
>> it in version 2.
>>
>> Re: That makes me wonder, why do you need the reset_sem in
>> amdgpu_amdkfd_alloc_gtt_mem in the first place? There is no obvious
>> hardware access in that function. Is it for amdgpu_ttm_alloc_gart
>> updating the GART table through HDP?
>> [Dennis Li] Yes, amdgpu_gart_bind will flush the HDP and TLB. I had
>> considered protecting only amdgpu_ttm_alloc_gart before.
>
> That access is irrelevant and the lock should be removed or changed
> into a trylock.
>
> See, we need the HDP flush only because the hardware could have
> accessed the data before.
>
> But after a GPU reset the HDP is known to be clean, so this doesn't
> need any protection.
>
>> But I worry that other functions will access hardware in the future.
>> Therefore I chose an aggressive approach that takes the reset_sem at
>> the beginning of the entry functions of the amdgpu driver.
>
> This is not a good idea. We used to have such a global lock before and
> removed it because it caused all kinds of problems.

In most cases it's a global read-lock, so most of the time it should not
impact concurrency. The only "writer" that blocks all readers is the GPU
reset.

> When was this added? Looks like it slipped under my radar or I wasn't
> awake enough at that moment.

The change that added the reset_sem added read-locks in about 70 places.
I'm still concerned that this will be hard to maintain, to make sure
that future HW access will do the necessary locking. I guess that's the
motivation for doing the locking more coarse-grained.

Taking the lock higher up the call chains means fewer places that take
the lock, and new low-level code may not need its own locking in the
future because it's already covered by higher-level callers. On the
other hand, it needs to be low enough in the call chains to avoid
recursive locking or circular dependencies with other locks.
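For the HDP flush specifically, the trylock Christian suggests could
look roughly like this at the flush site in the GART path (a sketch
only, not a tested patch; it assumes amdgpu_asic_flush_hdp() stays the
flush point in amdgpu_gart_bind):

	/*
	 * Sketch: skip the HDP flush while a reset holds reset_sem for
	 * writing; after the reset the HDP is known to be clean anyway.
	 */
	if (down_read_trylock(&adev->reset_sem)) {
		amdgpu_asic_flush_hdp(adev, NULL);
		up_read(&adev->reset_sem);
	}

With something like that, amdgpu_amdkfd_alloc_gtt_mem probably would not
need to take reset_sem around amdgpu_ttm_alloc_gart at all.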
Regards,
  Felix

> Christian.
>
>>
>> Best Regards
>> Dennis Li
>> -----Original Message-----
>> From: Kuehling, Felix <Felix.Kuehling@xxxxxxx>
>> Sent: Tuesday, August 11, 2020 9:57 PM
>> To: Li, Dennis <Dennis.Li@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx;
>> Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Zhang, Hawking
>> <Hawking.Zhang@xxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx>
>> Subject: Re: [PATCH] drm/amdgpu: fix a potential circular locking
>> dependency
>>
>> On 2020-08-11 at 5:32 a.m., Dennis Li wrote:
>>> [  653.902305] ======================================================
>>> [  653.902928] WARNING: possible circular locking dependency detected
>>> [  653.903517] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted: G    OE
>>> [  653.904098] ------------------------------------------------------
>>> [  653.904675] amdgpu_test/3975 is trying to acquire lock:
>>> [  653.905241] ffff97848f8647a0 (&adev->reset_sem){.+.+}, at:
>>> amdgpu_gem_va_ioctl+0x286/0x4f0 [amdgpu]
>>> [  653.905953]
>>> but task is already holding lock:
>>> [  653.907087] ffff9744adbee1f8 (reservation_ww_class_mutex){+.+.},
>>> at: ttm_eu_reserve_buffers+0x1ae/0x520 [ttm]
>>> [  653.907694]
>>> which lock already depends on the new lock.
>>>
>>> [  653.909423]
>>> the existing dependency chain (in reverse order) is:
>>> [  653.910594]
>>> -> #1 (reservation_ww_class_mutex){+.+.}:
>>> [  653.911759]        __ww_mutex_lock.constprop.15+0xca/0x1120
>>> [  653.912350]        ww_mutex_lock+0x73/0x80
>>> [  653.913044]        amdgpu_amdkfd_alloc_gtt_mem+0xde/0x380 [amdgpu]
>>> [  653.913724]        kgd2kfd_device_init+0x13f/0x5e0 [amdgpu]
>>> [  653.914388]        amdgpu_amdkfd_device_init+0x155/0x190 [amdgpu]
>>> [  653.915033]        amdgpu_device_init+0x1303/0x1e10 [amdgpu]
>>> [  653.915685]        amdgpu_driver_load_kms+0x5c/0x2c0 [amdgpu]
>>> [  653.916349]        amdgpu_pci_probe+0x11d/0x200 [amdgpu]
>>> [  653.916959]        local_pci_probe+0x47/0xa0
>>> [  653.917570]        work_for_cpu_fn+0x1a/0x30
>>> [  653.918184]        process_one_work+0x29e/0x630
>>> [  653.918803]        worker_thread+0x22b/0x3f0
>>> [  653.919427]        kthread+0x12f/0x150
>>> [  653.920047]        ret_from_fork+0x3a/0x50
>>> [  653.920661]
>>> -> #0 (&adev->reset_sem){.+.+}:
>>> [  653.921893]        __lock_acquire+0x13ec/0x16e0
>>> [  653.922531]        lock_acquire+0xb8/0x1c0
>>> [  653.923174]        down_read+0x48/0x230
>>> [  653.923886]        amdgpu_gem_va_ioctl+0x286/0x4f0 [amdgpu]
>>> [  653.924588]        drm_ioctl_kernel+0xb6/0x100 [drm]
>>> [  653.925283]        drm_ioctl+0x389/0x450 [drm]
>>> [  653.926013]        amdgpu_drm_ioctl+0x4f/0x80 [amdgpu]
>>> [  653.926686]        ksys_ioctl+0x98/0xb0
>>> [  653.927357]        __x64_sys_ioctl+0x1a/0x20
>>> [  653.928030]        do_syscall_64+0x5f/0x250
>>> [  653.928697]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>> [  653.929373]
>>> other info that might help us debug this:
>>>
>>> [  653.931356]  Possible unsafe locking scenario:
>>>
>>> [  653.932647]        CPU0                    CPU1
>>> [  653.933287]        ----                    ----
>>> [  653.933911]   lock(reservation_ww_class_mutex);
>>> [  653.934530]                                lock(&adev->reset_sem);
>>> [  653.935154]                                lock(reservation_ww_class_mutex);
>>> [  653.935766]   lock(&adev->reset_sem);
>>> [  653.936360]
>>>  *** DEADLOCK ***
>>>
>>> [  653.938028] 2 locks held by amdgpu_test/3975:
>>> [  653.938574]  #0: ffffb2a862d6bcd0
>>> (reservation_ww_class_acquire){+.+.}, at:
>>> amdgpu_gem_va_ioctl+0x39b/0x4f0 [amdgpu]
>>> [  653.939233]  #1: ffff9744adbee1f8 (reservation_ww_class_mutex){+.+.},
>>> at: ttm_eu_reserve_buffers+0x1ae/0x520 [ttm]
>>>
>>> Change the order of reservation_ww_class_mutex and adev->reset_sem in
>>> amdgpu_gem_va_ioctl to match the order used in
>>> amdgpu_amdkfd_alloc_gtt_mem, to avoid a potential deadlock.
>>
>> It may be better to fix it the other way around in
>> amdgpu_amdkfd_alloc_gtt_mem. Always take the reset_sem inside the
>> reservation. Otherwise you will never be able to take the reset_sem
>> while any BOs are reserved. That's probably going to cause you other
>> problems later.
>>
>> That makes me wonder, why do you need the reset_sem in
>> amdgpu_amdkfd_alloc_gtt_mem in the first place? There is no obvious
>> hardware access in that function. Is it for amdgpu_ttm_alloc_gart
>> updating the GART table through HDP?
>>
>> Regards,
>>   Felix
>>
>>
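For comparison, the "other way around" fix suggested above for
amdgpu_amdkfd_alloc_gtt_mem would be roughly the following (an
illustrative sketch only, with error handling trimmed; the "cleanup"
label is a stand-in for that function's real error path):

	/* One consistent order everywhere: reservation (outer), reset_sem (inner). */
	r = amdgpu_bo_reserve(bo, true);
	if (r)
		goto cleanup;

	down_read(&adev->reset_sem);
	r = amdgpu_ttm_alloc_gart(&bo->tbo);	/* GART update, HDP/TLB flush */
	up_read(&adev->reset_sem);

	amdgpu_bo_unreserve(bo);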
>>> Signed-off-by: Dennis Li <Dennis.Li@xxxxxxx>
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>>> index ee1e8fff83b2..fc889c477696 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>>> @@ -652,6 +652,8 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
>>>  		abo = NULL;
>>>  	}
>>>
>>> +	down_read(&adev->reset_sem);
>>> +
>>>  	amdgpu_vm_get_pd_bo(&fpriv->vm, &list, &vm_pd);
>>>
>>>  	r = ttm_eu_reserve_buffers(&ticket, &list, true, &duplicates);
>>> @@ -670,8 +672,6 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
>>>  		bo_va = NULL;
>>>  	}
>>>
>>> -	down_read(&adev->reset_sem);
>>> -
>>>  	switch (args->operation) {
>>>  	case AMDGPU_VA_OP_MAP:
>>>  		va_flags = amdgpu_gem_va_map_flags(adev, args->flags);
>>> @@ -701,12 +701,11 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
>>>  		amdgpu_gem_va_update_vm(adev, &fpriv->vm, bo_va,
>>>  					args->operation);
>>>
>>> -	up_read(&adev->reset_sem);
>>> -
>>>  error_backoff:
>>>  	ttm_eu_backoff_reservation(&ticket, &list);
>>>
>>>  error_unref:
>>> +	up_read(&adev->reset_sem);
>>>  	drm_gem_object_put_unlocked(gobj);
>>>  	return r;
>>>  }
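For completeness, the writer side that all of these down_read() calls
synchronize against is the GPU reset path, presumably somewhere around
amdgpu_device_gpu_recover(). Heavily simplified, and with do_asic_reset()
as a placeholder rather than a real function:

	/* Block out every reader (ioctls, KFD, ...) while the ASIC resets. */
	down_write(&adev->reset_sem);
	r = do_asic_reset(adev);	/* placeholder for the real reset sequence */
	up_write(&adev->reset_sem);

A waiting writer blocks new readers, which is what turns the inverted
read-lock order in the lockdep report above into a real deadlock.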