On 12.08.20 at 12:02, Li, Dennis wrote:
> On 12.08.20 at 11:23, Li, Dennis wrote:
>> On 12.08.20 at 03:33, Li, Dennis wrote:
>>> Hi, Christian,
>>>
>>> Re: I was wondering the same thing for the amdgpu_gem_va_ioctl() as well. We shouldn't have any hardware access here, so taking the reset_sem looks like overkill to me.
>>>
>>> [Dennis Li] amdgpu_vm_bo_unmap, amdgpu_vm_bo_clear_mappings, amdgpu_vm_bo_replace_map and amdgpu_gem_va_update_vm all have a chance to access hardware.
>>
>> This is complete nonsense. The functions intentionally work through the scheduler to avoid accessing the hardware directly, for exactly that reason.
>>
>> The only hardware access we have here is the HDP flush, and that can fail in the case of a GPU reset without causing problems.
>>
>> [Dennis Li] amdgpu_vm_bo_clear_mappings -> amdgpu_vm_prt_get -> amdgpu_vm_update_prt_state -> gmc_v8_0_set_prt
>
> That is for pre-gfx9 hardware and only called once during the initial enabling of the feature.
>
> Please remove that locking again since it is clearly completely against the driver design.
>
> [Dennis Li] Okay, if you agree, I will change it to only protect amdgpu_gem_va_update_vm in this function.

Even better, only protect the amdgpu_vm_update_prt_state() function.

[Dennis Li] Got it. According to your suggestion, I will also narrow down the scope of reset_sem in other functions.

Christian.

>
> Christian.
>
>> Regards,
>> Christian.
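For illustration, narrowing the lock to amdgpu_vm_update_prt_state() as suggested could look roughly like the sketch below. It assumes the v5.6-era body of that function and that adev->reset_sem is a sleeping rw_semaphore, which is why it has to be taken before the prt_lock spinlock; this is a sketch of the idea, not the submitted patch.

    /* Sketch only: confine reset_sem to the one path that really
     * programs registers (the gmc set_prt callback), instead of
     * wrapping whole ioctls.
     */
    static void amdgpu_vm_update_prt_state(struct amdgpu_device *adev)
    {
    	unsigned long flags;
    	bool enable;

    	/* reset_sem can sleep, so take it outside the spinlock */
    	down_read(&adev->reset_sem);

    	spin_lock_irqsave(&adev->vm_manager.prt_lock, flags);
    	enable = !!atomic_read(&adev->vm_manager.num_prt_users);
    	adev->gmc.gmc_funcs->set_prt(adev, enable);
    	spin_unlock_irqrestore(&adev->vm_manager.prt_lock, flags);

    	up_read(&adev->reset_sem);
    }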
>>> Best Regards
>>> Dennis Li
>>>
>>> -----Original Message-----
>>> From: Koenig, Christian <Christian.Koenig@xxxxxxx>
>>> Sent: Wednesday, August 12, 2020 12:15 AM
>>> To: Kuehling, Felix <Felix.Kuehling@xxxxxxx>; Li, Dennis <Dennis.Li@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Zhang, Hawking <Hawking.Zhang@xxxxxxx>
>>> Subject: Re: [PATCH] drm/amdgpu: fix a potential circular locking dependency
>>>
>>> On 11.08.20 at 15:57, Felix Kuehling wrote:
>>>> On 2020-08-11 5:32 a.m., Dennis Li wrote:
>>>>> [ 653.902305] ======================================================
>>>>> [ 653.902928] WARNING: possible circular locking dependency detected
>>>>> [ 653.903517] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted: G OE
>>>>> [ 653.904098] ------------------------------------------------------
>>>>> [ 653.904675] amdgpu_test/3975 is trying to acquire lock:
>>>>> [ 653.905241] ffff97848f8647a0 (&adev->reset_sem){.+.+}, at: amdgpu_gem_va_ioctl+0x286/0x4f0 [amdgpu]
>>>>> [ 653.905953] but task is already holding lock:
>>>>> [ 653.907087] ffff9744adbee1f8 (reservation_ww_class_mutex){+.+.}, at: ttm_eu_reserve_buffers+0x1ae/0x520 [ttm]
>>>>> [ 653.907694] which lock already depends on the new lock.
>>>>>
>>>>> [ 653.909423] the existing dependency chain (in reverse order) is:
>>>>> [ 653.910594] -> #1 (reservation_ww_class_mutex){+.+.}:
>>>>> [ 653.911759]        __ww_mutex_lock.constprop.15+0xca/0x1120
>>>>> [ 653.912350]        ww_mutex_lock+0x73/0x80
>>>>> [ 653.913044]        amdgpu_amdkfd_alloc_gtt_mem+0xde/0x380 [amdgpu]
>>>>> [ 653.913724]        kgd2kfd_device_init+0x13f/0x5e0 [amdgpu]
>>>>> [ 653.914388]        amdgpu_amdkfd_device_init+0x155/0x190 [amdgpu]
>>>>> [ 653.915033]        amdgpu_device_init+0x1303/0x1e10 [amdgpu]
>>>>> [ 653.915685]        amdgpu_driver_load_kms+0x5c/0x2c0 [amdgpu]
>>>>> [ 653.916349]        amdgpu_pci_probe+0x11d/0x200 [amdgpu]
>>>>> [ 653.916959]        local_pci_probe+0x47/0xa0
>>>>> [ 653.917570]        work_for_cpu_fn+0x1a/0x30
>>>>> [ 653.918184]        process_one_work+0x29e/0x630
>>>>> [ 653.918803]        worker_thread+0x22b/0x3f0
>>>>> [ 653.919427]        kthread+0x12f/0x150
>>>>> [ 653.920047]        ret_from_fork+0x3a/0x50
>>>>>
>>>>> [ 653.920661] -> #0 (&adev->reset_sem){.+.+}:
>>>>> [ 653.921893]        __lock_acquire+0x13ec/0x16e0
>>>>> [ 653.922531]        lock_acquire+0xb8/0x1c0
>>>>> [ 653.923174]        down_read+0x48/0x230
>>>>> [ 653.923886]        amdgpu_gem_va_ioctl+0x286/0x4f0 [amdgpu]
>>>>> [ 653.924588]        drm_ioctl_kernel+0xb6/0x100 [drm]
>>>>> [ 653.925283]        drm_ioctl+0x389/0x450 [drm]
>>>>> [ 653.926013]        amdgpu_drm_ioctl+0x4f/0x80 [amdgpu]
>>>>> [ 653.926686]        ksys_ioctl+0x98/0xb0
>>>>> [ 653.927357]        __x64_sys_ioctl+0x1a/0x20
>>>>> [ 653.928030]        do_syscall_64+0x5f/0x250
>>>>> [ 653.928697]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>>>>
>>>>> [ 653.929373] other info that might help us debug this:
>>>>>
>>>>> [ 653.931356] Possible unsafe locking scenario:
>>>>>
>>>>> [ 653.932647]        CPU0                    CPU1
>>>>> [ 653.933287]        ----                    ----
>>>>> [ 653.933911]   lock(reservation_ww_class_mutex);
>>>>> [ 653.934530]                                lock(&adev->reset_sem);
>>>>> [ 653.935154]                                lock(reservation_ww_class_mutex);
>>>>> [ 653.935766]   lock(&adev->reset_sem);
>>>>>
>>>>> [ 653.936360] *** DEADLOCK ***
>>>>>
>>>>> [ 653.938028] 2 locks held by amdgpu_test/3975:
>>>>> [ 653.938574]  #0: ffffb2a862d6bcd0 (reservation_ww_class_acquire){+.+.}, at: amdgpu_gem_va_ioctl+0x39b/0x4f0 [amdgpu]
>>>>> [ 653.939233]  #1: ffff9744adbee1f8 (reservation_ww_class_mutex){+.+.}, at: ttm_eu_reserve_buffers+0x1ae/0x520 [ttm]
>>>>>
>>>>> Change the order of reservation_ww_class_mutex and adev->reset_sem in amdgpu_gem_va_ioctl to match the order in amdgpu_amdkfd_alloc_gtt_mem, to avoid a potential deadlock.
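Condensed, the inversion the trace above reports is the two call paths taking the same pair of locks in opposite orders. An illustrative sketch (the locking calls are the real ones from the trace; everything else is elided):

    /* Init path (amdgpu_amdkfd_alloc_gtt_mem), dependency chain #1:
     * the BO reservation is taken while reset_sem is already held.
     */
    down_read(&adev->reset_sem);      /* lock A */
    amdgpu_bo_reserve(bo, true);      /* lock B: reservation_ww_class_mutex */
    /* ... */
    amdgpu_bo_unreserve(bo);
    up_read(&adev->reset_sem);

    /* Ioctl path (amdgpu_gem_va_ioctl before this patch), chain #0:
     * reset_sem is taken while the reservation is already held.
     */
    ttm_eu_reserve_buffers(&ticket, &list, true, &duplicates);  /* lock B */
    down_read(&adev->reset_sem);      /* lock A: order B -> A, inverted */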
>>>> It may be better to fix it the other way around in
>>>> amdgpu_amdkfd_alloc_gtt_mem. Always take the reset_sem inside the
>>>> reservation. Otherwise you will never be able to take the reset_sem
>>>> while any BOs are reserved. That's probably going to cause you
>>>> other problems later.
>>>>
>>>> That makes me wonder: why do you need the reset_sem in
>>>> amdgpu_amdkfd_alloc_gtt_mem in the first place? There is no obvious
>>>> hardware access in that function. Is it for amdgpu_ttm_alloc_gart
>>>> updating the GART table through HDP?
>>>
>>> I was wondering the same thing for the amdgpu_gem_va_ioctl() as well.
>>>
>>> We shouldn't have any hardware access here, so taking the reset_sem looks like overkill to me.
>>>
>>> Christian.
>>>
>>>> Regards,
>>>> Felix
>>>>
>>>>
>>>>> Signed-off-by: Dennis Li <Dennis.Li@xxxxxxx>
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>>>>> index ee1e8fff83b2..fc889c477696 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>>>>> @@ -652,6 +652,8 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
>>>>>  		abo = NULL;
>>>>>  	}
>>>>>
>>>>> +	down_read(&adev->reset_sem);
>>>>> +
>>>>>  	amdgpu_vm_get_pd_bo(&fpriv->vm, &list, &vm_pd);
>>>>>
>>>>>  	r = ttm_eu_reserve_buffers(&ticket, &list, true,
>>>>>  				   &duplicates);
>>>>> @@ -670,8 +672,6 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
>>>>>  		bo_va = NULL;
>>>>>  	}
>>>>>
>>>>> -	down_read(&adev->reset_sem);
>>>>> -
>>>>>  	switch (args->operation) {
>>>>>  	case AMDGPU_VA_OP_MAP:
>>>>>  		va_flags = amdgpu_gem_va_map_flags(adev, args->flags);
>>>>> @@ -701,12 +701,11 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
>>>>>  		amdgpu_gem_va_update_vm(adev, &fpriv->vm, bo_va,
>>>>>  					args->operation);
>>>>>
>>>>> -	up_read(&adev->reset_sem);
>>>>> -
>>>>>  error_backoff:
>>>>>  	ttm_eu_backoff_reservation(&ticket, &list);
>>>>>
>>>>> error_unref:
>>>>> +	up_read(&adev->reset_sem);
>>>>>  	drm_gem_object_put_unlocked(gobj);
>>>>>  	return r;
>>>>>  }
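For contrast, a rough sketch of the alternative Felix proposes above: keep the reservation outermost and push reset_sem inside it in amdgpu_amdkfd_alloc_gtt_mem instead. The calls are the v5.6-era ones, the unwinding is elided, and whether amdgpu_ttm_alloc_gart() is what actually needs protecting is exactly the open question in the thread:

    /* Sketch: reservation -> reset_sem on this path too, so both
     * paths agree on the lock order and the cycle above disappears.
     */
    r = amdgpu_bo_reserve(bo, true);
    if (r)
    	return r;	/* the real code unwinds the allocation here */

    down_read(&adev->reset_sem);		/* now inside the reservation */
    r = amdgpu_ttm_alloc_gart(&bo->tbo);	/* the suspected HDP/GART access */
    up_read(&adev->reset_sem);

    amdgpu_bo_unreserve(bo);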