>-----Original Message----- >From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On Behalf Of Deng, >Emily >Sent: Monday, September 10, 2018 6:33 PM >To: Koenig, Christian <Christian.Koenig at amd.com>; amd- >gfx at lists.freedesktop.org >Subject: RE: [PATCH v2] drm/amdgpu: Fix the dead lock issue. > >>-----Original Message----- >>From: Koenig, Christian >>Sent: Monday, September 10, 2018 6:02 PM >>To: Deng, Emily <Emily.Deng at amd.com>; amd-gfx at lists.freedesktop.org >>Subject: Re: [PATCH v2] drm/amdgpu: Fix the dead lock issue. >> >>Am 10.09.2018 um 11:55 schrieb Deng, Emily: >>>> -----Original Message----- >>>> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On Behalf Of >>>> Christian König >>>> Sent: Monday, September 10, 2018 5:49 PM >>>> To: Deng, Emily <Emily.Deng at amd.com>; amd-gfx at lists.freedesktop.org >>>> Subject: Re: [PATCH v2] drm/amdgpu: Fix the dead lock issue. >>>> >>>> Am 10.09.2018 um 11:47 schrieb Deng, Emily: >>>>>> -----Original Message----- >>>>>> From: Christian König <ckoenig.leichtzumerken at gmail.com> >>>>>> Sent: Monday, September 10, 2018 5:41 PM >>>>>> To: Deng, Emily <Emily.Deng at amd.com>; >>>>>> amd-gfx at lists.freedesktop.org >>>>>> Subject: Re: [PATCH v2] drm/amdgpu: Fix the dead lock issue. >>>>>> >>>>>> Am 10.09.2018 um 11:34 schrieb Emily Deng: >>>>>>> It will ramdomly have the dead lock issue when test TDR: >>>>>>> 1. amdgpu_device_handle_vram_lost gets the lock shadow_list_lock 2. >>>>>>> amdgpu_bo_create locked the bo's resv lock 3. >>>>>>> amdgpu_bo_create_shadow is waiting for the shadow_list_lock 4. >>>>>>> amdgpu_device_recover_vram_from_shadow is waiting for the bo's >>>>>>> resv lock. >>>>>>> >>>>>>> v2: >>>>>>> Make a local copy of the list >>>>>>> >>>>>>> Signed-off-by: Emily Deng <Emily.Deng at amd.com> >>>>>>> --- >>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 12 >+++++++++++- >>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_object.h | 1 + >>>>>>> 2 files changed, 12 insertions(+), 1 deletion(-) >>>>>>> >>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>>>>> index acfc63e..2b9f597 100644 >>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>>>>> @@ -3006,6 +3006,9 @@ static int >>>>>> amdgpu_device_handle_vram_lost(struct amdgpu_device *adev) >>>>>>> long r = 1; >>>>>>> int i = 0; >>>>>>> long tmo; >>>>>>> + struct list_head local_shadow_list; >>>>>>> + >>>>>>> + INIT_LIST_HEAD(&local_shadow_list); >>>>>>> >>>>>>> if (amdgpu_sriov_runtime(adev)) >>>>>>> tmo = msecs_to_jiffies(8000); @@ -3013,8 +3016,15 >>@@ static >>>>>>> int >>>>>> amdgpu_device_handle_vram_lost(struct amdgpu_device *adev) >>>>>>> tmo = msecs_to_jiffies(100); >>>>>>> >>>>>>> DRM_INFO("recover vram bo from shadow start\n"); >>>>>>> + >>>>>>> mutex_lock(&adev->shadow_list_lock); >>>>>>> list_for_each_entry_safe(bo, tmp, &adev->shadow_list, >>>>>>> shadow_list) { >>>>>>> + amdgpu_bo_ref(bo); >>>>>>> + list_add_tail(&bo->copy_shadow_list, >&local_shadow_list); >>>>>>> + } >>>>>> Please don't add an extra copy_shadow_list field to amdgpu_bo. >>>>> If don't use an extra variable, the local shadow list will change >>>>> according the adev->shadow_list, both for adding or deleting, it is >>>>> not we >>>> want. >>>> >>>> That is not correct, see amdgpu_bo_destroy: >>>>>        if (!list_empty(&bo->shadow_list)) { >>>>>                mutex_lock(&adev->shadow_list_lock); >>>>>                list_del_init(&bo->shadow_list); >>>>>                mutex_unlock(&adev->shadow_list_lock); >>>>>        } >>>> The BO is only removed from the list when it is destroyed, since we >>>> grabbed a local reference it can't be destroyed. So we are safe here. The amdgpu_bo_ref could only avoid deleting the bo, but when it is already in amdgpu_bo_destroy, the shadow_list seems still will be delete. >>> Sorry I am not meaning the delete, what about the adding. >> >>That will still go to adev->shadow_list and not affect our local list in any way. > >>We are not interested in any newly allocated shadow BOs, so that should >>be unproblematic. >Understand, will send a patch later. > >Best wishes >Emily Deng >>Regards, >>Christian. >> >>>> Regards, >>>> Christian. >>>> >>>>>> Instead just use bo->shadow list for this. When you hold a >>>>>> reference to the BO it should not be removed from the shadow list. >>>>>> >>>>>> Additional to that you can just use list_splice_init() to move the >>>>>> whole shadow list to your local list. >>>>>> >>>>>> Christian. >>>>>> >>>>>>> + mutex_unlock(&adev->shadow_list_lock); >>>>>>> + >>>>>>> + list_for_each_entry_safe(bo, tmp, &local_shadow_list, >>>>>>> +copy_shadow_list) { >>>>>>> next = NULL; >>>>>>> amdgpu_device_recover_vram_from_shadow(adev, >>ring, bo, >>>>>> &next); >>>>>>> if (fence) { >>>>>>> @@ -3033,8 +3043,8 @@ static int >>>>>> amdgpu_device_handle_vram_lost(struct >>>>>>> amdgpu_device *adev) >>>>>>> >>>>>>> dma_fence_put(fence); >>>>>>> fence = next; >>>>>>> + amdgpu_bo_unref(&bo); >>>>>>> } >>>>>>> - mutex_unlock(&adev->shadow_list_lock); >>>>>>> >>>>>>> if (fence) { >>>>>>> r = dma_fence_wait_timeout(fence, false, tmo); diff -- >>git >>>>>>> a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h >>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h >>>>>>> index 907fdf4..cfee16c 100644 >>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h >>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h >>>>>>> @@ -103,6 +103,7 @@ struct amdgpu_bo { >>>>>>> struct list_head mn_list; >>>>>>> struct list_head shadow_list; >>>>>>> }; >>>>>>> + struct list_head copy_shadow_list; >>>>>>> >>>>>>> struct kgd_mem *kfd_bo; >>>>>>> }; >>>> _______________________________________________ >>>> amd-gfx mailing list >>>> amd-gfx at lists.freedesktop.org >>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx > >_______________________________________________ >amd-gfx mailing list >amd-gfx at lists.freedesktop.org >https://lists.freedesktop.org/mailman/listinfo/amd-gfx