>-----Original Message----- >From: Koenig, Christian >Sent: Monday, September 10, 2018 6:02 PM >To: Deng, Emily <Emily.Deng at amd.com>; amd-gfx at lists.freedesktop.org >Subject: Re: [PATCH v2] drm/amdgpu: Fix the dead lock issue. > >Am 10.09.2018 um 11:55 schrieb Deng, Emily: >>> -----Original Message----- >>> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On Behalf Of >>> Christian König >>> Sent: Monday, September 10, 2018 5:49 PM >>> To: Deng, Emily <Emily.Deng at amd.com>; amd-gfx at lists.freedesktop.org >>> Subject: Re: [PATCH v2] drm/amdgpu: Fix the dead lock issue. >>> >>> Am 10.09.2018 um 11:47 schrieb Deng, Emily: >>>>> -----Original Message----- >>>>> From: Christian König <ckoenig.leichtzumerken at gmail.com> >>>>> Sent: Monday, September 10, 2018 5:41 PM >>>>> To: Deng, Emily <Emily.Deng at amd.com>; amd-gfx at lists.freedesktop.org >>>>> Subject: Re: [PATCH v2] drm/amdgpu: Fix the dead lock issue. >>>>> >>>>> Am 10.09.2018 um 11:34 schrieb Emily Deng: >>>>>> It will ramdomly have the dead lock issue when test TDR: >>>>>> 1. amdgpu_device_handle_vram_lost gets the lock shadow_list_lock 2. >>>>>> amdgpu_bo_create locked the bo's resv lock 3. >>>>>> amdgpu_bo_create_shadow is waiting for the shadow_list_lock 4. >>>>>> amdgpu_device_recover_vram_from_shadow is waiting for the bo's >>>>>> resv lock. >>>>>> >>>>>> v2: >>>>>> Make a local copy of the list >>>>>> >>>>>> Signed-off-by: Emily Deng <Emily.Deng at amd.com> >>>>>> --- >>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 12 +++++++++++- >>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_object.h | 1 + >>>>>> 2 files changed, 12 insertions(+), 1 deletion(-) >>>>>> >>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>>>> index acfc63e..2b9f597 100644 >>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>>>> @@ -3006,6 +3006,9 @@ static int >>>>> amdgpu_device_handle_vram_lost(struct amdgpu_device *adev) >>>>>> long r = 1; >>>>>> int i = 0; >>>>>> long tmo; >>>>>> + struct list_head local_shadow_list; >>>>>> + >>>>>> + INIT_LIST_HEAD(&local_shadow_list); >>>>>> >>>>>> if (amdgpu_sriov_runtime(adev)) >>>>>> tmo = msecs_to_jiffies(8000); @@ -3013,8 +3016,15 >@@ static >>>>>> int >>>>> amdgpu_device_handle_vram_lost(struct amdgpu_device *adev) >>>>>> tmo = msecs_to_jiffies(100); >>>>>> >>>>>> DRM_INFO("recover vram bo from shadow start\n"); >>>>>> + >>>>>> mutex_lock(&adev->shadow_list_lock); >>>>>> list_for_each_entry_safe(bo, tmp, &adev->shadow_list, >>>>>> shadow_list) { >>>>>> + amdgpu_bo_ref(bo); >>>>>> + list_add_tail(&bo->copy_shadow_list, &local_shadow_list); >>>>>> + } >>>>> Please don't add an extra copy_shadow_list field to amdgpu_bo. >>>> If don't use an extra variable, the local shadow list will change >>>> according the adev->shadow_list, both for adding or deleting, it is >>>> not we >>> want. >>> >>> That is not correct, see amdgpu_bo_destroy: >>>>        if (!list_empty(&bo->shadow_list)) { >>>>                mutex_lock(&adev->shadow_list_lock); >>>>                list_del_init(&bo->shadow_list); >>>>                mutex_unlock(&adev->shadow_list_lock); >>>>        } >>> The BO is only removed from the list when it is destroyed, since we >>> grabbed a local reference it can't be destroyed. So we are safe here. >> Sorry I am not meaning the delete, what about the adding. > >That will still go to adev->shadow_list and not affect our local list in any way. >We are not interested in any newly allocated shadow BOs, so that should be >unproblematic. Understand, will send a patch later. Best wishes Emily Deng >Regards, >Christian. > >>> Regards, >>> Christian. >>> >>>>> Instead just use bo->shadow list for this. When you hold a >>>>> reference to the BO it should not be removed from the shadow list. >>>>> >>>>> Additional to that you can just use list_splice_init() to move the >>>>> whole shadow list to your local list. >>>>> >>>>> Christian. >>>>> >>>>>> + mutex_unlock(&adev->shadow_list_lock); >>>>>> + >>>>>> + list_for_each_entry_safe(bo, tmp, &local_shadow_list, >>>>>> +copy_shadow_list) { >>>>>> next = NULL; >>>>>> amdgpu_device_recover_vram_from_shadow(adev, >ring, bo, >>>>> &next); >>>>>> if (fence) { >>>>>> @@ -3033,8 +3043,8 @@ static int >>>>> amdgpu_device_handle_vram_lost(struct >>>>>> amdgpu_device *adev) >>>>>> >>>>>> dma_fence_put(fence); >>>>>> fence = next; >>>>>> + amdgpu_bo_unref(&bo); >>>>>> } >>>>>> - mutex_unlock(&adev->shadow_list_lock); >>>>>> >>>>>> if (fence) { >>>>>> r = dma_fence_wait_timeout(fence, false, tmo); diff -- >git >>>>>> a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h >>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h >>>>>> index 907fdf4..cfee16c 100644 >>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h >>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h >>>>>> @@ -103,6 +103,7 @@ struct amdgpu_bo { >>>>>> struct list_head mn_list; >>>>>> struct list_head shadow_list; >>>>>> }; >>>>>> + struct list_head copy_shadow_list; >>>>>> >>>>>> struct kgd_mem *kfd_bo; >>>>>> }; >>> _______________________________________________ >>> amd-gfx mailing list >>> amd-gfx at lists.freedesktop.org >>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx