Am 10.09.2018 um 11:47 schrieb Deng, Emily: >> -----Original Message----- >> From: Christian König <ckoenig.leichtzumerken at gmail.com> >> Sent: Monday, September 10, 2018 5:41 PM >> To: Deng, Emily <Emily.Deng at amd.com>; amd-gfx at lists.freedesktop.org >> Subject: Re: [PATCH v2] drm/amdgpu: Fix the dead lock issue. >> >> Am 10.09.2018 um 11:34 schrieb Emily Deng: >>> It will ramdomly have the dead lock issue when test TDR: >>> 1. amdgpu_device_handle_vram_lost gets the lock shadow_list_lock 2. >>> amdgpu_bo_create locked the bo's resv lock 3. amdgpu_bo_create_shadow >>> is waiting for the shadow_list_lock 4. >>> amdgpu_device_recover_vram_from_shadow is waiting for the bo's resv >>> lock. >>> >>> v2: >>> Make a local copy of the list >>> >>> Signed-off-by: Emily Deng <Emily.Deng at amd.com> >>> --- >>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 12 +++++++++++- >>> drivers/gpu/drm/amd/amdgpu/amdgpu_object.h | 1 + >>> 2 files changed, 12 insertions(+), 1 deletion(-) >>> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>> index acfc63e..2b9f597 100644 >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>> @@ -3006,6 +3006,9 @@ static int >> amdgpu_device_handle_vram_lost(struct amdgpu_device *adev) >>> long r = 1; >>> int i = 0; >>> long tmo; >>> + struct list_head local_shadow_list; >>> + >>> + INIT_LIST_HEAD(&local_shadow_list); >>> >>> if (amdgpu_sriov_runtime(adev)) >>> tmo = msecs_to_jiffies(8000); >>> @@ -3013,8 +3016,15 @@ static int >> amdgpu_device_handle_vram_lost(struct amdgpu_device *adev) >>> tmo = msecs_to_jiffies(100); >>> >>> DRM_INFO("recover vram bo from shadow start\n"); >>> + >>> mutex_lock(&adev->shadow_list_lock); >>> list_for_each_entry_safe(bo, tmp, &adev->shadow_list, shadow_list) >>> { >>> + amdgpu_bo_ref(bo); >>> + list_add_tail(&bo->copy_shadow_list, &local_shadow_list); >>> + } >> Please don't add an extra copy_shadow_list field to amdgpu_bo. > If don't use an extra variable, the local shadow list will change according the adev->shadow_list, > both for adding or deleting, it is not we want. That is not correct, see amdgpu_bo_destroy: >        if (!list_empty(&bo->shadow_list)) { >                mutex_lock(&adev->shadow_list_lock); >                list_del_init(&bo->shadow_list); >                mutex_unlock(&adev->shadow_list_lock); >        } The BO is only removed from the list when it is destroyed, since we grabbed a local reference it can't be destroyed. So we are safe here. Regards, Christian. > >> Instead just use bo->shadow list for this. When you hold a reference to the BO >> it should not be removed from the shadow list. >> >> Additional to that you can just use list_splice_init() to move the whole shadow >> list to your local list. >> >> Christian. >> >>> + mutex_unlock(&adev->shadow_list_lock); >>> + >>> + list_for_each_entry_safe(bo, tmp, &local_shadow_list, >>> +copy_shadow_list) { >>> next = NULL; >>> amdgpu_device_recover_vram_from_shadow(adev, ring, bo, >> &next); >>> if (fence) { >>> @@ -3033,8 +3043,8 @@ static int >> amdgpu_device_handle_vram_lost(struct >>> amdgpu_device *adev) >>> >>> dma_fence_put(fence); >>> fence = next; >>> + amdgpu_bo_unref(&bo); >>> } >>> - mutex_unlock(&adev->shadow_list_lock); >>> >>> if (fence) { >>> r = dma_fence_wait_timeout(fence, false, tmo); diff --git >>> a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h >>> index 907fdf4..cfee16c 100644 >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h >>> @@ -103,6 +103,7 @@ struct amdgpu_bo { >>> struct list_head mn_list; >>> struct list_head shadow_list; >>> }; >>> + struct list_head copy_shadow_list; >>> >>> struct kgd_mem *kfd_bo; >>> };