On 2018å¹´09æ??11æ?¥ 11:23, Deng, Emily wrote: >> -----Original Message----- >> From: Zhou, David(ChunMing) >> Sent: Tuesday, September 11, 2018 11:03 AM >> To: Deng, Emily <Emily.Deng at amd.com>; amd-gfx at lists.freedesktop.org >> Subject: Re: [PATCH] drm/amdgpu: Fix the dead lock issue. >> >> >> >> On 2018å¹´09æ??11æ?¥ 10:51, Emily Deng wrote: >>> It will ramdomly have the dead lock issue when test TDR: >>> 1. amdgpu_device_handle_vram_lost gets the lock shadow_list_lock 2. >>> amdgpu_bo_create locked the bo's resv lock 3. amdgpu_bo_create_shadow >>> is waiting for the shadow_list_lock 4. >>> amdgpu_device_recover_vram_from_shadow is waiting for the bo's resv >>> lock. >>> >>> v2: >>> Make a local copy of the list >>> >>> Signed-off-by: Emily Deng <Emily.Deng at amd.com> >>> --- >>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 21 >> ++++++++++++++++++++- >>> 1 file changed, 20 insertions(+), 1 deletion(-) >>> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>> index 2a21267..8c81404 100644 >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>> @@ -3105,6 +3105,9 @@ static int >> amdgpu_device_handle_vram_lost(struct amdgpu_device *adev) >>> long r = 1; >>> int i = 0; >>> long tmo; >>> + struct list_head local_shadow_list; >>> + >>> + INIT_LIST_HEAD(&local_shadow_list); >>> >>> if (amdgpu_sriov_runtime(adev)) >>> tmo = msecs_to_jiffies(8000); >>> @@ -3112,8 +3115,19 @@ static int >> amdgpu_device_handle_vram_lost(struct amdgpu_device *adev) >>> tmo = msecs_to_jiffies(100); >>> >>> DRM_INFO("recover vram bo from shadow start\n"); >>> + >>> + mutex_lock(&adev->shadow_list_lock); >>> + list_splice_init(&adev->shadow_list, &local_shadow_list); >>> + mutex_unlock(&adev->shadow_list_lock); >>> + >>> + >>> mutex_lock(&adev->shadow_list_lock); >> local_shadow_list is a local variable, I think it doesn't need lock at all, no one >> change it. Otherwise looks good to me. > The bo->shadow_list which now is in local_shadow_list maybe destroy in case that it already in amdgpu_bo_destroy, then it will > change local_shadow_list, so need lock the shadow_list_lock. Ah, sorry for noise, I forget you don't reference these BOs. Thanks, David Zhou > Best wishes > Emily Deng >> Thanks, >> David Zhou >>> - list_for_each_entry_safe(bo, tmp, &adev->shadow_list, shadow_list) { >>> + list_for_each_entry_safe(bo, tmp, &local_shadow_list, shadow_list) { >>> + mutex_unlock(&adev->shadow_list_lock); >>> + >>> + if (!bo) >>> + continue; >>> + >>> next = NULL; >>> amdgpu_device_recover_vram_from_shadow(adev, ring, bo, >> &next); >>> if (fence) { >>> @@ -3132,9 +3146,14 @@ static int >>> amdgpu_device_handle_vram_lost(struct amdgpu_device *adev) >>> >>> dma_fence_put(fence); >>> fence = next; >>> + mutex_lock(&adev->shadow_list_lock); >>> } >>> mutex_unlock(&adev->shadow_list_lock); >>> >>> + mutex_lock(&adev->shadow_list_lock); >>> + list_splice_init(&local_shadow_list, &adev->shadow_list); >>> + mutex_unlock(&adev->shadow_list_lock); >>> + >>> if (fence) { >>> r = dma_fence_wait_timeout(fence, false, tmo); >>> if (r == 0)