Re: [PATCH] drm/amd/display: Fix deadlock with display during hanged ring recovery.

"Kazlauskas, Nicholas" <Nicholas.Kazlauskas@xxxxxxx> · Wed, 13 Feb 2019 19:52:22 +0000

On 2/13/19 2:21 PM, Grodzovsky, Andrey wrote:
> 
> On 2/13/19 2:16 PM, Kazlauskas, Nicholas wrote:
>> On 2/13/19 2:10 PM, Grodzovsky, Andrey wrote:
>>> On 2/13/19 2:00 PM, Kazlauskas, Nicholas wrote:
>>>> On 2/13/19 1:58 PM, Andrey Grodzovsky wrote:
>>>>> When ring hang happens amdgpu_dm_commit_planes during flip is holding
>>>>> the BO reserved and then stack waiting for fences to signal in
>>>>> reservation_object_wait_timeout_rcu (which won't signal because there
>>>>> was a hnag). Then when we try to shutdown display block during reset
>>>>> recovery from drm_atomic_helper_suspend we also try to reserve the BO
>>>>> from dm_plane_helper_cleanup_fb ending in deadlock.
>>>>> Also remove useless WARN_ON
>>>>>
>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@xxxxxxx>
>>>>> ---
>>>>>       drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 19 +++++++++++++------
>>>>>       1 file changed, 13 insertions(+), 6 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
>>>>> index acc4ff8..f8dec36 100644
>>>>> --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
>>>>> +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
>>>>> @@ -4802,14 +4802,21 @@ static void amdgpu_dm_commit_planes(struct drm_atomic_state *state,
>>>>>       			 */
>>>>>       			abo = gem_to_amdgpu_bo(fb->obj[0]);
>>>>>       			r = amdgpu_bo_reserve(abo, true);
>>>>> -			if (unlikely(r != 0)) {
>>>>> +			if (unlikely(r != 0))
>>>>>       				DRM_ERROR("failed to reserve buffer before flip\n");
>>>>> -				WARN_ON(1);
>>>>> -			}
>>>>>       
>>>>> -			/* Wait for all fences on this FB */
>>>>> -			WARN_ON(reservation_object_wait_timeout_rcu(abo->tbo.resv, true, false,
>>>>> -										    MAX_SCHEDULE_TIMEOUT) < 0);
>>>>> +			/*
>>>>> +			 * Wait for all fences on this FB. Do limited wait to avoid
>>>>> +			 * deadlock during GPU reset when this fence will not signal
>>>>> +			 * but we hold reservation lock for the BO.
>>>>> +			 */
>>>>> +			r = reservation_object_wait_timeout_rcu(abo->tbo.resv,
>>>>> +								true, false,
>>>>> +								msecs_to_jiffies(5000));
>>>>> +			if (unlikely(r == 0))
>>>>> +				DRM_ERROR("Waiting for fences timed out.");
>>>>> +
>>>>> +
>>>>>       
>>>>>       			amdgpu_bo_get_tiling_flags(abo, &tiling_flags);
>>>>>       
>>>>>
>>>> Is it safe that we're just continuing like this? It's probably better to
>>>> just unreserve the buffer and continue to the next plane, no?
>>>>
>>>> Nicholas Kazlauskas
>>> As far as I see it should be safe as you are simply flipping to a buffer
>>> for which rendering hasn't finished (or stack actually in this case) so
>>> you might see visual corruption but that the least of your problems if
>>> after 5s the BO still not finalized for presentation, the system is
>>> already probably in very bad shape. Also, in case we do want to  do
>>> error handling we should also take care of  amdgpu_bo_reserve failure
>>> just before that.
>>>
>>> Andrey
>>>
>>>
>> Yeah, I guess this whole blocks needs to be cleaned up in that case.
>> This is a good first step at least. Technically
>> reservation_object_wait_timeout_rcu will return < 0 when it's been
>> interrupted too as an error code but I guess that will just be silently
>> ignored here.
>>
>> If you want you can change the condition to:
>>
>> if (unlikely(r >= 0))
>>         DRM_ERROR("Waiting for FB fence failed: id=%d res=%d\n",
>> plane->id, r);
> 
> 
> Note that reservation_object_wait_timeout_rcu has a flag 'bool intr: if
> true, do interruptible wait', we set it to false since the code in case
> of flip runs form the kernel worker thread and not from IOCTL meaning we
> are not in user mode context and hence are not going to recieve user
> signals (cannot be interrupted). So the only values we can recieve here
> are either 0 for time out or val > 0 for wait that completed before time
> out value.
> 
> Andrey

Oh, right.

This patch is good as-is then.

Reviewed-by: Nicholas Kazlauskas <nicholas.kazlauskas@xxxxxxx>

> 
>>
>> But with or without that change this patch is:
>>
>> Reviewed-by: Nicholas Kazlauskas <nicholas.kazlauskas@xxxxxxx>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
> 

_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/amd-gfx