Re: "ring gfx timeout" with Vega 64 on mesa 19.0.0-rc2 and kernel 5.0.0-rc6 (GPU reset still not works)

"Grodzovsky, Andrey" <Andrey.Grodzovsky@xxxxxxx> · Tue, 12 Feb 2019 18:42:33 +0000

Sure, that would probably be the solution. One missing detail here
(besides confirming with the debug prints that this is the scenario we
are hitting) is WHY we are even stuck in
reservation_object_wait_timeout_rcu. In amdgpu_device_pre_asic_reset
(during GPU reset) we first force completion of all outstanding HW
fences through amdgpu_fence_driver_force_completion BEFORE proceeding
to IP block suspend in amdgpu_device_ip_suspend. One possible
explanation would be that the fence attached to the BO is a scheduler
fence (SW fence) and not the backing HW fence. I will be able to verify
this with some fence traces after confirming that the deadlock is
indeed the one I described.
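
For illustration only, here is a minimal user-space Python sketch of the
shorter-wait idea from the quoted message below. Nothing in it is amdgpu
or DRM code: Fence, BufferObject, do_flip and cleanup_fb are hypothetical
stand-ins, and the timeouts are arbitrary. It only shows that a bounded
fence wait lets the flip path drop the reservation, so a second path
that needs the same reservation can make progress.

```python
import threading


class Fence:
    """Toy stand-in for a GPU fence (hypothetical, not the kernel's fence type)."""

    def __init__(self):
        self._signaled = threading.Event()

    def signal(self):
        self._signaled.set()

    def wait(self, timeout_s):
        # Returns True if the fence signaled, False if the wait timed out.
        return self._signaled.wait(timeout_s)


class BufferObject:
    """Toy stand-in for a BO: one reservation lock plus one attached fence."""

    def __init__(self, fence):
        self.reservation = threading.Lock()
        self.fence = fence


def do_flip(bo, timeout_s):
    """Model of the flip path: reserve the BO, then wait on its fence.

    With a bounded timeout the function gives up, skips the FB update and
    releases the reservation instead of blocking forever.
    """
    with bo.reservation:
        if not bo.fence.wait(timeout_s):
            return False  # fence never signaled: skip the FB update
        return True  # fence signaled: the flip would be programmed here


def cleanup_fb(bo, timeout_s):
    """Model of the reset path: it needs the same reservation."""
    if not bo.reservation.acquire(timeout=timeout_s):
        return False  # still held by the flip path
    bo.reservation.release()
    return True


if __name__ == "__main__":
    bo = BufferObject(Fence())  # this fence is never signaled
    flipper = threading.Thread(target=do_flip, args=(bo, 0.05))
    flipper.start()
    flipper.join()
    print("reset path got the reservation:", cleanup_fb(bo, 0.5))
```

If the wait in do_flip had no timeout, the flip thread would keep the
reservation forever and cleanup_fb could never acquire it, which is the
shape of the deadlock suspected in this thread.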

Andrey

On 2/12/19 1:29 PM, Kazlauskas, Nicholas wrote:
> The MAX_SCHEDULE_TIMEOUT is probably not a good idea on the wait in DM.
>
> I wonder if we could just do a shorter wait and skip the FB
> update/programming if it fails after some reasonable amount of time.
>
> This would still allow recovery to happen at least even if the display
> isn't showing the right buffer.
>
> Nicholas Kazlauskas
>
> On 2/12/19 12:46 PM, Grodzovsky, Andrey wrote:
>> I suspect the issue is that amdgpu_dm_do_flip is holding the BO reserved
>> and then gets stuck waiting for fences to signal in
>> reservation_object_wait_timeout_rcu (which won't signal because there
>> was a VM_FAULT). Then, when we try to shut down the display block during
>> reset recovery from drm_atomic_helper_suspend, we also try to reserve the
>> BO, probably from dm_plane_helper_cleanup_fb, ending in deadlock.
>>
>> To confirm, I am attaching some printks around the BO reservation -
>> please apply and rerun.
>>
>> Also, it is probably a good idea to open an FDO ticket on this instead
>> of using amd-gfx.
>>
>> Andrey
>>
>>
>> On 2/12/19 10:49 AM, Mikhail Gavrilov wrote:
>>> On Tue, 12 Feb 2019 at 20:23, Grodzovsky, Andrey
>>> <Andrey.Grodzovsky@xxxxxxx> wrote:
>>>> It should recover you, so this looks like a bug. I noticed
>>>> drm_atomic_helper_suspend in one of the call traces, which points to
>>>> the system going into sleep mode. Is that what happened? Did it hang
>>>> when the system tried to sleep?
>>>>
>>> It's weird because the computer did not enter sleep mode. I am sure.
>>> Steps to reproduce:
>>> 1. Launch Shadow of the Tomb Raider on Proton
>>> 2. Wait some time until the mouse stops responding
>>> 3. Dump gfx, waves and all other dumps including dmesg
>>>
>>> And of course the power button (the button which enters sleep mode) was
>>> not pressed.
>>>
>>> So do the new dumps have any new useful info? Or are they pointless?
>>> --
>>> Best Regards,
>>> Mike Gavrilov.
>>>
>>> _______________________________________________
>>> amd-gfx mailing list
>>> amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx