Deadlocks with multiple applications on AMD RX 460 and RX 550 - Update 2

luis.p.mendes@xxxxxxxxx (Luís Mendes) · Wed, 3 Jan 2018 23:08:47 +0000

Hi Michel, Christian,

Michel, I have tested amd-staging-drm-next at commit "drm/amdgpu/gfx9:
only init the apertures used by KGD (v2)" -
0e4946409d11913523d30bc4830d10b388438c7a and the issues remain, both
on ARMv7 and on x86 amd64.

Christian, in fact, if I replay on the AMD64 machine the apitraces
captured on the ARMv7 platform, I can also reproduce the GPU hang! So
it is not specific to the ARM platform. Should I send/upload the
apitraces? I have two of them; typically, when one doesn't hang the
GPU, the other does. One takes about 1GB of disk space, the other 2.3GB.
...
[   69.019381] ISO 9660 Extensions: RRIP_1991A
[  213.292094] DMAR: DRHD: handling fault status reg 2
[  213.292102] DMAR: [INTR-REMAP] Request device [00:00.0] fault index 1c [fault reason 38] Blocked an interrupt request due to source-id verification failure
[  223.406919] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=25158, last emitted seq=25160
[  223.406926] [drm] IP block:tonga_ih is hung!
[  223.407167] [drm] GPU recovery disabled.
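
For anyone comparing logs from several runs, the ring timeout line above
can be reduced to a few numbers. What follows is only a minimal sketch,
not part of any amdgpu tooling: the function name parse_ring_timeout and
the "pending" field are made up for illustration, and the pattern assumes
the message wording matches the line quoted above.

```python
import re

# Pattern for the amdgpu job-timeout message as it appears in the log above.
# Assumption: the wording is the same on other kernels; adjust if it differs.
TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\w+)\s+timeout, "
    r"last signaled seq=(?P<signaled>\d+), "
    r"last emitted seq=(?P<emitted>\d+)"
)


def parse_ring_timeout(log_text):
    """Return ring name and fence sequence numbers, or None if not found.

    parse_ring_timeout is a hypothetical helper name, not an existing tool.
    """
    # Mail clients hard-wrap long dmesg lines, so collapse whitespace first.
    flat = " ".join(log_text.split())
    match = TIMEOUT_RE.search(flat)
    if match is None:
        return None
    signaled = int(match.group("signaled"))
    emitted = int(match.group("emitted"))
    return {
        "ring": match.group("ring"),
        "signaled": signaled,
        "emitted": emitted,
        # Emitted sequence numbers that had not signaled at timeout.
        "pending": emitted - signaled,
    }


sample = """[  223.406919] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx
timeout, last signaled seq=25158, last emitted seq=25160"""

print(parse_ring_timeout(sample))
```

For the line above this reports ring "gfx" with two sequence numbers
emitted but not yet signaled when the timeout fired.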

Regards,
Luís

On Wed, Jan 3, 2018 at 5:47 PM, Luís Mendes <luis.p.mendes at gmail.com> wrote:
> Hi Michel, Christian,
>
> Christian, I have followed your suggestion and I have just submitted a
> bug to fdo at https://bugs.freedesktop.org/show_bug.cgi?id=104481 -
> GPU lockup Polaris 11 - AMD RX 460 and RX 550 on amd64 and on ARMv7
> platforms while playing video.
>
> Michel, amdgpu.dc=0 seems to make no difference. I will try
> amd-staging-drm-next and report back.
>
> Regards,
> Luís
>
> On Wed, Jan 3, 2018 at 5:09 PM, Michel Dänzer <michel at daenzer.net> wrote:
>> On 2018-01-03 12:02 PM, Luís Mendes wrote:
>>>
>>> What I believe is happening is that the GPU lockup only
>>> occurs when doing a page flip, since the kernel locks up with:
>>> [  243.693200] kworker/u4:3    D    0    89      2 0x00000000
>>> [  243.693232] Workqueue: events_unbound commit_work [drm_kms_helper]
>>> [  243.693251] [<80b8c6d4>] (__schedule) from [<80b8cdd0>] (schedule+0x4c/0xac)
>>> [  243.693259] [<80b8cdd0>] (schedule) from [<80b91024>]
>>> (schedule_timeout+0x228/0x444)
>>> [  243.693270] [<80b91024>] (schedule_timeout) from [<80886738>]
>>> (dma_fence_default_wait+0x2b4/0x2d8)
>>> [  243.693276] [<80886738>] (dma_fence_default_wait) from [<80885d60>]
>>> (dma_fence_wait_timeout+0x40/0x150)
>>> [  243.693284] [<80885d60>] (dma_fence_wait_timeout) from [<80887b1c>]
>>> (reservation_object_wait_timeout_rcu+0xfc/0x34c)
>>> [  243.693509] [<80887b1c>] (reservation_object_wait_timeout_rcu) from
>>> [<7f331988>] (amdgpu_dm_do_flip+0xec/0x36c [amdgpu])
>>> [  243.693789] [<7f331988>] (amdgpu_dm_do_flip [amdgpu]) from
>>> [<7f33309c>] (amdgpu_dm_atomic_commit_tail+0xbfc/0xe58 [amdgpu])
>>> ...
>>
>> Does the problem also occur if you disable DC with amdgpu.dc=0 on the
>> kernel command line?
>>
>> Does it also happen with a kernel built from the amd-staging-drm-next
>> branch instead of drm-next-4.16?
>>
>>
>> --
>> Earthling Michel Dänzer               |               http://www.amd.com
>> Libre software enthusiast             |             Mesa and X developer