Deadlocks with multiple applications on AMD RX 460 and RX 550 - Update 2

luis.p.mendes@xxxxxxxxx (Luís Mendes) · Wed, 3 Jan 2018 11:02:10 +0000

Hi Christian,

Replies follow in between.

Regards,
Luís

On Wed, Jan 3, 2018 at 9:37 AM, Christian König
<ckoenig.leichtzumerken at gmail.com> wrote:
> Hi Luis,
>
> In general please add information like /proc/iomem and dmesg as attachment
> and not mangled inside the mail.

Ok, I'll take that into account next time. Sorry for the inconvenience.

>
> The good news is that your ARM board at least has a memory layout which
> should work in theory. So at least one problem rules out.

Ok, nice.

>
> I don't think that apitrace would be much helpful in this case as long as no
> developer has access to one of those ARM boards. But it is interesting that
> the apitrace reliable reproduces the issue. This means that it isn't
> something random, but rather a specific timing of things.

I'm afraid I don't have any boards that I can send yet. I am
developing one, but it will still take some time before it is
ready.

I've checked the apitrace and there is a common call,
glXSwapBuffers(dpy=0x1389f00, drawable=52428803), that I believe
triggers the page flip. I suspect there is a race condition with
glXSwapBuffers in mesa or amdgpu that corrupts some of the data sent
to the GPU, causing a hang.
It seems that the GPU lockup only happens when doing a page flip,
since the kernel blocks with:
[  243.693200] kworker/u4:3    D    0    89      2 0x00000000
[  243.693232] Workqueue: events_unbound commit_work [drm_kms_helper]
[  243.693251] [<80b8c6d4>] (__schedule) from [<80b8cdd0>] (schedule+0x4c/0xac)
[  243.693259] [<80b8cdd0>] (schedule) from [<80b91024>] (schedule_timeout+0x228/0x444)
[  243.693270] [<80b91024>] (schedule_timeout) from [<80886738>] (dma_fence_default_wait+0x2b4/0x2d8)
[  243.693276] [<80886738>] (dma_fence_default_wait) from [<80885d60>] (dma_fence_wait_timeout+0x40/0x150)
[  243.693284] [<80885d60>] (dma_fence_wait_timeout) from [<80887b1c>] (reservation_object_wait_timeout_rcu+0xfc/0x34c)
[  243.693509] [<80887b1c>] (reservation_object_wait_timeout_rcu) from [<7f331988>] (amdgpu_dm_do_flip+0xec/0x36c [amdgpu])
[  243.693789] [<7f331988>] (amdgpu_dm_do_flip [amdgpu]) from [<7f33309c>] (amdgpu_dm_atomic_commit_tail+0xbfc/0xe58 [amdgpu])
...

I will try to reproduce this on x86 with a similar software stack
and the apitrace traces I got.
What do you think, does this make sense? Do you have further
suggestions that may help pin down the problem?

Another strange thing: the traces that were consistently causing
hangs yesterday are having a bit more difficulty causing them today,
but if I play the video with kodi it hangs easily again. Both kodi and
glretrace always hang with similar kernel backtraces, like the one
above.
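
To check more rigorously that two hangs really end in the same call
path, the backtraces can be reduced to their function chains and
compared. Below is a minimal sketch in Python, assuming the ARM-style
"[<addr>] (callee) from [<addr>] (caller+0x.../0x...)" frame format
shown above. The names call_chain, FRAME_RE and TIMESTAMP_RE are
hypothetical helpers made up for illustration; they are not part of
any kernel or apitrace tooling.

```python
import re

# One ARM-style frame, e.g.
#   [<80b8cdd0>] (schedule) from [<80b91024>] (schedule_timeout+0x228/0x444)
# An optional " [module]" suffix may follow either function name.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+\((\w+)(?:\s+\[\w+\])?\)\s+from\s+"
    r"\[<[0-9a-f]+>\]\s+\((\w+)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[\w+\])?\)"
)

# dmesg timestamps such as "[  243.693259]"
TIMESTAMP_RE = re.compile(r"\[\s*\d+\.\d+\]")


def call_chain(log_text):
    """Return the function names of a backtrace, innermost frame first."""
    # Drop timestamps and collapse whitespace so that frames wrapped
    # across lines by a mail client are matched as one frame again.
    flat = " ".join(TIMESTAMP_RE.sub(" ", log_text).split())
    chain = []
    for callee, caller in FRAME_RE.findall(flat):
        if not chain:
            # Only the first frame contributes its callee; afterwards each
            # callee repeats the previous frame's caller.
            chain.append(callee)
        chain.append(caller)
    return chain
```

Two hangs can then be compared with call_chain(log_a) ==
call_chain(log_b), or checked for a given function with
"amdgpu_dm_do_flip" in call_chain(log_a). Addresses and offsets are
deliberately ignored, so the comparison still works across kernel
rebuilds.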