Deadlocks with multiple applications on AMD RX 460 and RX 550 - Update 2

luis.p.mendes@xxxxxxxxx (Luís Mendes) · Wed, 3 Jan 2018 11:56:24 +0000

Hi Christian, David,

David, replying to your question... The issue is indeed reproducible
on x86; I just reproduced it with kodi and the same VP9 video. So it is
not ARM-specific.

Regards,
Luís

On Wed, Jan 3, 2018 at 11:02 AM, Luís Mendes <luis.p.mendes at gmail.com> wrote:
> Hi Christian,
>
> Replies follow in between.
>
> Regards,
> Luís
>
> On Wed, Jan 3, 2018 at 9:37 AM, Christian König
> <ckoenig.leichtzumerken at gmail.com> wrote:
>> Hi Luis,
>>
>> In general, please add information like /proc/iomem and dmesg as attachments
>> rather than mangled inline in the mail.
>
> Ok, I'll take that into account next time. Sorry for the inconvenience.
>
>>
>> The good news is that your ARM board at least has a memory layout which
>> should work in theory. So at least one problem is ruled out.
>
> Ok, nice.
>
>>
>> I don't think that apitrace would be much help in this case as long as no
>> developer has access to one of those ARM boards. But it is interesting that
>> the apitrace reliably reproduces the issue. This means that it isn't
>> something random, but rather a specific timing of things.
>
> I am afraid I don't currently have any boards that I can send. I am
> developing one, but it will still take some time before I have one
> ready.
>
> I've checked the apitrace and there is a common call
> glXSwapBuffers(dpy=0x1389f00, drawable=52428803) that I believe will
> trigger the page flip. I suspect there is a race condition with
> glXSwapBuffers in mesa or amdgpu that corrupts some of the data sent
> to the GPU, causing a hang.
> It seems that the GPU lockup only happens when doing a page flip,
> since the kernel blocks with:
> [  243.693200] kworker/u4:3    D    0    89      2 0x00000000
> [  243.693232] Workqueue: events_unbound commit_work [drm_kms_helper]
> [  243.693251] [<80b8c6d4>] (__schedule) from [<80b8cdd0>] (schedule+0x4c/0xac)
> [  243.693259] [<80b8cdd0>] (schedule) from [<80b91024>]
> (schedule_timeout+0x228/0x444)
> [  243.693270] [<80b91024>] (schedule_timeout) from [<80886738>]
> (dma_fence_default_wait+0x2b4/0x2d8)
> [  243.693276] [<80886738>] (dma_fence_default_wait) from [<80885d60>]
> (dma_fence_wait_timeout+0x40/0x150)
> [  243.693284] [<80885d60>] (dma_fence_wait_timeout) from [<80887b1c>]
> (reservation_object_wait_timeout_rcu+0xfc/0x34c)
> [  243.693509] [<80887b1c>] (reservation_object_wait_timeout_rcu) from
> [<7f331988>] (amdgpu_dm_do_flip+0xec/0x36c [amdgpu])
> [  243.693789] [<7f331988>] (amdgpu_dm_do_flip [amdgpu]) from
> [<7f33309c>] (amdgpu_dm_atomic_commit_tail+0xbfc/0xe58 [amdgpu])
> ...
>
> I will try to reproduce this on x86 with a similar software stack
> and the apitrace traces I got.
> What do you think, does this make sense? Do you have further
> suggestions that may help pin down the problem?
>
> Another strange thing... the traces that were consistently causing
> hangs yesterday are having a bit more difficulty causing them today,
> but if I play the video with kodi it hangs easily again. Both kodi and
> glretrace always hang with similar kernel backtraces, like the one
> above.
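
For anyone comparing their own logs against the backtrace quoted above, here is a rough sketch of how the call chain could be pulled out of a dmesg capture and checked for the same "page flip blocked on a fence" pattern. It is only an illustration: the helper names call_chain and flip_blocked_on_fence are made up for this example, and it assumes the ARM-style "(callee) from [<addr>] (caller+off/size)" frame format with each frame on a single line, as in raw dmesg output rather than the wrapped mail text. Traces from x86 kernels use a different format and would not match.

```python
import re

# One ARM-style backtrace frame:
#   [<addr>] (callee [module]) from [<addr>] (caller+0xoff/0xsize [module])
# The "[module]" suffix is optional on both sides.
FRAME = re.compile(
    r"\((\w+)(?: \[\w+\])?\) from \[<[0-9a-f]+>\] "
    r"\((\w+)\+0x[0-9a-f]+/0x[0-9a-f]+(?: \[\w+\])?\)"
)


def call_chain(dmesg_text):
    """Return function names from the innermost frame outwards."""
    chain = []
    for callee, caller in FRAME.findall(dmesg_text):
        if not chain:
            # The first frame also contributes its callee.
            chain.append(callee)
        chain.append(caller)
    return chain


def flip_blocked_on_fence(chain):
    """True if the chain shows amdgpu_dm_do_flip waiting on a dma fence."""
    return "dma_fence_wait_timeout" in chain and "amdgpu_dm_do_flip" in chain


# Frames taken from the backtrace in this thread, unwrapped to one per line.
SAMPLE = """\
[  243.693270] [<80b91024>] (schedule_timeout) from [<80886738>] (dma_fence_default_wait+0x2b4/0x2d8)
[  243.693276] [<80886738>] (dma_fence_default_wait) from [<80885d60>] (dma_fence_wait_timeout+0x40/0x150)
[  243.693284] [<80885d60>] (dma_fence_wait_timeout) from [<80887b1c>] (reservation_object_wait_timeout_rcu+0xfc/0x34c)
[  243.693509] [<80887b1c>] (reservation_object_wait_timeout_rcu) from [<7f331988>] (amdgpu_dm_do_flip+0xec/0x36c [amdgpu])
[  243.693789] [<7f331988>] (amdgpu_dm_do_flip [amdgpu]) from [<7f33309c>] (amdgpu_dm_atomic_commit_tail+0xbfc/0xe58 [amdgpu])
"""

if __name__ == "__main__":
    chain = call_chain(SAMPLE)
    print(" <- ".join(chain))
    print("flip blocked on fence:", flip_blocked_on_fence(chain))
```

A match only says that the blocked task looks like the one in this thread; it does not tell you why the fence was never signalled.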