---- On Tue, 27 Dec 2016 12:51:37 -0800 Christian König <christian.koenig at amd.com> wrote ----
> It's a well known problem that the completion interrupts are notoriously
> unreliable.
>
> That's why we have a fallback timer in amdgpu_fence.c which kicks an
> extra hardware probe after a certain timeout. Please double check that
> this one is working as expected.

I'm digging into why the fallback process isn't signalling the straggling fences. This is the fence processing code with my debug output added:

	do {
		last_seq = atomic_read(&ring->fence_drv.last_seq);
		seq = amdgpu_fence_read(ring);
	} while (atomic_cmpxchg(&drv->last_seq, last_seq, seq) != last_seq);

	if (seq != ring->fence_drv.sync_seq) {
		printf("rescheduling fallback for %s\n", ring->name);
		amdgpu_fence_schedule_fallback(ring);
	}
	if (unlikely(seq == last_seq)) {
		printf("seek == last_seq == %u skipping fence_process\n", seq);
		return;
	}

And this is the resulting log:

Dec 28 00:22:31 daleks kernel: &fence->finished at 79042060348 f 353#2026: signaled from irq context
Dec 28 00:22:31 daleks kernel: fence at 79042062972 f 0#4598: signaled from process context
Dec 28 00:22:31 daleks kernel: &fence->scheduled at 79042069573 f 74#2353: signaled from irq context
Dec 28 00:22:31 daleks kernel: skipping fallback scheduling for gfx
Dec 28 00:22:31 daleks kernel: &fence->finished at 79042112606 f 75#2353: signaled from irq context
Dec 28 00:22:31 daleks kernel: fence at 79042115268 f 0#4599: signaled from process context
Dec 28 00:22:31 daleks kernel: &fence->scheduled at 79042168961 f 352#2027: signaled from irq context
Dec 28 00:22:31 daleks kernel: skipping fallback scheduling for gfx
Dec 28 00:22:31 daleks kernel: &fence->finished at 79042234434 f 353#2027: signaled from irq context
Dec 28 00:22:31 daleks kernel: fence at 79042237108 f 0#4600: signaled from process context
Dec 28 00:22:31 daleks kernel: 353#2028 sleeping tid 100721 at 79042673751
Dec 28 00:22:31 daleks kernel: running fence fallback for sdma0
Dec 28 00:22:31 daleks kernel: seek == last_seq == 607 skipping fence_process
Dec 28 00:22:31 daleks kernel: running fence fallback for gfx
Dec 28 00:22:31 daleks kernel: seek == last_seq == 4600 skipping fence_process

It looks like the sequence numbers are saying that the device did in fact complete? Too tired to think about it further now.

>
> Another possibility is that the memory where the fence is written
> doesn't have the proper attributes (e.g. USWC vs. cached vs. uncached).

The only places where I see memory attributes being set are in amdgpu_device_init for rmmio and in amdgpu_doorbell_init for the doorbell BAR mapping. The ioremap function maps the memory as uncacheable. The driver is unmodified from Linus' tree as of "drm/amdgpu: add gart recovery by gtt list V2" - about two thirds of the way through 4.9-rc1 (modulo git merge issues). Is there any place else I should be looking?

Turning on INVARIANTS, which scribbles over memory on free (and thus aggressively flushes the cache), makes the hangs take much, much longer to occur - leading me to believe that it may well be a memory typing issue.

Thanks for getting back to me.

-M

P.S. A bit of a tangent - but maybe you could also clarify whether I'm doing something wrong when replaying commits from Linus' tree. I get the changesets and their order by doing:

% git format-patch v4.8..v4.9-rc1 drivers/gpu/drm/*.* drivers/gpu/drm/i915 drivers/gpu/drm/amd drivers/gpu/drm/radeon include/drm include/uapi/drm

'git am' fails much of the time even when there aren't conflicts, so instead I git cherry-pick the changesets in the order in which they show up in the generated patches. I frequently end up with empty commits, and sometimes a driver does not get all the requisite changes merged in and then doesn't compile.

> Regards,
> Christian.
>
> On 26.12.2016 at 02:54, Matthew Macy wrote:
> > I'm running an rx460 using the amdgpu driver from Linux 4.8 with Mesa 13/LLVM 3.9 and Xorg 1.18 on FreeBSD. It seems to largely perform pretty well.
> >
> > However, ever since I got Mesa working I will inevitably end up losing completion interrupts after X has been running for a brief period. I can bring the problem on more quickly by running glxgears with vblank_mode=0. It's a safe bet that the problem is with the linuxkpi. However, since this bug is manifesting itself in a very hardware specific way I'm coming here for advice on what I can do to dump device state to better understand why it ceases to fire interrupts.
> >
> > I enabled FENCE_TRACE and added some logging to fence creation and fence_default_wait as well. The last interrupt in this particular excerpt is:
> >
> > "Dec 22 22:36:22 daleks kernel: fence at 210850477167 f 0#233745: signaled from irq context"
> >
> > amdgpu_cs_wait goes on to sleep on 411#116530 and never wakes up. Any guidance would be much appreciated. Thanks in advance.
> >
> >
> > Dec 22 22:36:22 daleks kernel: [drm:drm_ioctl] pid=100793, dev=0xe200, auth=1, AMDGPU_BO_LIST
> > Dec 22 22:36:22 daleks kernel: [drm:drm_ioctl]
> > Dec 22 22:36:22 daleks kernel: &fence->scheduled at 210850212762 f 86#116944: signaled from irq context
> > Dec 22 22:36:22 daleks kernel: pid=100699, dev=0xe200, auth=1, AMDGPU_BO_LIST
> > Dec 22 22:36:22 daleks kernel: [drm:amdgpu_ih_process] [drm:drm_ioctl] pid=100793, dev=0xe200, auth=1, AMDGPU_CS
> > Dec 22 22:36:22 daleks kernel: amdgpu_ih_process: rptr 864, wptr 880
> > Dec 22 22:36:22 daleks kernel: [drm:gfx_v8_0_eop_irq] IH: CP EOP
> > Dec 22 22:36:22 daleks kernel: &fence->finished at 210850251259 f 411#116528: signaled from irq context
> > Dec 22 22:36:22 daleks kernel: fence at 210850253222 f 0#233742: signaled from irq context
> > Dec 22 22:36:22 daleks kernel: [drm:amdgpu_ih_process] amdgpu_ih_process: rptr 880, wptr 880
> > Dec 22 22:36:22 daleks kernel: [drm:amdgpu_ih_process] created fence 410#116529 411#116529 @210850271550
> > Dec 22 22:36:22 daleks kernel: amdgpu_ih_process: rptr 880, wptr 896
> > Dec 22 22:36:22 daleks kernel: [drm:drm_ioctl] [drm:gfx_v8_0_eop_irq] pid=100793, dev=0xe200, auth=1, AMDGPU_BO_LIST
> > Dec 22 22:36:22 daleks kernel: IH: CP EOP
> > Dec 22 22:36:22 daleks kernel: &fence->finished at 210850308909 f 87#116944: signaled from irq context
> > Dec 22 22:36:22 daleks kernel: fence at 210850310670 f 0#233743: signaled from irq context
> > Dec 22 22:36:22 daleks kernel: [drm:amdgpu_ih_process] amdgpu_ih_process: rptr 896, wptr 896
> > Dec 22 22:36:22 daleks kernel: &fence->scheduled at 210850325151 f 410#116529: signaled from irq context
> > Dec 22 22:36:22 daleks kernel: [drm:drm_ioctl] pid=100699, dev=0xe200, auth=1, AMDGPU_BO_LIST
> > Dec 22 22:36:22 daleks kernel: [drm:drm_ioctl] pid=100699, dev=0xe200, auth=1, AMDGPU_CS
> > Dec 22 22:36:22 daleks kernel: created fence 86#116945 87#116945 @210850375328
> > Dec 22 22:36:22 daleks kernel: [drm:drm_ioctl] pid=100793, dev=0xe200, auth=1, AMDGPU_BO_LIST
> > Dec 22 22:36:22 daleks kernel: [drm:drm_ioctl]
> > Dec 22 22:36:22 daleks kernel: &fence->scheduled at 210850389385 f 86#116945: signaled from irq context
> > Dec 22 22:36:22 daleks kernel: [drm:amdgpu_ih_process] pid=100699, dev=0xe200, auth=1, AMDGPU_BO_LIST
> > Dec 22 22:36:22 daleks kernel: [drm:drm_ioctl] amdgpu_ih_process: rptr 896, wptr 912
> > Dec 22 22:36:22 daleks kernel: [drm:gfx_v8_0_eop_irq] IH: CP EOP
> > Dec 22 22:36:22 daleks kernel: &fence->finished at 210850416620 f 411#116529: signaled from irq context
> > Dec 22 22:36:22 daleks kernel: fence at 210850418382 f 0#233744: signaled from irq context
> > Dec 22 22:36:22 daleks kernel: pid=100793, dev=0xe200, auth=1, AMDGPU_CS
> > Dec 22 22:36:22 daleks kernel: [drm:amdgpu_ih_process] created fence 410#116530 411#116530 @210850440720
> > Dec 22 22:36:22 daleks kernel: amdgpu_ih_process: rptr 912, wptr 912
> > Dec 22 22:36:22 daleks kernel: [drm:drm_ioctl] [drm:amdgpu_ih_process] amdgpu_ih_process: rptr 912, wptr 928
> > Dec 22 22:36:22 daleks kernel: [drm:gfx_v8_0_eop_irq] IH: CP EOP
> > Dec 22 22:36:22 daleks kernel: &fence->finished at 210850475397 f 87#116945: signaled from irq context
> > Dec 22 22:36:22 daleks kernel: fence at 210850477167 f 0#233745: signaled from irq context
> > Dec 22 22:36:22 daleks kernel: pid=100793, dev=0xe200, auth=1, AMDGPU_BO_LIST
> > Dec 22 22:36:22 daleks kernel: [drm:amdgpu_ih_process] amdgpu_ih_process: rptr 928, wptr 928
> > Dec 22 22:36:22 daleks kernel: [drm:drm_ioctl] pid=100699, dev=0xe200, auth=1, AMDGPU_BO_LIST
> > Dec 22 22:36:22 daleks kernel: [drm:drm_ioctl] pid=100699, dev=0xe200, auth=1, AMDGPU_CS
> > Dec 22 22:36:22 daleks kernel: created fence 86#116946 87#116946 @210850557790
> > Dec 22 22:36:22 daleks kernel: [drm:drm_ioctl] pid=100793, dev=0xe200, auth=1, AMDGPU_BO_LIST
> > Dec 22 22:36:22 daleks kernel: [drm:drm_ioctl] pid=100699, dev=0xe200, auth=1, AMDGPU_BO_LIST
> > Dec 22 22:36:22 daleks kernel: [drm:drm_ioctl] pid=100793, dev=0xe200, auth=1, AMDGPU_CS
> > Dec 22 22:36:22 daleks kernel: created fence 410#116531 411#116531 @210850614023
> > Dec 22 22:36:22 daleks kernel: [drm:drm_ioctl] pid=100793, dev=0xe200, auth=1, AMDGPU_BO_LIST
> > Dec 22 22:36:22 daleks kernel: [drm:drm_ioctl] pid=100699, dev=0xe200, auth=1, AMDGPU_BO_LIST
> > Dec 22 22:36:22 daleks kernel: [drm:drm_ioctl] pid=100699, dev=0xe200, auth=1, AMDGPU_CS
> > Dec 22 22:36:22 daleks kernel: created fence 86#116947 87#116947 @210850719230
> > Dec 22 22:36:22 daleks kernel: [drm:drm_ioctl] pid=100793, dev=0xe200, auth=1, AMDGPU_WAIT_CS
> > Dec 22 22:36:22 daleks kernel: [drm:drm_ioctl] amdgpu_cs_wait on 411#116530
> > Dec 22 22:36:22 daleks kernel: pid=100699, dev=0xe200, auth=1, AMDGPU_BO_LIST
> > Dec 22 22:36:22 daleks kernel: 411#116530 sleeping tid 100793 at 210850747487
> >
> >
> > -M
> >
> > _______________________________________________
> > amd-gfx mailing list
> > amd-gfx at lists.freedesktop.org
> > https://lists.freedesktop.org/mailman/listinfo/amd-gfx
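
For reference, the fallback check quoted at the top of this message can be modelled in userspace to show what "seq == last_seq" does and does not prove. The sketch below is only an illustration under stated assumptions: the model_* names are hypothetical and are not the driver's API, the fence writeback location is a plain variable rather than device-written memory, C11 atomics stand in for the kernel's atomic_cmpxchg, and sequence wraparound is ignored.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-ring fence state (not the driver's struct). */
struct model_fence_drv {
	_Atomic uint32_t last_seq;   /* last sequence number the CPU has processed */
	uint32_t sync_seq;           /* last sequence number emitted to the ring */
	volatile uint32_t *cpu_addr; /* where the device would write completed seq */
};

enum model_result {
	MODEL_IDLE,     /* nothing new and nothing outstanding */
	MODEL_STALLED,  /* nothing new, but emitted fences are still outstanding */
	MODEL_PROGRESS  /* new completions observed in (*first .. *last) */
};

/*
 * Same shape as the quoted loop: publish the value read from memory as the
 * new last_seq, retrying if another context updated last_seq in between.
 * Unlike the quoted code, the "no progress" case is split in two so that a
 * stalled ring can be told apart from an idle one.
 */
static enum model_result
model_fence_process(struct model_fence_drv *drv, uint32_t *first, uint32_t *last)
{
	uint32_t last_seq, seq;

	do {
		last_seq = atomic_load(&drv->last_seq);
		seq = *drv->cpu_addr;
	} while (!atomic_compare_exchange_strong(&drv->last_seq, &last_seq, seq));

	if (seq == last_seq)
		return (seq != drv->sync_seq) ? MODEL_STALLED : MODEL_IDLE;

	*first = last_seq + 1;
	*last = seq;
	return MODEL_PROGRESS;
}
```

With last_seq and the writeback value both at 4600, as in the gfx log above, the outcome depends entirely on sync_seq, which the log does not show: 4600 would mean the ring is idle, anything later would mean the CPU is not seeing a newer value in fence memory. Printing sync_seq next to seq in the debug message would distinguish the two cases.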