No fallback for fences created by amd_sched_fence_create? Re: Losing completion interrupts with amdgpu on rx460

 ---- On Thu, 29 Dec 2016 06:16:29 -0800 Christian König <christian.koenig at amd.com> wrote ---- 
 > Scheduler fences are a pure software fence implementation.
 > 
 > E.g. they encapsulate only a hardware fence which is created later than 
 > the software scheduler fence and has a separate sequence number range.
 > 
 > So there is no fallback necessary for them.
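
A minimal sketch of that relationship, not the driver's code: every name below (sw_fence, hw_fence, sw_fence_attach and friends) is hypothetical, and the single-callback design is an assumption made to keep it short. The software fence takes its number from the entity's counter when the job is created, the hardware fence takes its number from the ring's counter when the job is emitted, and the software fence finishes only because the hardware fence's signal path calls back into it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct hw_fence;
typedef void (*hw_fence_cb)(struct hw_fence *hw, void *data);

/* Hardware fence: numbered from the ring's own counter. */
struct hw_fence {
    uint64_t    context;
    uint32_t    seqno;
    bool        signaled;
    hw_fence_cb cb;       /* one callback is enough for this sketch */
    void       *cb_data;
};

/* Software (scheduler) fence: numbered from the entity's counter. */
struct sw_fence {
    uint64_t         context;
    uint32_t         seqno;
    bool             finished;
    struct hw_fence *parent;  /* NULL until the job reaches the ring */
};

/* Created first, at job submission time. */
void sw_fence_init(struct sw_fence *sw, uint64_t entity_context,
                   uint32_t *entity_seq)
{
    sw->context  = entity_context;
    sw->seqno    = ++*entity_seq;
    sw->finished = false;
    sw->parent   = NULL;
}

/* Created later, when the job is actually written to the ring. */
void hw_fence_emit(struct hw_fence *hw, uint64_t ring_context,
                   uint32_t *ring_seq)
{
    hw->context  = ring_context;
    hw->seqno    = ++*ring_seq;
    hw->signaled = false;
    hw->cb       = NULL;
    hw->cb_data  = NULL;
}

/* Callback run from the hardware fence's signal path. */
static void sw_fence_parent_done(struct hw_fence *hw, void *data)
{
    struct sw_fence *sw = data;

    (void)hw;
    sw->finished = true;
}

/* Tie the software fence to the hardware fence that carries its job. */
void sw_fence_attach(struct sw_fence *sw, struct hw_fence *hw)
{
    assert(sw->parent == NULL);
    sw->parent = hw;
    if (hw->signaled) {       /* already done: finish immediately */
        sw->finished = true;
        return;
    }
    hw->cb      = sw_fence_parent_done;
    hw->cb_data = sw;
}

/* Called from the interrupt handler or from the fallback timer alike. */
void hw_fence_signal(struct hw_fence *hw)
{
    if (hw->signaled)
        return;
    hw->signaled = true;
    if (hw->cb != NULL)
        hw->cb(hw, hw->cb_data);
}
```

In this model it does not matter whether hw_fence_signal runs from the interrupt path or from the fallback: either way the software fence is finished in the same call, which is why it needs no fallback of its own.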
 > 
 >  From your description and tracing I have to agree that it actually 
 > doesn't look like a hardware issue at all.
 > 
 > E.g. your hardware fences fire as expected and interrupts come down as 
 > expected as well.
 > 
 > But when the fence is signaled your task waiting for this isn't woken up 
 > for some reason.
 > 
 > That might point us to a problem with the task handling function in the 
 > linuxkpi, but I'm really not into how that works on BSD.

It appears that wakeups to amd_sched_main periodically get lost. There is clearly a hole in the linuxkpi that doesn't exist on Linux. I've put a band-aid on it by replacing the schedule() in the wait_event_interruptible there with a schedule_timeout wrapper that caps each wait at 100ms, so no matter what it will re-check the condition soon enough. Everything seems responsive and usable now, at least for an hour or two, before something else happens that keeps cs_wait from getting signalled :-/.
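
For anyone following along, here is a userspace analogue of that band-aid. It uses pthreads rather than the linuxkpi, so it is only a sketch of the idea: capped_waiter, capped_wait and set_without_wakeup are made-up names, and the actual change sits in the wait_event_interruptible path, which is not shown here. The waiter never sleeps longer than the cap, so a wakeup that never arrives costs at most one cap interval.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical stand-in for a wait queue plus its wake condition. */
struct capped_waiter {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    bool            ready;   /* the condition the sleeper re-checks */
};

#define CAPPED_WAITER_INIT \
    { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, false }

/*
 * Sleep until w->ready is true, but never for more than cap_ms at a time.
 * Returns how many sleeps ended because the cap expired.
 */
int capped_wait(struct capped_waiter *w, long cap_ms)
{
    int timeouts = 0;

    assert(cap_ms > 0);
    pthread_mutex_lock(&w->lock);
    while (!w->ready) {
        struct timespec deadline;

        /* Absolute deadline: now + cap_ms, with nanoseconds normalised. */
        clock_gettime(CLOCK_REALTIME, &deadline);
        deadline.tv_sec  += cap_ms / 1000;
        deadline.tv_nsec += (cap_ms % 1000) * 1000000L;
        if (deadline.tv_nsec >= 1000000000L) {
            deadline.tv_sec++;
            deadline.tv_nsec -= 1000000000L;
        }
        if (pthread_cond_timedwait(&w->cond, &w->lock, &deadline) == ETIMEDOUT)
            timeouts++;
        /* Either way, the loop condition re-checks w->ready. */
    }
    pthread_mutex_unlock(&w->lock);
    return timeouts;
}

/* Simulates a lost wakeup: sets the condition but never signals the cond. */
void *set_without_wakeup(void *arg)
{
    struct capped_waiter *w = arg;
    struct timespec delay = { 0, 50 * 1000000L };   /* 50 ms */

    nanosleep(&delay, NULL);
    pthread_mutex_lock(&w->lock);
    w->ready = true;
    pthread_mutex_unlock(&w->lock);
    return NULL;
}
```

A waiter paired with set_without_wakeup would sleep forever with a plain pthread_cond_wait; with capped_wait it returns shortly after the flag is set. The cost is periodic polling, which hides the lost wakeup rather than fixing it.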

Thanks for all your help.

-M


 > Regards,
 > Christian.
 > 
 > Am 28.12.2016 um 22:17 schrieb Matthew Macy:
 > >   ---- On Wed, 28 Dec 2016 00:35:32 -0800 Matthew Macy <mmacy at nextbsd.org> wrote ----
 > >   >  ---- On Tue, 27 Dec 2016 12:51:37 -0800 Christian König <christian.koenig at amd.com> wrote ----
 > >   >  > It's a well known problem that the completion interrupts are notoriously
 > >   >  > unreliable.
 > >   >  >
 > >   >  > That's why we have a fallback timer in amdgpu_fence.c which kicks an
 > >   >  > extra hardware probe after a certain timeout. Please double check that
 > >   >  > this one is working as expected.
 > >   >
 > >   > I'm digging in to why the fallback process isn't signalling the straggling fences.
 > >
 > > It looks like you have two classes of fences. One is processed by fence_process and will be handled by the fallback if its interrupt is lost; the other relies completely on interrupts. It looks like interrupts don't cut out completely, but a fence of the second class gets "forgotten about". Can you please tell me what I'm missing?
 > >
 > > amdgpu_cs_submit -> amd_sched_job_init ->amd_sched_fence_create
 > >
 > > Here the sequence number is based on the scheduling entity:
 > >
 > > struct amd_sched_fence *amd_sched_fence_create(struct amd_sched_entity *entity,
 > >                            void *owner)
 > > {
 > >     struct amd_sched_fence *fence = NULL;
 > >     unsigned seq;
 > >
 > >     fence = kmem_cache_zalloc(sched_fence_slab, GFP_KERNEL);
 > >     if (fence == NULL)
 > >         return NULL;
 > >
 > >     fence->owner = owner;
 > >     fence->sched = entity->sched;
 > >     spin_lock_init(&fence->lock);
 > >
 > >     seq = atomic_inc_return(&entity->fence_seq);
 > >     fence_init(&fence->scheduled, &amd_sched_fence_ops_scheduled,
 > >            &fence->lock, entity->fence_context, seq);
 > >     fence_init(&fence->finished, &amd_sched_fence_ops_finished,
 > >            &fence->lock, entity->fence_context + 1, seq);
 > >
 > >     return fence;
 > > }
 > >
 > > amdgpu_ib_schedule -> amdgpu_fence_emit
 > >
 > > Here the fence number is based on the fence_driver from the ring itself.
 > >
 > > int amdgpu_fence_emit(struct amdgpu_ring *ring, struct fence **f)
 > > {
 > >     struct amdgpu_device *adev = ring->adev;
 > >     struct amdgpu_fence *fence;
 > >     struct fence *old, **ptr;
 > >     uint32_t seq;
 > >
 > >     fence = kmem_cache_alloc(amdgpu_fence_slab, GFP_KERNEL);
 > >     if (fence == NULL)
 > >         return -ENOMEM;
 > >
 > >     seq = ++ring->fence_drv.sync_seq;
 > >     fence->ring = ring;
 > >     fence_init(&fence->base, &amdgpu_fence_ops,
 > >            &ring->fence_drv.lock,
 > >            adev->fence_context + ring->idx,
 > >            seq);
 > >     printf("emitting fence %lu#%u on ring %s\n", adev->fence_context + ring->idx,
 > >            seq, ring->name);
 > >
 > >     amdgpu_ring_emit_fence(ring, ring->fence_drv.gpu_addr,
 > >                    seq, AMDGPU_FENCE_FLAG_INT);
 > >
 > >   >
 > >   >
 > >   >         do {
 > >   >                 last_seq = atomic_read(&ring->fence_drv.last_seq);
 > >   >                 seq = amdgpu_fence_read(ring);
 > >   >
 > >   >     } while (atomic_cmpxchg(&drv->last_seq, last_seq, seq) != last_seq);
 > >   >
 > >   >         if (seq != ring->fence_drv.sync_seq) {
 > >   >         printf("rescheduling fallback for %s\n", ring->name);
 > >   >                 amdgpu_fence_schedule_fallback(ring);
 > >   >         }
 > >
 > > fence_process will not run on a ring unless there are outstanding fences submitted to it:
 > >
 > >   >         if (unlikely(seq == last_seq)) {
 > >   >                 printf("seq == last_seq == %u skipping fence_process\n", seq);
 > >   >                 return;
 > >   >         }
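
The two snippets above can be read as one pass over a ring. Below is a small model of that pass, assuming the hardware write-back is just a plain integer and the fallback timer is a flag. ring_model and ring_model_process are hypothetical names, not driver functions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

struct ring_model {
    atomic_uint last_seq;        /* last sequence number already processed */
    uint32_t    sync_seq;        /* last sequence number emitted to the ring */
    uint32_t    hw_seq;          /* value the GPU has written back so far */
    bool        fallback_armed;  /* stand-in for the fallback timer */
};

/* One processing pass. Returns how many fences it newly completed. */
uint32_t ring_model_process(struct ring_model *r)
{
    unsigned int last_seq, seq;

    /* Publish the hardware value as the new last_seq, retrying on races. */
    do {
        last_seq = atomic_load(&r->last_seq);
        seq = r->hw_seq;
    } while (!atomic_compare_exchange_strong(&r->last_seq, &last_seq, seq));

    /* Fences are still outstanding: keep the fallback timer armed. */
    r->fallback_armed = (seq != r->sync_seq);

    /* Nothing new since the previous pass: skip the signalling work. */
    if (seq == last_seq)
        return 0;

    return seq - last_seq;
}
```

Once hw_seq has caught up with sync_seq, a further pass returns 0 and leaves the flag clear. In this model that means every fence emitted to the ring has already been accounted for, so a waiter that is still asleep at that point is not waiting on the ring's counter.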
 > >
 > > Dec 28 12:46:35 daleks kernel: creating fence for 75#39 gfx <- last signalled by interrupt below
 > > <...>
 > > Dec 28 12:46:35 daleks kernel: creating fence for 75#40 gfx <- not signalled and fallback won't handle it
 > > <...>
 > > Dec 28 12:46:35 daleks kernel: &fence->finished at 263929958993 f 75#37: signaled from irq context
 > > <...>
 > > Dec 28 12:46:35 daleks kernel: emitting fence 0#106 on ring gfx <- these get handled by process
 > > Dec 28 12:46:35 daleks kernel: emitting fence 0#107 on ring gfx  <-
 > > Dec 28 12:46:35 daleks kernel: &fence->scheduled at 263930441405 f 74#39: signaled from irq context
 > > Dec 28 12:46:35 daleks kernel: skipping fallback scheduling for gfx
 > > Dec 28 12:46:35 daleks kernel: rescheduling fallback for gfx
 > > Dec 28 12:46:35 daleks kernel: scheduling fallback for gfx
 > > Dec 28 12:46:35 daleks kernel: fence at 263930474673 f 0#106: signaled from process context
 > > Dec 28 12:46:35 daleks kernel: &fence->finished at 263930569703 f 75#39: signaled from irq context <- last one signalled from this context
 > > Dec 28 12:46:35 daleks kernel: fence at 263930571998 f 0#107: signaled from process context
 > > Dec 28 12:46:35 daleks kernel: creating fence for 75#41 gfx  <- if this isn't handled by an interrupt
 > > Dec 28 12:46:36 daleks kernel: running fence fallback for gfx
 > > Dec 28 12:46:36 daleks kernel: seq == last_seq == 107 skipping fence_process for gfx
 > >
 > >
 > >
 > >
 > >   >  >
 > >   >  > Another possibility is that the memory where the fence is written
 > >   >  > doesn't have the proper attributes (e.g. USWC vs. cached vs. uncached).
 > >   >
 > >   > The only places where I see memory attributes being set are in amdgpu_device_init for rmmio and the doorbell bar mapping in amdgpu_doorbell_init. The ioremap function will remap the memory uncacheable. The driver is unmodified from Linus' tree as of "drm/amdgpu: add gart recovery by gtt list V2" - about two thirds of the way through 4.9-rc1 (modulo git merge issues). Is there any place else I should be looking? Turning on INVARIANTS, which scribbles memory on free (and thus aggressively flushes the cache), causes the hangs to take much, much longer to occur - leading me to believe that it may well be a memory typing issue.
 > >
 > >
 > > Thanks in advance.
 > >
 > > -M
 > >
 > >
 > >   > Thanks for getting back to me.
 > >   >
 > >   > -M
 > >   >
 > >   > P.S.
 > >   >
 > >   > A bit of a tangent - but maybe you could also clarify if I'm doing something wrong when replaying commits from Linus' tree. The way I get the changesets and the sequence is by doing:
 > >   > % git format-patch v4.8..v4.9-rc1 drivers/gpu/drm/*.* drivers/gpu/drm/i915 drivers/gpu/drm/amd drivers/gpu/drm/radeon include/drm include/uapi/drm
 > >   >
 > >   > 'git am' fails much of the time even when there aren't conflicts so what I do is I git cherry-pick the changesets in the order that they show up in the generated patches. I frequently end up with empty commits and sometimes the drivers will not end up with all the requisite changes merged in such that it doesn't compile.
 > >   >
 > >   >
 > >   >
 > >   >
 > >   >  > Regards,
 > >   >  > Christian.
 > >   >  >
 > >   >  > Am 26.12.2016 um 02:54 schrieb Matthew Macy:
 > >   >  > > I'm running an rx460 using the amdgpu driver from Linux 4.8 with Mesa 13/LLVM 3.9 and Xorg 1.18 on FreeBSD. It seems to largely perform pretty well.
 > >   >  > >
 > >   >  > > However, ever since I got Mesa working I will inevitably end up losing completion interrupts after X has been running for a brief period. I can bring the problem on more quickly by running glxgears with vblank_mode=0. It's a safe bet that the problem is with the linuxkpi. However, since this bug is manifesting itself in a very hardware-specific way, I'm coming here for advice on what I can do to dump device state to better understand why it ceases to fire interrupts.
 > >   >  > >
 > >   >  > > I enabled FENCE_TRACE and added some logging to fence creation and fence_default_wait as well. The last interrupt in this particular excerpt is:
 > >   >  > >
 > >   >  > > "Dec 22 22:36:22 daleks kernel: fence at 210850477167 f 0#233745: signaled from irq context"
 > >   >  > >
 > >   >  > > amdgpu_cs_wait goes on to sleep on 411#116530 and never wake up. Any guidance would be much appreciated. Thanks in advance.
 > >   >  > >
 > >   >  > >
 > >   >  > >
 > >   >  > >
 > >   >  > > Dec 22 22:36:22 daleks kernel: [drm:drm_ioctl] pid=100793, dev=0xe200, auth=1, AMDGPU_BO_LIST
 > >   >  > > Dec 22 22:36:22 daleks kernel: [drm:drm_ioctl]
 > >   >  > > Dec 22 22:36:22 daleks kernel: &fence->scheduled at 210850212762 f 86#116944: signaled from irq context
 > >   >  > > Dec 22 22:36:22 daleks kernel: pid=100699, dev=0xe200, auth=1, AMDGPU_BO_LIST
 > >   >  > > Dec 22 22:36:22 daleks kernel: [drm:amdgpu_ih_process] [drm:drm_ioctl] pid=100793, dev=0xe200, auth=1, AMDGPU_CS
 > >   >  > > Dec 22 22:36:22 daleks kernel: amdgpu_ih_process: rptr 864, wptr 880
 > >   >  > > Dec 22 22:36:22 daleks kernel: [drm:gfx_v8_0_eop_irq] IH: CP EOP
 > >   >  > > Dec 22 22:36:22 daleks kernel: &fence->finished at 210850251259 f 411#116528: signaled from irq context
 > >   >  > > Dec 22 22:36:22 daleks kernel: fence at 210850253222 f 0#233742: signaled from irq context
 > >   >  > > Dec 22 22:36:22 daleks kernel: [drm:amdgpu_ih_process] amdgpu_ih_process: rptr 880, wptr 880
 > >   >  > > Dec 22 22:36:22 daleks kernel: [drm:amdgpu_ih_process] created fence 410#116529 411#116529 @210850271550
 > >   >  > > Dec 22 22:36:22 daleks kernel: amdgpu_ih_process: rptr 880, wptr 896
 > >   >  > > Dec 22 22:36:22 daleks kernel: [drm:drm_ioctl] [drm:gfx_v8_0_eop_irq] pid=100793, dev=0xe200, auth=1, AMDGPU_BO_LIST
 > >   >  > > Dec 22 22:36:22 daleks kernel: IH: CP EOP
 > >   >  > > Dec 22 22:36:22 daleks kernel: &fence->finished at 210850308909 f 87#116944: signaled from irq context
 > >   >  > > Dec 22 22:36:22 daleks kernel: fence at 210850310670 f 0#233743: signaled from irq context
 > >   >  > > Dec 22 22:36:22 daleks kernel: [drm:amdgpu_ih_process] amdgpu_ih_process: rptr 896, wptr 896
 > >   >  > > Dec 22 22:36:22 daleks kernel: &fence->scheduled at 210850325151 f 410#116529: signaled from irq context
 > >   >  > > Dec 22 22:36:22 daleks kernel: [drm:drm_ioctl] pid=100699, dev=0xe200, auth=1, AMDGPU_BO_LIST
 > >   >  > > Dec 22 22:36:22 daleks kernel: [drm:drm_ioctl] pid=100699, dev=0xe200, auth=1, AMDGPU_CS
 > >   >  > > Dec 22 22:36:22 daleks kernel: created fence 86#116945 87#116945 @210850375328
 > >   >  > > Dec 22 22:36:22 daleks kernel: [drm:drm_ioctl] pid=100793, dev=0xe200, auth=1, AMDGPU_BO_LIST
 > >   >  > > Dec 22 22:36:22 daleks kernel: [drm:drm_ioctl]
 > >   >  > > Dec 22 22:36:22 daleks kernel: &fence->scheduled at 210850389385 f 86#116945: signaled from irq context
 > >   >  > > Dec 22 22:36:22 daleks kernel: [drm:amdgpu_ih_process] pid=100699, dev=0xe200, auth=1, AMDGPU_BO_LIST
 > >   >  > > Dec 22 22:36:22 daleks kernel: [drm:drm_ioctl] amdgpu_ih_process: rptr 896, wptr 912
 > >   >  > > Dec 22 22:36:22 daleks kernel: [drm:gfx_v8_0_eop_irq] IH: CP EOP
 > >   >  > > Dec 22 22:36:22 daleks kernel: &fence->finished at 210850416620 f 411#116529: signaled from irq context
 > >   >  > > Dec 22 22:36:22 daleks kernel: fence at 210850418382 f 0#233744: signaled from irq context
 > >   >  > > Dec 22 22:36:22 daleks kernel: pid=100793, dev=0xe200, auth=1, AMDGPU_CS
 > >   >  > > Dec 22 22:36:22 daleks kernel: [drm:amdgpu_ih_process] created fence 410#116530 411#116530 @210850440720
 > >   >  > > Dec 22 22:36:22 daleks kernel: amdgpu_ih_process: rptr 912, wptr 912
 > >   >  > > Dec 22 22:36:22 daleks kernel: [drm:drm_ioctl] [drm:amdgpu_ih_process] amdgpu_ih_process: rptr 912, wptr 928
 > >   >  > > Dec 22 22:36:22 daleks kernel: [drm:gfx_v8_0_eop_irq] IH: CP EOP
 > >   >  > > Dec 22 22:36:22 daleks kernel: &fence->finished at 210850475397 f 87#116945: signaled from irq context
 > >   >  > > Dec 22 22:36:22 daleks kernel: fence at 210850477167 f 0#233745: signaled from irq context
 > >   >  > > Dec 22 22:36:22 daleks kernel: pid=100793, dev=0xe200, auth=1, AMDGPU_BO_LIST
 > >   >  > > Dec 22 22:36:22 daleks kernel: [drm:amdgpu_ih_process] amdgpu_ih_process: rptr 928, wptr 928
 > >   >  > > Dec 22 22:36:22 daleks kernel: [drm:drm_ioctl] pid=100699, dev=0xe200, auth=1, AMDGPU_BO_LIST
 > >   >  > > Dec 22 22:36:22 daleks kernel: [drm:drm_ioctl] pid=100699, dev=0xe200, auth=1, AMDGPU_CS
 > >   >  > > Dec 22 22:36:22 daleks kernel: created fence 86#116946 87#116946 @210850557790
 > >   >  > > Dec 22 22:36:22 daleks kernel: [drm:drm_ioctl] pid=100793, dev=0xe200, auth=1, AMDGPU_BO_LIST
 > >   >  > > Dec 22 22:36:22 daleks kernel: [drm:drm_ioctl] pid=100699, dev=0xe200, auth=1, AMDGPU_BO_LIST
 > >   >  > > Dec 22 22:36:22 daleks kernel: [drm:drm_ioctl] pid=100793, dev=0xe200, auth=1, AMDGPU_CS
 > >   >  > > Dec 22 22:36:22 daleks kernel: created fence 410#116531 411#116531 @210850614023
 > >   >  > > Dec 22 22:36:22 daleks kernel: [drm:drm_ioctl] pid=100793, dev=0xe200, auth=1, AMDGPU_BO_LIST
 > >   >  > > Dec 22 22:36:22 daleks kernel: [drm:drm_ioctl] pid=100699, dev=0xe200, auth=1, AMDGPU_BO_LIST
 > >   >  > > Dec 22 22:36:22 daleks kernel: [drm:drm_ioctl] pid=100699, dev=0xe200, auth=1, AMDGPU_CS
 > >   >  > > Dec 22 22:36:22 daleks kernel: created fence 86#116947 87#116947 @210850719230
 > >   >  > > Dec 22 22:36:22 daleks kernel: [drm:drm_ioctl] pid=100793, dev=0xe200, auth=1, AMDGPU_WAIT_CS
 > >   >  > > Dec 22 22:36:22 daleks kernel: [drm:drm_ioctl] amdgpu_cs_wait on 411#116530
 > >   >  > > Dec 22 22:36:22 daleks kernel: pid=100699, dev=0xe200, auth=1, AMDGPU_BO_LIST
 > >   >  > > Dec 22 22:36:22 daleks kernel: 411#116530 sleeping tid 100793 at 210850747487
 > >   >  > >
 > >   >  > >
 > >   >  > > -M
 > >   >  > >
 > >   >  > > _______________________________________________
 > >   >  > > amd-gfx mailing list
 > >   >  > > amd-gfx at lists.freedesktop.org
 > >   >  > > https://lists.freedesktop.org/mailman/listinfo/amd-gfx
 > >   >  >
 > >   >  >
 > >   >  >
 > >   >
 > >   >
 > >   >
 > >
 > >
 > 
 > 



