On Wed, Aug 22, 2018 at 12:56 PM Alex Deucher <alexdeucher at gmail.com> wrote: > > On Wed, Aug 22, 2018 at 6:05 AM Christian König > <ckoenig.leichtzumerken at gmail.com> wrote: > > > > Instead of hammering hard on the GPU try a soft recovery first. > > > > v2: reorder code a bit > > > > Signed-off-by: Christian König <christian.koenig at amd.com> > > --- > > drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 6 ++++++ > > drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 24 ++++++++++++++++++++++++ > > drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h | 4 ++++ > > 3 files changed, 34 insertions(+) > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c > > index 265ff90f4e01..d93e31a5c4e7 100644 > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c > > @@ -33,6 +33,12 @@ static void amdgpu_job_timedout(struct drm_sched_job *s_job) > > struct amdgpu_ring *ring = to_amdgpu_ring(s_job->sched); > > struct amdgpu_job *job = to_amdgpu_job(s_job); > > > > + if (amdgpu_ring_soft_recovery(ring, job->vmid, s_job->s_fence->parent)) { > > + DRM_ERROR("ring %s timeout, but soft recovered\n", > > + s_job->sched->name); > > + return; > > + } > > I think we should still bubble up the error to userspace even if we > can recover. Data is lost when the wave is killed. We should treat > it like a GPU reset. Yes, please increment gpu_reset_counter, so that we are compliant with OpenGL. Being able to recover from infinite loops is great, but test suites also expect this to be properly reported to userspace via the per-context query. Also please bump the deadline to 1 second. Even you if you kill all shaders, the IB can also contain CP DMA, which may take longer than 1 ms. Marek Marek > > Alex > > > + > > DRM_ERROR("ring %s timeout, signaled seq=%u, emitted seq=%u\n", > > job->base.sched->name, atomic_read(&ring->fence_drv.last_seq), > > ring->fence_drv.sync_seq); > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c > > index 5dfd26be1eec..c045a4e38ad1 100644 > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c > > @@ -383,6 +383,30 @@ void amdgpu_ring_emit_reg_write_reg_wait_helper(struct amdgpu_ring *ring, > > amdgpu_ring_emit_reg_wait(ring, reg1, mask, mask); > > } > > > > +/** > > + * amdgpu_ring_soft_recovery - try to soft recover a ring lockup > > + * > > + * @ring: ring to try the recovery on > > + * @vmid: VMID we try to get going again > > + * @fence: timedout fence > > + * > > + * Tries to get a ring proceeding again when it is stuck. > > + */ > > +bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, unsigned int vmid, > > + struct dma_fence *fence) > > +{ > > + ktime_t deadline = ktime_add_us(ktime_get(), 1000); > > + > > + if (!ring->funcs->soft_recovery) > > + return false; > > + > > + while (!dma_fence_is_signaled(fence) && > > + ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0) > > + ring->funcs->soft_recovery(ring, vmid); > > + > > + return dma_fence_is_signaled(fence); > > +} > > + > > /* > > * Debugfs info > > */ > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h > > index 409fdd9b9710..9cc239968e40 100644 > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h > > @@ -168,6 +168,8 @@ struct amdgpu_ring_funcs { > > /* priority functions */ > > void (*set_priority) (struct amdgpu_ring *ring, > > enum drm_sched_priority priority); > > + /* Try to soft recover the ring to make the fence signal */ > > + void (*soft_recovery)(struct amdgpu_ring *ring, unsigned vmid); > > }; > > > > struct amdgpu_ring { > > @@ -260,6 +262,8 @@ void amdgpu_ring_fini(struct amdgpu_ring *ring); > > void amdgpu_ring_emit_reg_write_reg_wait_helper(struct amdgpu_ring *ring, > > uint32_t reg0, uint32_t val0, > > uint32_t reg1, uint32_t val1); > > +bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, unsigned int vmid, > > + struct dma_fence *fence); > > > > static inline void amdgpu_ring_clear_ring(struct amdgpu_ring *ring) > > { > > -- > > 2.14.1 > > > > _______________________________________________ > > amd-gfx mailing list > > amd-gfx at lists.freedesktop.org > > https://lists.freedesktop.org/mailman/listinfo/amd-gfx > _______________________________________________ > amd-gfx mailing list > amd-gfx at lists.freedesktop.org > https://lists.freedesktop.org/mailman/listinfo/amd-gfx