Am 22.08.2018 um 21:32 schrieb Marek Olšák: > On Wed, Aug 22, 2018 at 12:56 PM Alex Deucher <alexdeucher at gmail.com> wrote: >> On Wed, Aug 22, 2018 at 6:05 AM Christian König >> <ckoenig.leichtzumerken at gmail.com> wrote: >>> Instead of hammering hard on the GPU try a soft recovery first. >>> >>> v2: reorder code a bit >>> >>> Signed-off-by: Christian König <christian.koenig at amd.com> >>> --- >>> drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 6 ++++++ >>> drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 24 ++++++++++++++++++++++++ >>> drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h | 4 ++++ >>> 3 files changed, 34 insertions(+) >>> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c >>> index 265ff90f4e01..d93e31a5c4e7 100644 >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c >>> @@ -33,6 +33,12 @@ static void amdgpu_job_timedout(struct drm_sched_job *s_job) >>> struct amdgpu_ring *ring = to_amdgpu_ring(s_job->sched); >>> struct amdgpu_job *job = to_amdgpu_job(s_job); >>> >>> + if (amdgpu_ring_soft_recovery(ring, job->vmid, s_job->s_fence->parent)) { >>> + DRM_ERROR("ring %s timeout, but soft recovered\n", >>> + s_job->sched->name); >>> + return; >>> + } >> I think we should still bubble up the error to userspace even if we >> can recover. Data is lost when the wave is killed. We should treat >> it like a GPU reset. > Yes, please increment gpu_reset_counter, so that we are compliant with > OpenGL. Being able to recover from infinite loops is great, but test > suites also expect this to be properly reported to userspace via the > per-context query. Sure that shouldn't be a problem. > Also please bump the deadline to 1 second. Even you if you kill all > shaders, the IB can also contain CP DMA, which may take longer than 1 > ms. Is there any way we can get a feedback from the SQ if the kill was successfully? 1 second is way to long, since in the case of a blocked MC we need to start up hard reset relative fast. Regards, Christian. > > Marek > > Marek > >> Alex >> >>> + >>> DRM_ERROR("ring %s timeout, signaled seq=%u, emitted seq=%u\n", >>> job->base.sched->name, atomic_read(&ring->fence_drv.last_seq), >>> ring->fence_drv.sync_seq); >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c >>> index 5dfd26be1eec..c045a4e38ad1 100644 >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c >>> @@ -383,6 +383,30 @@ void amdgpu_ring_emit_reg_write_reg_wait_helper(struct amdgpu_ring *ring, >>> amdgpu_ring_emit_reg_wait(ring, reg1, mask, mask); >>> } >>> >>> +/** >>> + * amdgpu_ring_soft_recovery - try to soft recover a ring lockup >>> + * >>> + * @ring: ring to try the recovery on >>> + * @vmid: VMID we try to get going again >>> + * @fence: timedout fence >>> + * >>> + * Tries to get a ring proceeding again when it is stuck. >>> + */ >>> +bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, unsigned int vmid, >>> + struct dma_fence *fence) >>> +{ >>> + ktime_t deadline = ktime_add_us(ktime_get(), 1000); >>> + >>> + if (!ring->funcs->soft_recovery) >>> + return false; >>> + >>> + while (!dma_fence_is_signaled(fence) && >>> + ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0) >>> + ring->funcs->soft_recovery(ring, vmid); >>> + >>> + return dma_fence_is_signaled(fence); >>> +} >>> + >>> /* >>> * Debugfs info >>> */ >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h >>> index 409fdd9b9710..9cc239968e40 100644 >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h >>> @@ -168,6 +168,8 @@ struct amdgpu_ring_funcs { >>> /* priority functions */ >>> void (*set_priority) (struct amdgpu_ring *ring, >>> enum drm_sched_priority priority); >>> + /* Try to soft recover the ring to make the fence signal */ >>> + void (*soft_recovery)(struct amdgpu_ring *ring, unsigned vmid); >>> }; >>> >>> struct amdgpu_ring { >>> @@ -260,6 +262,8 @@ void amdgpu_ring_fini(struct amdgpu_ring *ring); >>> void amdgpu_ring_emit_reg_write_reg_wait_helper(struct amdgpu_ring *ring, >>> uint32_t reg0, uint32_t val0, >>> uint32_t reg1, uint32_t val1); >>> +bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, unsigned int vmid, >>> + struct dma_fence *fence); >>> >>> static inline void amdgpu_ring_clear_ring(struct amdgpu_ring *ring) >>> { >>> -- >>> 2.14.1 >>> >>> _______________________________________________ >>> amd-gfx mailing list >>> amd-gfx at lists.freedesktop.org >>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx >> _______________________________________________ >> amd-gfx mailing list >> amd-gfx at lists.freedesktop.org >> https://lists.freedesktop.org/mailman/listinfo/amd-gfx