Am 23.08.2018 um 22:21 schrieb Marek Olšák: > On Thu, Aug 23, 2018 at 2:51 AM Christian König > <ckoenig.leichtzumerken at gmail.com> wrote: >> Am 22.08.2018 um 21:32 schrieb Marek Olšák: >>> On Wed, Aug 22, 2018 at 12:56 PM Alex Deucher <alexdeucher at gmail.com> wrote: >>>> On Wed, Aug 22, 2018 at 6:05 AM Christian König >>>> <ckoenig.leichtzumerken at gmail.com> wrote: >>>>> Instead of hammering hard on the GPU try a soft recovery first. >>>>> >>>>> v2: reorder code a bit >>>>> >>>>> Signed-off-by: Christian König <christian.koenig at amd.com> >>>>> --- >>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 6 ++++++ >>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 24 ++++++++++++++++++++++++ >>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h | 4 ++++ >>>>> 3 files changed, 34 insertions(+) >>>>> >>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c >>>>> index 265ff90f4e01..d93e31a5c4e7 100644 >>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c >>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c >>>>> @@ -33,6 +33,12 @@ static void amdgpu_job_timedout(struct drm_sched_job *s_job) >>>>> struct amdgpu_ring *ring = to_amdgpu_ring(s_job->sched); >>>>> struct amdgpu_job *job = to_amdgpu_job(s_job); >>>>> >>>>> + if (amdgpu_ring_soft_recovery(ring, job->vmid, s_job->s_fence->parent)) { >>>>> + DRM_ERROR("ring %s timeout, but soft recovered\n", >>>>> + s_job->sched->name); >>>>> + return; >>>>> + } >>>> I think we should still bubble up the error to userspace even if we >>>> can recover. Data is lost when the wave is killed. We should treat >>>> it like a GPU reset. >>> Yes, please increment gpu_reset_counter, so that we are compliant with >>> OpenGL. Being able to recover from infinite loops is great, but test >>> suites also expect this to be properly reported to userspace via the >>> per-context query. >> Sure that shouldn't be a problem. >> >>> Also please bump the deadline to 1 second. Even you if you kill all >>> shaders, the IB can also contain CP DMA, which may take longer than 1 >>> ms. >> Is there any way we can get a feedback from the SQ if the kill was >> successfully? > I don't think so. The kill should be finished pretty quickly, but more > waves with infinite loops may be waiting to be launched, so you still > need to repeat the kill command. And we should ideally repeat it for 1 > second. > > The reason is that vertex shader waves take a lot of time to launch. A > very very very large draw call can keep launching new waves for 1 > second with the same infinite loop. You would have to soft-reset all > VGTs to stop that. > >> 1 second is way to long, since in the case of a blocked MC we need to >> start up hard reset relative fast. > 10 seconds have already passed. That is a good argument, but I still find 1 second way to long. Keep in mind that we hammer on the SQ here to kill the waves as soon as they are launched. > I think that some hangs from corrupted descriptors may still be > recoverable just by killing waves. Agreed, especially since Vega10 the memory paths became much more robust to error handling. Christian. > > Marek > >> Regards, >> Christian. >> >>> Marek >>> >>> Marek >>> >>>> Alex >>>> >>>>> + >>>>> DRM_ERROR("ring %s timeout, signaled seq=%u, emitted seq=%u\n", >>>>> job->base.sched->name, atomic_read(&ring->fence_drv.last_seq), >>>>> ring->fence_drv.sync_seq); >>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c >>>>> index 5dfd26be1eec..c045a4e38ad1 100644 >>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c >>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c >>>>> @@ -383,6 +383,30 @@ void amdgpu_ring_emit_reg_write_reg_wait_helper(struct amdgpu_ring *ring, >>>>> amdgpu_ring_emit_reg_wait(ring, reg1, mask, mask); >>>>> } >>>>> >>>>> +/** >>>>> + * amdgpu_ring_soft_recovery - try to soft recover a ring lockup >>>>> + * >>>>> + * @ring: ring to try the recovery on >>>>> + * @vmid: VMID we try to get going again >>>>> + * @fence: timedout fence >>>>> + * >>>>> + * Tries to get a ring proceeding again when it is stuck. >>>>> + */ >>>>> +bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, unsigned int vmid, >>>>> + struct dma_fence *fence) >>>>> +{ >>>>> + ktime_t deadline = ktime_add_us(ktime_get(), 1000); >>>>> + >>>>> + if (!ring->funcs->soft_recovery) >>>>> + return false; >>>>> + >>>>> + while (!dma_fence_is_signaled(fence) && >>>>> + ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0) >>>>> + ring->funcs->soft_recovery(ring, vmid); >>>>> + >>>>> + return dma_fence_is_signaled(fence); >>>>> +} >>>>> + >>>>> /* >>>>> * Debugfs info >>>>> */ >>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h >>>>> index 409fdd9b9710..9cc239968e40 100644 >>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h >>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h >>>>> @@ -168,6 +168,8 @@ struct amdgpu_ring_funcs { >>>>> /* priority functions */ >>>>> void (*set_priority) (struct amdgpu_ring *ring, >>>>> enum drm_sched_priority priority); >>>>> + /* Try to soft recover the ring to make the fence signal */ >>>>> + void (*soft_recovery)(struct amdgpu_ring *ring, unsigned vmid); >>>>> }; >>>>> >>>>> struct amdgpu_ring { >>>>> @@ -260,6 +262,8 @@ void amdgpu_ring_fini(struct amdgpu_ring *ring); >>>>> void amdgpu_ring_emit_reg_write_reg_wait_helper(struct amdgpu_ring *ring, >>>>> uint32_t reg0, uint32_t val0, >>>>> uint32_t reg1, uint32_t val1); >>>>> +bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, unsigned int vmid, >>>>> + struct dma_fence *fence); >>>>> >>>>> static inline void amdgpu_ring_clear_ring(struct amdgpu_ring *ring) >>>>> { >>>>> -- >>>>> 2.14.1 >>>>> >>>>> _______________________________________________ >>>>> amd-gfx mailing list >>>>> amd-gfx at lists.freedesktop.org >>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx >>>> _______________________________________________ >>>> amd-gfx mailing list >>>> amd-gfx at lists.freedesktop.org >>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx > _______________________________________________ > amd-gfx mailing list > amd-gfx at lists.freedesktop.org > https://lists.freedesktop.org/mailman/listinfo/amd-gfx