[PATCH 2/5] drm/amdgpu: add ring soft recovery v2

ckoenig.leichtzumerken@xxxxxxxxx (Christian König) · Thu, 23 Aug 2018 08:51:25 +0200

On 22.08.2018 at 21:32, Marek Olšák wrote:
> On Wed, Aug 22, 2018 at 12:56 PM Alex Deucher <alexdeucher at gmail.com> wrote:
>> On Wed, Aug 22, 2018 at 6:05 AM Christian König
>> <ckoenig.leichtzumerken at gmail.com> wrote:
>>> Instead of hammering hard on the GPU try a soft recovery first.
>>>
>>> v2: reorder code a bit
>>>
>>> Signed-off-by: Christian König <christian.koenig at amd.com>
>>> ---
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c  |  6 ++++++
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 24 ++++++++++++++++++++++++
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h |  4 ++++
>>>   3 files changed, 34 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>> index 265ff90f4e01..d93e31a5c4e7 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>> @@ -33,6 +33,12 @@ static void amdgpu_job_timedout(struct drm_sched_job *s_job)
>>>          struct amdgpu_ring *ring = to_amdgpu_ring(s_job->sched);
>>>          struct amdgpu_job *job = to_amdgpu_job(s_job);
>>>
>>> +       if (amdgpu_ring_soft_recovery(ring, job->vmid, s_job->s_fence->parent)) {
>>> +               DRM_ERROR("ring %s timeout, but soft recovered\n",
>>> +                         s_job->sched->name);
>>> +               return;
>>> +       }
>> I think we should still bubble up the error to userspace even if we
>> can recover.  Data is lost when the wave is killed.  We should treat
>> it like a GPU reset.
> Yes, please increment gpu_reset_counter, so that we are compliant with
> OpenGL. Being able to recover from infinite loops is great, but test
> suites also expect this to be properly reported to userspace via the
> per-context query.

Sure, that shouldn't be a problem.

> Also please bump the deadline to 1 second. Even if you kill all
> shaders, the IB can also contain CP DMA, which may take longer than 1
> ms.

Is there any way we can get feedback from the SQ on whether the kill
was successful?

1 second is way too long, since in the case of a blocked MC we need to
start a hard reset relatively fast.
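
To make the trade-off concrete, here is a rough userspace sketch (not
driver code) of the deadline-bounded retry loop from the patch, with the
requested reset-counter bump added. Everything in it is a hypothetical
stand-in: struct fake_ring, fake_fence_signaled(), fake_soft_recovery()
and soft_recover() are not amdgpu symbols, the "fence" simply signals
after a fixed number of kicks, and counting the attempt before the loop
is only one possible placement.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for a ring; not an amdgpu structure. */
struct fake_ring {
	unsigned int kicks_until_signal; /* kicks needed before the fence signals */
	unsigned int kicks;              /* soft recovery kicks issued so far */
	unsigned int reset_counter;      /* plays the role of gpu_reset_counter */
};

/* Monotonic clock in nanoseconds, standing in for ktime_get(). */
static int64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Stands in for dma_fence_is_signaled(). */
static bool fake_fence_signaled(const struct fake_ring *ring)
{
	return ring->kicks >= ring->kicks_until_signal;
}

/* Stands in for ring->funcs->soft_recovery(). */
static void fake_soft_recovery(struct fake_ring *ring)
{
	ring->kicks++;
}

/*
 * Kick the ring until the fence signals or timeout_us has elapsed.
 * Returns true if the fence signaled, false if the caller should fall
 * back to a hard reset.
 */
static bool soft_recover(struct fake_ring *ring, int64_t timeout_us)
{
	assert(ring != NULL && timeout_us >= 0);

	int64_t deadline = now_ns() + timeout_us * 1000;

	/* Assumption: count the attempt so userspace can observe it. */
	ring->reset_counter++;

	while (!fake_fence_signaled(ring) && now_ns() < deadline)
		fake_soft_recovery(ring);

	return fake_fence_signaled(ring);
}
```

The timeout is the upper bound on how long a truly stuck ring delays the
hard reset: the loop only exits early when the fence signals, so with a
blocked MC the full timeout is always spent before falling back.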

Regards,
Christian.

>
> Marek
>
>> Alex
>>
>>> +
>>>          DRM_ERROR("ring %s timeout, signaled seq=%u, emitted seq=%u\n",
>>>                    job->base.sched->name, atomic_read(&ring->fence_drv.last_seq),
>>>                    ring->fence_drv.sync_seq);
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
>>> index 5dfd26be1eec..c045a4e38ad1 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
>>> @@ -383,6 +383,30 @@ void amdgpu_ring_emit_reg_write_reg_wait_helper(struct amdgpu_ring *ring,
>>>          amdgpu_ring_emit_reg_wait(ring, reg1, mask, mask);
>>>   }
>>>
>>> +/**
>>> + * amdgpu_ring_soft_recovery - try to soft recover a ring lockup
>>> + *
>>> + * @ring: ring to try the recovery on
>>> + * @vmid: VMID we try to get going again
>>> + * @fence: timedout fence
>>> + *
>>> + * Tries to get a ring proceeding again when it is stuck.
>>> + */
>>> +bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, unsigned int vmid,
>>> +                              struct dma_fence *fence)
>>> +{
>>> +       ktime_t deadline = ktime_add_us(ktime_get(), 1000);
>>> +
>>> +       if (!ring->funcs->soft_recovery)
>>> +               return false;
>>> +
>>> +       while (!dma_fence_is_signaled(fence) &&
>>> +              ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0)
>>> +               ring->funcs->soft_recovery(ring, vmid);
>>> +
>>> +       return dma_fence_is_signaled(fence);
>>> +}
>>> +
>>>   /*
>>>    * Debugfs info
>>>    */
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>> index 409fdd9b9710..9cc239968e40 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>> @@ -168,6 +168,8 @@ struct amdgpu_ring_funcs {
>>>          /* priority functions */
>>>          void (*set_priority) (struct amdgpu_ring *ring,
>>>                                enum drm_sched_priority priority);
>>> +       /* Try to soft recover the ring to make the fence signal */
>>> +       void (*soft_recovery)(struct amdgpu_ring *ring, unsigned vmid);
>>>   };
>>>
>>>   struct amdgpu_ring {
>>> @@ -260,6 +262,8 @@ void amdgpu_ring_fini(struct amdgpu_ring *ring);
>>>   void amdgpu_ring_emit_reg_write_reg_wait_helper(struct amdgpu_ring *ring,
>>>                                                  uint32_t reg0, uint32_t val0,
>>>                                                  uint32_t reg1, uint32_t val1);
>>> +bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, unsigned int vmid,
>>> +                              struct dma_fence *fence);
>>>
>>>   static inline void amdgpu_ring_clear_ring(struct amdgpu_ring *ring)
>>>   {
>>> --
>>> 2.14.1
>>>
>>> _______________________________________________
>>> amd-gfx mailing list
>>> amd-gfx at lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx