[PATCH 2/5] drm/amdgpu: add ring soft recovery v2

ckoenig.leichtzumerken@xxxxxxxxx (Christian König) · Fri, 24 Aug 2018 09:28:53 +0200

Am 23.08.2018 um 22:21 schrieb Marek OlÅ¡Ã¡k:
> On Thu, Aug 23, 2018 at 2:51 AM Christian KÃ¶nig
> <ckoenig.leichtzumerken at gmail.com> wrote:
>> Am 22.08.2018 um 21:32 schrieb Marek OlÅ¡Ã¡k:
>>> On Wed, Aug 22, 2018 at 12:56 PM Alex Deucher <alexdeucher at gmail.com> wrote:
>>>> On Wed, Aug 22, 2018 at 6:05 AM Christian KÃ¶nig
>>>> <ckoenig.leichtzumerken at gmail.com> wrote:
>>>>> Instead of hammering hard on the GPU try a soft recovery first.
>>>>>
>>>>> v2: reorder code a bit
>>>>>
>>>>> Signed-off-by: Christian KÃ¶nig <christian.koenig at amd.com>
>>>>> ---
>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_job.c  |  6 ++++++
>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 24 ++++++++++++++++++++++++
>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h |  4 ++++
>>>>>    3 files changed, 34 insertions(+)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>>> index 265ff90f4e01..d93e31a5c4e7 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>>> @@ -33,6 +33,12 @@ static void amdgpu_job_timedout(struct drm_sched_job *s_job)
>>>>>           struct amdgpu_ring *ring = to_amdgpu_ring(s_job->sched);
>>>>>           struct amdgpu_job *job = to_amdgpu_job(s_job);
>>>>>
>>>>> +       if (amdgpu_ring_soft_recovery(ring, job->vmid, s_job->s_fence->parent)) {
>>>>> +               DRM_ERROR("ring %s timeout, but soft recovered\n",
>>>>> +                         s_job->sched->name);
>>>>> +               return;
>>>>> +       }
>>>> I think we should still bubble up the error to userspace even if we
>>>> can recover.  Data is lost when the wave is killed.  We should treat
>>>> it like a GPU reset.
>>> Yes, please increment gpu_reset_counter, so that we are compliant with
>>> OpenGL. Being able to recover from infinite loops is great, but test
>>> suites also expect this to be properly reported to userspace via the
>>> per-context query.
>> Sure that shouldn't be a problem.
>>
>>> Also please bump the deadline to 1 second. Even you if you kill all
>>> shaders, the IB can also contain CP DMA, which may take longer than 1
>>> ms.
>> Is there any way we can get a feedback from the SQ if the kill was
>> successfully?
> I don't think so. The kill should be finished pretty quickly, but more
> waves with infinite loops may be waiting to be launched, so you still
> need to repeat the kill command. And we should ideally repeat it for 1
> second.
>
> The reason is that vertex shader waves take a lot of time to launch. A
> very very very large draw call can keep launching new waves for 1
> second with the same infinite loop. You would have to soft-reset all
> VGTs to stop that.
>
>> 1 second is way to long, since in the case of a blocked MC we need to
>> start up hard reset relative fast.
> 10 seconds have already passed.

That is a good argument, but I still find 1 second way to long.

Keep in mind that we hammer on the SQ here to kill the waves as soon as 
they are launched.

> I think that some hangs from corrupted descriptors may still be
> recoverable just by killing waves.

Agreed, especially since Vega10 the memory paths became much more robust 
to error handling.

Christian.

>
> Marek
>
>> Regards,
>> Christian.
>>
>>> Marek
>>>
>>> Marek
>>>
>>>> Alex
>>>>
>>>>> +
>>>>>           DRM_ERROR("ring %s timeout, signaled seq=%u, emitted seq=%u\n",
>>>>>                     job->base.sched->name, atomic_read(&ring->fence_drv.last_seq),
>>>>>                     ring->fence_drv.sync_seq);
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
>>>>> index 5dfd26be1eec..c045a4e38ad1 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
>>>>> @@ -383,6 +383,30 @@ void amdgpu_ring_emit_reg_write_reg_wait_helper(struct amdgpu_ring *ring,
>>>>>           amdgpu_ring_emit_reg_wait(ring, reg1, mask, mask);
>>>>>    }
>>>>>
>>>>> +/**
>>>>> + * amdgpu_ring_soft_recovery - try to soft recover a ring lockup
>>>>> + *
>>>>> + * @ring: ring to try the recovery on
>>>>> + * @vmid: VMID we try to get going again
>>>>> + * @fence: timedout fence
>>>>> + *
>>>>> + * Tries to get a ring proceeding again when it is stuck.
>>>>> + */
>>>>> +bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, unsigned int vmid,
>>>>> +                              struct dma_fence *fence)
>>>>> +{
>>>>> +       ktime_t deadline = ktime_add_us(ktime_get(), 1000);
>>>>> +
>>>>> +       if (!ring->funcs->soft_recovery)
>>>>> +               return false;
>>>>> +
>>>>> +       while (!dma_fence_is_signaled(fence) &&
>>>>> +              ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0)
>>>>> +               ring->funcs->soft_recovery(ring, vmid);
>>>>> +
>>>>> +       return dma_fence_is_signaled(fence);
>>>>> +}
>>>>> +
>>>>>    /*
>>>>>     * Debugfs info
>>>>>     */
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>>>> index 409fdd9b9710..9cc239968e40 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>>>> @@ -168,6 +168,8 @@ struct amdgpu_ring_funcs {
>>>>>           /* priority functions */
>>>>>           void (*set_priority) (struct amdgpu_ring *ring,
>>>>>                                 enum drm_sched_priority priority);
>>>>> +       /* Try to soft recover the ring to make the fence signal */
>>>>> +       void (*soft_recovery)(struct amdgpu_ring *ring, unsigned vmid);
>>>>>    };
>>>>>
>>>>>    struct amdgpu_ring {
>>>>> @@ -260,6 +262,8 @@ void amdgpu_ring_fini(struct amdgpu_ring *ring);
>>>>>    void amdgpu_ring_emit_reg_write_reg_wait_helper(struct amdgpu_ring *ring,
>>>>>                                                   uint32_t reg0, uint32_t val0,
>>>>>                                                   uint32_t reg1, uint32_t val1);
>>>>> +bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, unsigned int vmid,
>>>>> +                              struct dma_fence *fence);
>>>>>
>>>>>    static inline void amdgpu_ring_clear_ring(struct amdgpu_ring *ring)
>>>>>    {
>>>>> --
>>>>> 2.14.1
>>>>>
>>>>> _______________________________________________
>>>>> amd-gfx mailing list
>>>>> amd-gfx at lists.freedesktop.org
>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>> _______________________________________________
>>>> amd-gfx mailing list
>>>> amd-gfx at lists.freedesktop.org
>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
> _______________________________________________
> amd-gfx mailing list
> amd-gfx at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx