No new issues found with those patches while testing GPU reset with the libdrm deadlock detection test on Ellesmere.

The patches are Tested-by: Andrey Grodzovsky <andrey.grodzovsky at amd.com>

P.S. Noticed some pre-existing issues (present before Monk's patches):

Multiple
[drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser !

and an occasional unlock imbalance from amdgpu_cs_ioctl:
DEBUG_LOCKS_WARN_ON(depth <= 0)
[  93.069011 <   0.000017>] WARNING: CPU: 3 PID: 2215 at kernel/locking/lockdep.c:3682 lock_release+0x2e8/0x360

On CZ, a full reset hangs the system.

Going to take a look at those issues.

Thanks,
Andrey

On 02/28/2018 08:31 AM, Liu, Monk wrote:
> Already sent
>
> -----Original Message-----
> From: Grodzovsky, Andrey
> Sent: February 28, 2018 21:31
> To: Koenig, Christian <Christian.Koenig at amd.com>; Liu, Monk <Monk.Liu at amd.com>; amd-gfx at lists.freedesktop.org
> Subject: Re: [PATCH 1/4] drm/amdgpu: stop all rings before doing gpu recover
>
> Will do once Monk sends V2 for [PATCH 4/4] drm/amdgpu: try again kiq access if not in IRQ
>
> Andrey
>
>
> On 02/28/2018 07:20 AM, Christian König wrote:
>> Andrey, please give this set a good round of testing as well.
>>
>> On 28.02.2018 08:21, Monk Liu wrote:
>>> found recover_vram_from_shadow sometimes gets executed in parallel with
>>> the SDMA scheduler; should stop all schedulers before doing gpu
>>> reset/recover
>>>
>>> Change-Id: Ibaef3e3c015f3cf88f84b2eaf95cda95ae1a64e3
>>> Signed-off-by: Monk Liu <Monk.Liu at amd.com>
>> For now this patch is Reviewed-by: Christian König <christian.koenig at amd.com>.
>>
>> Regards,
>> Christian.
>>
>>> ---
>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 40 +++++++++++-------------------
>>>  1 file changed, 15 insertions(+), 25 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index 75d1733..e9d81a8 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -2649,22 +2649,23 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>      /* block TTM */
>>>      resched = ttm_bo_lock_delayed_workqueue(&adev->mman.bdev);
>>> +
>>>      /* store modesetting */
>>>      if (amdgpu_device_has_dc_support(adev))
>>>          state = drm_atomic_helper_suspend(adev->ddev);
>>>  
>>> -   /* block scheduler */
>>> +   /* block all schedulers and reset given job's ring */
>>>      for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>>          struct amdgpu_ring *ring = adev->rings[i];
>>>  
>>>          if (!ring || !ring->sched.thread)
>>>              continue;
>>>  
>>> -       /* only focus on the ring hit timeout if &job not NULL */
>>> +       kthread_park(ring->sched.thread);
>>> +
>>>          if (job && job->ring->idx != i)
>>>              continue;
>>>  
>>> -       kthread_park(ring->sched.thread);
>>>          drm_sched_hw_job_reset(&ring->sched, &job->base);
>>>  
>>>          /* after all hw jobs are reset, hw fence is meaningless, so force_completion */
>>> @@ -2707,33 +2708,22 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>              }
>>>              dma_fence_put(fence);
>>>          }
>>> +   }
>>>  
>>> -       for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>> -           struct amdgpu_ring *ring = adev->rings[i];
>>> -
>>> -           if (!ring || !ring->sched.thread)
>>> -               continue;
>>> +   for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>> +       struct amdgpu_ring *ring = adev->rings[i];
>>>  
>>> -           /* only focus on the ring hit timeout if &job not NULL */
>>> -           if (job && job->ring->idx != i)
>>> -               continue;
>>> +       if (!ring || !ring->sched.thread)
>>> +           continue;
>>>  
>>> +       /* only need recovery sched of the given job's ring
>>> +        * or all rings (in the case @job is NULL)
>>> +        * after above amdgpu_reset accomplished
>>> +        */
>>> +       if ((!job || job->ring->idx == i) && !r)
>>>              drm_sched_job_recovery(&ring->sched);
>>> -           kthread_unpark(ring->sched.thread);
>>> -       }
>>> -   } else {
>>> -       for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>> -           struct amdgpu_ring *ring = adev->rings[i];
>>>  
>>> -           if (!ring || !ring->sched.thread)
>>> -               continue;
>>> -
>>> -           /* only focus on the ring hit timeout if &job not NULL */
>>> -           if (job && job->ring->idx != i)
>>> -               continue;
>>> -
>>> -           kthread_unpark(adev->rings[i]->sched.thread);
>>> -       }
>>> +       kthread_unpark(ring->sched.thread);
>>>      }
>>>  
>>>      if (amdgpu_device_has_dc_support(adev)) {
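
[Editor's note: for readers not familiar with the scheduler-park flow, below is a minimal, self-contained userspace sketch of the ordering the patch enforces: park every scheduler thread first, do the reset work while nothing else runs, then recover and unpark all rings. It uses plain pthreads; all names (model_ring, park_ring, gpu_recover, etc.) are invented for illustration and are not amdgpu code.]

/* Build with: cc -pthread sketch.c */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

#define MAX_RINGS 4

struct model_ring {
    pthread_t thread;
    pthread_mutex_t lock;
    pthread_cond_t cond;
    bool park_requested;   /* recovery path wants this scheduler idle */
    bool parked;           /* scheduler acknowledged the park request */
    int idx;
};

static struct model_ring rings[MAX_RINGS];

/* Stand-in for a ring's scheduler kthread: run "jobs" unless parked. */
static void *sched_main(void *arg)
{
    struct model_ring *ring = arg;

    for (int iter = 0; iter < 100; iter++) {
        pthread_mutex_lock(&ring->lock);
        while (ring->park_requested) {
            ring->parked = true;                    /* ack the park */
            pthread_cond_broadcast(&ring->cond);
            pthread_cond_wait(&ring->cond, &ring->lock);
        }
        ring->parked = false;
        pthread_mutex_unlock(&ring->lock);

        usleep(1000);                               /* "run" one job */
    }
    return NULL;
}

/* Like kthread_park(): returns only once the scheduler is idle. */
static void park_ring(struct model_ring *ring)
{
    pthread_mutex_lock(&ring->lock);
    ring->park_requested = true;
    while (!ring->parked)
        pthread_cond_wait(&ring->cond, &ring->lock);
    pthread_mutex_unlock(&ring->lock);
}

/* Like kthread_unpark(): let the scheduler resume. */
static void unpark_ring(struct model_ring *ring)
{
    pthread_mutex_lock(&ring->lock);
    ring->park_requested = false;
    pthread_cond_broadcast(&ring->cond);
    pthread_mutex_unlock(&ring->lock);
}

static void gpu_recover(int hung_ring)
{
    /* 1. Park every scheduler so nothing (e.g. shadow-VRAM recovery)
     *    can run in parallel with the reset. */
    for (int i = 0; i < MAX_RINGS; i++)
        park_ring(&rings[i]);

    /* 2. Reset only the hung ring's job, then do the (simulated) reset. */
    printf("resetting job on ring %d, then full reset\n", hung_ring);

    /* 3. Recover the affected scheduler(s) and unpark everything. */
    for (int i = 0; i < MAX_RINGS; i++)
        unpark_ring(&rings[i]);
}

int main(void)
{
    for (int i = 0; i < MAX_RINGS; i++) {
        rings[i].idx = i;
        pthread_mutex_init(&rings[i].lock, NULL);
        pthread_cond_init(&rings[i].cond, NULL);
        pthread_create(&rings[i].thread, NULL, sched_main, &rings[i]);
    }

    usleep(10000);
    gpu_recover(2);                                 /* pretend ring 2 timed out */

    for (int i = 0; i < MAX_RINGS; i++)
        pthread_join(rings[i].thread, NULL);
    return 0;
}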