No new issues found with those patches while testing GPU reset with the libdrm deadlock detection test on Ellesmere.

The patches are Tested-by: Andrey Grodzovsky <andrey.grodzovsky at amd.com>

P.S. Noticed some pre-existing issues (present before Monk's patches):

Multiple
[drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize parser !

and an occasional unlock imbalance from amdgpu_cs_ioctl:
DEBUG_LOCKS_WARN_ON(depth <= 0)
[  93.069011 <   0.000017>] WARNING: CPU: 3 PID: 2215 at kernel/locking/lockdep.c:3682 lock_release+0x2e8/0x360

On CZ, a full reset hangs the system.

Going to take a look at those issues.

Thanks,
Andrey

On 02/28/2018 08:31 AM, Liu, Monk wrote:
> Already sent
>
> -----Original Message-----
> From: Grodzovsky, Andrey
> Sent: February 28, 2018 21:31
> To: Koenig, Christian <Christian.Koenig at amd.com>; Liu, Monk <Monk.Liu at amd.com>; amd-gfx at lists.freedesktop.org
> Subject: Re: [PATCH 1/4] drm/amdgpu: stop all rings before doing gpu recover
>
> Will do once Monk sends V2 for [PATCH 4/4] drm/amdgpu: try again kiq access if not in IRQ
>
> Andrey
>
>
> On 02/28/2018 07:20 AM, Christian König wrote:
>> Andrey, please give this set a good round of testing as well.
>>
>> On 28.02.2018 08:21, Monk Liu wrote:
>>> found recover_vram_from_shadow sometimes gets executed in parallel with
>>> the SDMA scheduler; should stop all schedulers before doing gpu
>>> reset/recover
>>>
>>> Change-Id: Ibaef3e3c015f3cf88f84b2eaf95cda95ae1a64e3
>>> Signed-off-by: Monk Liu <Monk.Liu at amd.com>
>> For now this patch is Reviewed-by: Christian König <christian.koenig at amd.com>.
>>
>> Regards,
>> Christian.
>>
>>> ---
>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 40 +++++++++++-------------------
>>>  1 file changed, 15 insertions(+), 25 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index 75d1733..e9d81a8 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -2649,22 +2649,23 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>      /* block TTM */
>>>      resched = ttm_bo_lock_delayed_workqueue(&adev->mman.bdev);
>>> +
>>>      /* store modesetting */
>>>      if (amdgpu_device_has_dc_support(adev))
>>>          state = drm_atomic_helper_suspend(adev->ddev);
>>>  
>>> -   /* block scheduler */
>>> +   /* block all schedulers and reset given job's ring */
>>>      for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>>          struct amdgpu_ring *ring = adev->rings[i];
>>>  
>>>          if (!ring || !ring->sched.thread)
>>>              continue;
>>>  
>>> -       /* only focus on the ring hit timeout if &job not NULL */
>>> +       kthread_park(ring->sched.thread);
>>> +
>>>          if (job && job->ring->idx != i)
>>>              continue;
>>>  
>>> -       kthread_park(ring->sched.thread);
>>>          drm_sched_hw_job_reset(&ring->sched, &job->base);
>>>  
>>>          /* after all hw jobs are reset, hw fence is meaningless, so force_completion */
>>> @@ -2707,33 +2708,22 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>              }
>>>              dma_fence_put(fence);
>>>          }
>>> +   }
>>>  
>>> -       for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>> -           struct amdgpu_ring *ring = adev->rings[i];
>>> -
>>> -           if (!ring || !ring->sched.thread)
>>> -               continue;
>>> +   for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>> +       struct amdgpu_ring *ring = adev->rings[i];
>>>  
>>> -           /* only focus on the ring hit timeout if &job not NULL */
>>> -           if (job && job->ring->idx != i)
>>> -               continue;
>>> +       if (!ring || !ring->sched.thread)
>>> +           continue;
>>>  
>>> +       /* only need recovery sched of the given job's ring
>>> +        * or all rings (in the case @job is NULL)
>>> +        * after above amdgpu_reset accomplished
>>> +        */
>>> +       if ((!job || job->ring->idx == i) && !r)
>>>              drm_sched_job_recovery(&ring->sched);
>>> -           kthread_unpark(ring->sched.thread);
>>> -       }
>>> -   } else {
>>> -       for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>> -           struct amdgpu_ring *ring = adev->rings[i];
>>>  
>>> -           if (!ring || !ring->sched.thread)
>>> -               continue;
>>> -
>>> -           /* only focus on the ring hit timeout if &job not NULL */
>>> -           if (job && job->ring->idx != i)
>>> -               continue;
>>> -
>>> -           kthread_unpark(adev->rings[i]->sched.thread);
>>> -       }
>>> +       kthread_unpark(ring->sched.thread);
>>>      }
>>>  
>>>      if (amdgpu_device_has_dc_support(adev)) {
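
[Editor's note: for readers not familiar with the scheduler-park flow, below is a minimal, self-contained userspace sketch of the ordering the patch enforces: park every scheduler thread first, do the reset work while nothing else runs, then recover and unpark all rings. It uses plain pthreads; all names (model_ring, park_ring, gpu_recover, etc.) are invented for illustration and are not amdgpu code.]

/* Build with: cc -pthread sketch.c */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

#define MAX_RINGS 4

struct model_ring {
    pthread_t thread;
    pthread_mutex_t lock;
    pthread_cond_t cond;
    bool park_requested;   /* recovery path wants this scheduler idle */
    bool parked;           /* scheduler acknowledged the park request */
    int idx;
};

static struct model_ring rings[MAX_RINGS];

/* Stand-in for a ring's scheduler kthread: run "jobs" unless parked. */
static void *sched_main(void *arg)
{
    struct model_ring *ring = arg;

    for (int iter = 0; iter < 100; iter++) {
        pthread_mutex_lock(&ring->lock);
        while (ring->park_requested) {
            ring->parked = true;                    /* ack the park */
            pthread_cond_broadcast(&ring->cond);
            pthread_cond_wait(&ring->cond, &ring->lock);
        }
        ring->parked = false;
        pthread_mutex_unlock(&ring->lock);

        usleep(1000);                               /* "run" one job */
    }
    return NULL;
}

/* Like kthread_park(): returns only once the scheduler is idle. */
static void park_ring(struct model_ring *ring)
{
    pthread_mutex_lock(&ring->lock);
    ring->park_requested = true;
    while (!ring->parked)
        pthread_cond_wait(&ring->cond, &ring->lock);
    pthread_mutex_unlock(&ring->lock);
}

/* Like kthread_unpark(): let the scheduler resume. */
static void unpark_ring(struct model_ring *ring)
{
    pthread_mutex_lock(&ring->lock);
    ring->park_requested = false;
    pthread_cond_broadcast(&ring->cond);
    pthread_mutex_unlock(&ring->lock);
}

static void gpu_recover(int hung_ring)
{
    /* 1. Park every scheduler so nothing (e.g. shadow-VRAM recovery)
     *    can run in parallel with the reset. */
    for (int i = 0; i < MAX_RINGS; i++)
        park_ring(&rings[i]);

    /* 2. Reset only the hung ring's job, then do the (simulated) reset. */
    printf("resetting job on ring %d, then full reset\n", hung_ring);

    /* 3. Recover the affected scheduler(s) and unpark everything. */
    for (int i = 0; i < MAX_RINGS; i++)
        unpark_ring(&rings[i]);
}

int main(void)
{
    for (int i = 0; i < MAX_RINGS; i++) {
        rings[i].idx = i;
        pthread_mutex_init(&rings[i].lock, NULL);
        pthread_cond_init(&rings[i].cond, NULL);
        pthread_create(&rings[i].thread, NULL, sched_main, &rings[i]);
    }

    usleep(10000);
    gpu_recover(2);                                 /* pretend ring 2 timed out */

    for (int i = 0; i < MAX_RINGS; i++)
        pthread_join(rings[i].thread, NULL);
    return 0;
}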