Re: [PATCH 5/5] drm/amdgpu: Follow up change to previous drm scheduler change.

On 22.06.22 19:19, Andrey Grodzovsky wrote:

On 2022-06-22 03:17, Christian König wrote:
On 21.06.22 22:00, Andrey Grodzovsky wrote:

On 2022-06-21 03:28, Christian König wrote:
On 21.06.22 00:03, Andrey Grodzovsky wrote:
Align the refcount behaviour of the amdgpu_job embedded HW fence with
classic pointer-style HW fences by taking a reference each time emit
is called, so the amdgpu code no longer needs workarounds through
amdgpu_job.job_run_counter to keep the HW fence refcount balanced.

Could we now also remove job_run_counter?

Christian.


I am afraid not. The job counter is needed because the refcount on the embedded fence can never be allowed to drop to zero, since that would free the job itself before the end of its life cycle. We have to be able to differentiate, in amdgpu_fence_emit, between the first ever call, where we initialize the embedded fence's refcount from scratch using kref_init, and repeated calls, where the refcount is already > 0 and we just artificially increase it to align
the behaviour with pointer-style fences in other drivers.
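
(To illustrate the constraint being described here: a minimal, hypothetical sketch of why the embedded fence's refcount must stay above zero for the job's whole life cycle. The struct layout and release callback below are simplified placeholders, not the actual amdgpu definitions.)

    #include <linux/dma-fence.h>
    #include <linux/slab.h>
    #include <drm/gpu_scheduler.h>

    struct amdgpu_job_sketch {
    	struct drm_sched_job base;
    	struct dma_fence hw_fence;	/* embedded in the job, not a pointer */
    	u32 job_run_counter;		/* 0 on first run, > 0 on resubmits */
    };

    static void job_fence_release_sketch(struct dma_fence *f)
    {
    	struct amdgpu_job_sketch *job =
    		container_of(f, struct amdgpu_job_sketch, hw_fence);

    	/* The last dma_fence_put() on hw_fence frees the whole job, so
    	 * letting the refcount hit zero while the job is still on the
    	 * scheduler's pending list would be a use-after-free.
    	 */
    	kfree(job);
    }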

Well, what we should probably do instead is move the init out of emit.

The only downside I can see is that the sequence number isn't known at initial init time, and so would need to be zero or something like that.

Regards,
Christian.
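
(A minimal sketch of what this suggestion might look like, assuming the embedded fence is initialized once at job allocation with a placeholder seqno of 0. The function amdgpu_job_alloc_sketch and its arguments are hypothetical; only the dma_fence_init call mirrors the usage in the patch below.)

    static int amdgpu_job_alloc_sketch(struct amdgpu_device *adev,
    				   struct amdgpu_ring *ring,
    				   struct amdgpu_job **job)
    {
    	*job = kzalloc(sizeof(**job), GFP_KERNEL);
    	if (!*job)
    		return -ENOMEM;

    	/* The final sequence number is not known yet at init time, so
    	 * start at 0; amdgpu_fence_emit() would then only update
    	 * fence->seqno (and take a reference) on each emit.
    	 */
    	dma_fence_init(&(*job)->hw_fence, &amdgpu_job_fence_ops,
    		       &ring->fence_drv.lock,
    		       adev->fence_context + ring->idx, 0);
    	return 0;
    }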


Not sure how this helps. The problem is not there but in amdgpu_job_run: for an embedded fence, when a job from the pending list is resubmitted, amdgpu_job_run will be called twice, or even three times with the recheck-guilty-job sequence. I am supposed to do dma_fence_init on the embedded HW fence only on the first call, while on the second and third calls I only update the sequence number and increase the refcount. How can I differentiate between the first and subsequent calls without job_run_counter?

Yeah, good point. We should really stop re-submitting jobs altogether in the kernel and move that whole functionality into userspace.

Christian.


Andrey




I guess we could assume that the embedded fence is all zeroes before the first dma_fence_init, assuming the job itself was allocated with kzalloc, and so you could look at dma_fence_ops == NULL, or maybe seqno == 0, as a hint for whether this is the first call or not. But that is a risky assumption in my opinion.

Andrey
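
(A sketch of the heuristic just described, and of why it is risky: it only works if the job was allocated with kzalloc, so the check silently breaks if the allocation ever changes. The helper name is hypothetical.)

    static bool job_fence_needs_init_sketch(struct amdgpu_job *job)
    {
    	/* On a kzalloc'ed job, fence->ops is NULL (and seqno is 0)
    	 * only before the first dma_fence_init() call.
    	 */
    	return job->hw_fence.ops == NULL;
    }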




Also, since in the previous patch we resumed setting s_fence->parent to NULL
in drm_sched_stop, switch to directly checking whether job->hw_fence is
signaled, to short circuit the reset if it is already signaled.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@xxxxxxx>
Tested-by: Yiqing Yao <yiqing.yao@xxxxxxx>
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c |  2 ++
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 23 ++++++++++++++++------
  drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  |  7 ++++++-
  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  4 ----
  4 files changed, 25 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
index 513c57f839d8..447bd92c4856 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
@@ -684,6 +684,8 @@ int amdgpu_amdkfd_submit_ib(struct amdgpu_device *adev,
 		goto err_ib_sched;
 	}
 
+	/* Drop the initial kref_init count (see drm_sched_main as example) */
+	dma_fence_put(f);
 	ret = dma_fence_wait(f, false);
 
 err_ib_sched:
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index c99541685804..f9718119834f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5009,16 +5009,28 @@ static void amdgpu_device_recheck_guilty_jobs(
 		/* clear job's guilty and depend the folowing step to decide the real one */
 		drm_sched_reset_karma(s_job);
-		/* for the real bad job, it will be resubmitted twice, adding a dma_fence_get
-		 * to make sure fence is balanced */
-		dma_fence_get(s_job->s_fence->parent);
 		drm_sched_resubmit_jobs_ext(&ring->sched, 1);
+		if (!s_job->s_fence->parent) {
+			DRM_WARN("Failed to get a HW fence for job!");
+			continue;
+		}
+
 		ret = dma_fence_wait_timeout(s_job->s_fence->parent, false, ring->sched.timeout);
 		if (ret == 0) { /* timeout */
 			DRM_ERROR("Found the real bad job! ring:%s, job_id:%llx\n",
 						ring->sched.name, s_job->id);
+
+			/* Clear this failed job from fence array */
+			amdgpu_fence_driver_clear_job_fences(ring);
+
+			/* Since the job won't signal and we go for
+			 * another resubmit drop this parent pointer
+			 */
+			dma_fence_put(s_job->s_fence->parent);
+			s_job->s_fence->parent = NULL;
+
 			/* set guilty */
 			drm_sched_increase_karma(s_job);
 retry:
@@ -5047,7 +5059,6 @@ static void amdgpu_device_recheck_guilty_jobs(
 
 		/* got the hw fence, signal finished fence */
 		atomic_dec(ring->sched.score);
-		dma_fence_put(s_job->s_fence->parent);
 		dma_fence_get(&s_job->s_fence->finished);
 		dma_fence_signal(&s_job->s_fence->finished);
 		dma_fence_put(&s_job->s_fence->finished);
@@ -5220,8 +5231,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 	 *
 	 * job->base holds a reference to parent fence
 	 */
-	if (job && job->base.s_fence->parent &&
-	    dma_fence_is_signaled(job->base.s_fence->parent)) {
+	if (job && (job->hw_fence.ops != NULL) &&
+	    dma_fence_is_signaled(&job->hw_fence)) {
 		job_signaled = true;
 		dev_info(adev->dev, "Guilty job already signaled, skipping HW reset");
 		goto skip_hw_reset;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index d6d54ba4c185..9bd4e18212fc 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -164,11 +164,16 @@ int amdgpu_fence_emit(struct amdgpu_ring *ring, struct dma_fence **f, struct amd
 	if (job && job->job_run_counter) {
 		/* reinit seq for resubmitted jobs */
 		fence->seqno = seq;
+		/* To be in line with external fence creation and other drivers */
+		dma_fence_get(fence);
 	} else {
-		if (job)
+		if (job) {
 			dma_fence_init(fence, &amdgpu_job_fence_ops,
 				       &ring->fence_drv.lock,
 				       adev->fence_context + ring->idx, seq);
+			/* Against remove in amdgpu_job_{free, free_cb} */
+			dma_fence_get(fence);
+		}
 		else
 			dma_fence_init(fence, &amdgpu_fence_ops,
 				       &ring->fence_drv.lock,
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index 58568fdde2d0..638e1d600258 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -267,10 +267,6 @@ static struct dma_fence *amdgpu_job_run(struct drm_sched_job *sched_job)
 			DRM_ERROR("Error scheduling IBs (%d)\n", r);
 	}
 
-	if (!job->job_run_counter)
-		dma_fence_get(fence);
-	else if (finished->error < 0)
-		dma_fence_put(&job->hw_fence);
 	job->job_run_counter++;
 	amdgpu_job_free_resources(job);





