On 23.07.21 11:25, Jingwen Chen wrote:
On Fri Jul 23, 2021 at 10:45:49AM +0200, Christian König wrote:
On 23.07.21 09:07, Jingwen Chen wrote:
[SNIP]
Hi Christian,
The thing is that the VM flush fence has no job passed to amdgpu_fence_emit, so
going through the jobs cannot find the VM flush fence. And keeping the
rest of the fences in the RCU array leads to them being signaled in the ib_test
right after the ASIC reset, while they will be signaled again during
resubmission. What I'm doing here is just trying to clean up the fences
without a parent job and to make sure the remaining fences won't be signaled
twice.
It took me a moment to realize what you are talking about here.
This is for the KIQ! You need different handling for the KIQ than for the
scheduler-controlled rings.
It is not only the flush jobs; the KIQ needs a complete reset because of
the register writes pushed there as well.
And please drop any usage of DMA_FENCE_FLAG_USER_BITS. That is not something
which should be abused for reset handling.
The DMA_FENCE_FLAG_USER_BITS here is used to mark that this fence has a parent
job. If this is not a proper usage, do you have any suggestions
on how to identify whether a fence has a parent job?
You don't need to mark the fences at all. Everything on the KIQ ring needs
to be force completed since none of the fences on that ring have a parent
job.
Christian.
Hi Christian
Yes, the KIQ ring fences all need force_completion, but we do need to mark the
fences. Say we have a gfx ring job with vm_flush=1 sent to
amdgpu_ib_schedule; then in amdgpu_vm_flush, after
amdgpu_ring_emit_vm_flush is done, we will create a VM flush fence on the
gfx ring with amdgpu_fence_emit(ring, &fence, NULL, 0).
That is illegal and needs to be fixed as well.
Then this VM flush fence we create here has no parent job, but it is on the
gfx ring.
If we only do force_completion for the KIQ ring and just clear the
RCU array for the scheduler-controlled rings, nobody will signal and put this
gfx ring vm_flush fence again. When the job is resubmitted, it will
just create a new vm_flush fence. This is a memory leak, and I have seen
it during my tests.
At least I'm understanding the problem better now as well.
But you are just trying to work around the symptoms instead of fixing the root
cause.
See, any memory allocation during command submission must be prevented.
This also includes the flush fence.
Regards,
Christian.
Best Regards,
JingWen Chen
Best Regards,
JingWen Chen
Regards,
Christian.
Best Regards,
JingWen Chen
Andrey
+ }
/* after all hw jobs are reset, hw fence is meaningless, so force_completion */
amdgpu_fence_driver_force_completion(ring);
}
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index eecf21d8ec33..815776c9a013 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -156,11 +156,17 @@ int amdgpu_fence_emit(struct amdgpu_ring *ring, struct dma_fence **f, struct amd
job->ring = ring;
}
- seq = ++ring->fence_drv.sync_seq;
- dma_fence_init(fence, &amdgpu_fence_ops,
- &ring->fence_drv.lock,
- adev->fence_context + ring->idx,
- seq);
+ if (job != NULL && job->base.resubmit_flag == 1) {
+ /* reinit seq for resubmitted jobs */
+ seq = ++ring->fence_drv.sync_seq;
+ fence->seqno = seq;
+ } else {
+ seq = ++ring->fence_drv.sync_seq;
Seems like you could do the above line only once, above the if-else, as it was before.
Sure, I will modify this.
Best Regards,
JingWen Chen
+ dma_fence_init(fence, &amdgpu_fence_ops,
+ &ring->fence_drv.lock,
+ adev->fence_context + ring->idx,
+ seq);
+ }
if (job != NULL) {
/* mark this fence has a parent job */
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index 7c426e225b24..d6f848adc3f4 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -241,6 +241,7 @@ static struct dma_fence *amdgpu_job_run(struct drm_sched_job *sched_job)
dma_fence_set_error(finished, -ECANCELED); /* skip IB as well if VRAM lost */
if (finished->error < 0) {
+ dma_fence_put(&job->hw_fence);
DRM_INFO("Skip scheduling IBs!\n");
} else {
r = amdgpu_ib_schedule(ring, job->num_ibs, job->ibs, job,
@@ -249,7 +250,8 @@ static struct dma_fence *amdgpu_job_run(struct drm_sched_job *sched_job)
DRM_ERROR("Error scheduling IBs (%d)\n", r);
}
- dma_fence_get(fence);
+ if (!job->base.resubmit_flag)
+ dma_fence_get(fence);
amdgpu_job_free_resources(job);
fence = r ? ERR_PTR(r) : fence;
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index f4f474944169..5a36ab5aea2d 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -544,6 +544,7 @@ void drm_sched_resubmit_jobs_ext(struct drm_gpu_scheduler *sched, int max)
dma_fence_set_error(&s_fence->finished, -ECANCELED);
dma_fence_put(s_job->s_fence->parent);
+ s_job->resubmit_flag = 1;
fence = sched->ops->run_job(s_job);
i++;
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index 4ea8606d91fe..06c101af1f71 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -198,6 +198,7 @@ struct drm_sched_job {
enum drm_sched_priority s_priority;
struct drm_sched_entity *entity;
struct dma_fence_cb cb;
+ int resubmit_flag;
};
static inline bool drm_sched_invalidate_job(struct drm_sched_job *s_job,
_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/amd-gfx