On Fri, Jul 23, 2021 at 10:45:49AM +0200, Christian König wrote:
> Am 23.07.21 um 09:07 schrieb Jingwen Chen:
> > [SNIP]
> > Hi Christian,
> >
> > The thing is the vm flush fence has no job passed to amdgpu_fence_emit, so
> > going through the jobs cannot help find the vm flush fence. And keeping the
> > rest of the fences in the RCU array will lead to signaling them in the
> > ib_test right after ASIC reset, while they will be signaled again during
> > resubmission. What I'm doing here is just trying to clean up the fences
> > without a parent job and make sure the rest of the fences won't be
> > signaled twice.
>
> It took me a moment to realize what you are talking about here.
>
> This is for the KIQ! You need different handling for the KIQ than for the
> scheduler controlled rings.
>
> It is not only the flush jobs, but the KIQ needs a complete reset because of
> the register writes pushed there as well.
>
> > > And please drop any usage of DMA_FENCE_FLAG_USER_BITS. That is not
> > > something which should be abused for reset handling.
> > >
> > The DMA_FENCE_FLAG_USER_BITS here is to mark that the fence has a parent
> > job. If this is not a proper usage here, do you have any suggestions
> > about how to identify whether the fence has a parent job?
>
> You don't need to mark the fences at all. Everything on the KIQ ring needs
> to be force completed since none of the fences on that ring have a parent
> job.
>
> Christian.
>
Hi Christian,

Yes, KIQ ring fences all need force_completion. But we do need to mark the
fences. Say we have a gfx ring job with vm_flush=1 sent to
amdgpu_ib_schedule; then in amdgpu_vm_flush, after amdgpu_ring_emit_vm_flush
is done, we will create a vm flush fence on the gfx ring with
amdgpu_fence_emit(ring, &fence, NULL, 0).

The vm flush fence we create here has no parent job, but it lives on the gfx
ring. If we only do force_completion for the KIQ ring and just clear the RCU
array for the scheduler controlled rings, nobody will signal and put this
gfx ring vm_flush fence again. When the job is resubmitted, it will just
create a new vm_flush fence, so the old one is leaked. I have seen this
memleak during my tests.
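To make this concrete, the cleanup I have in mind looks roughly like the
sketch below. This is illustrative only: the helper name is a placeholder,
locking/RCU details are simplified, and it assumes amdgpu_fence_emit() sets
DMA_FENCE_FLAG_USER_BITS only when a parent job is present, as in the patch.

/* Sketch: force-complete only the fences without a parent job (e.g. VM
 * flush fences), since nobody will resubmit and signal them after reset.
 * Helper name is a placeholder, not part of the patch.
 */
static void amdgpu_fence_driver_clear_jobless_fences(struct amdgpu_ring *ring)
{
	struct amdgpu_fence_driver *drv = &ring->fence_drv;
	unsigned int i;

	for (i = 0; i <= drv->num_fences_mask; ++i) {
		struct dma_fence *fence =
			rcu_dereference_protected(drv->fences[i], true);

		if (!fence)
			continue;

		if (!test_bit(DMA_FENCE_FLAG_USER_BITS, &fence->flags)) {
			/* No parent job: signal the fence now and drop the
			 * reference held by the RCU array, otherwise it is
			 * leaked across the reset.
			 */
			dma_fence_signal(fence);
			dma_fence_put(fence);
			RCU_INIT_POINTER(drv->fences[i], NULL);
		}
	}
}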
Best Regards,
JingWen Chen

> >
> > Best Regards,
> > JingWen Chen
> > > Regards,
> > > Christian.

> > > >
> > > > Best Regards,
> > > > JingWen Chen
> > >
> > > Andrey
> > >
> > > > > > > > > +	}
> > > > > > > > >  	/* after all hw jobs are reset, hw fence is meaningless, so force_completion */
> > > > > > > > >  	amdgpu_fence_driver_force_completion(ring);
> > > > > > > > >  }
> > > > > > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> > > > > > > > > index eecf21d8ec33..815776c9a013 100644
> > > > > > > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> > > > > > > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> > > > > > > > > @@ -156,11 +156,17 @@ int amdgpu_fence_emit(struct amdgpu_ring *ring, struct dma_fence **f, struct amd
> > > > > > > > >  		job->ring = ring;
> > > > > > > > >  	}
> > > > > > > > > -	seq = ++ring->fence_drv.sync_seq;
> > > > > > > > > -	dma_fence_init(fence, &amdgpu_fence_ops,
> > > > > > > > > -		       &ring->fence_drv.lock,
> > > > > > > > > -		       adev->fence_context + ring->idx,
> > > > > > > > > -		       seq);
> > > > > > > > > +	if (job != NULL && job->base.resubmit_flag == 1) {
> > > > > > > > > +		/* reinit seq for resubmitted jobs */
> > > > > > > > > +		seq = ++ring->fence_drv.sync_seq;
> > > > > > > > > +		fence->seqno = seq;
> > > > > > > > > +	} else {
> > > > > > > > > +		seq = ++ring->fence_drv.sync_seq;
> > > > > > > >
> > > > > > > > Seems like you could do the above line only once above if-else as it was
> > > > > > > > before
> > > > > > >
> > > > > > > Sure, I will modify this.
> > > > > > >
> > > > > > > Best Regards,
> > > > > > > JingWen Chen
> > > > > > >
> > > > > > > > > +		dma_fence_init(fence, &amdgpu_fence_ops,
> > > > > > > > > +			       &ring->fence_drv.lock,
> > > > > > > > > +			       adev->fence_context + ring->idx,
> > > > > > > > > +			       seq);
> > > > > > > > > +	}
> > > > > > > > >  	if (job != NULL) {
> > > > > > > > >  		/* mark this fence has a parent job */
> > > > > > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > > > > > > > > index 7c426e225b24..d6f848adc3f4 100644
> > > > > > > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > > > > > > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > > > > > > > > @@ -241,6 +241,7 @@ static struct dma_fence *amdgpu_job_run(struct drm_sched_job *sched_job)
> > > > > > > > >  		dma_fence_set_error(finished, -ECANCELED); /* skip IB as well if VRAM lost */
> > > > > > > > >  	if (finished->error < 0) {
> > > > > > > > > +		dma_fence_put(&job->hw_fence);
> > > > > > > > >  		DRM_INFO("Skip scheduling IBs!\n");
> > > > > > > > >  	} else {
> > > > > > > > >  		r = amdgpu_ib_schedule(ring, job->num_ibs, job->ibs, job,
> > > > > > > > > @@ -249,7 +250,8 @@ static struct dma_fence *amdgpu_job_run(struct drm_sched_job *sched_job)
> > > > > > > > >  			DRM_ERROR("Error scheduling IBs (%d)\n", r);
> > > > > > > > >  	}
> > > > > > > > > -	dma_fence_get(fence);
> > > > > > > > > +	if (!job->base.resubmit_flag)
> > > > > > > > > +		dma_fence_get(fence);
> > > > > > > > >  	amdgpu_job_free_resources(job);
> > > > > > > > >  	fence = r ? ERR_PTR(r) : fence;
> > > > > > > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > > > index f4f474944169..5a36ab5aea2d 100644
> > > > > > > > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > > > @@ -544,6 +544,7 @@ void drm_sched_resubmit_jobs_ext(struct drm_gpu_scheduler *sched, int max)
> > > > > > > > >  		dma_fence_set_error(&s_fence->finished, -ECANCELED);
> > > > > > > > >  		dma_fence_put(s_job->s_fence->parent);
> > > > > > > > > +		s_job->resubmit_flag = 1;
> > > > > > > > >  		fence = sched->ops->run_job(s_job);
> > > > > > > > >  		i++;
> > > > > > > > > diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> > > > > > > > > index 4ea8606d91fe..06c101af1f71 100644
> > > > > > > > > --- a/include/drm/gpu_scheduler.h
> > > > > > > > > +++ b/include/drm/gpu_scheduler.h
> > > > > > > > > @@ -198,6 +198,7 @@ struct drm_sched_job {
> > > > > > > > >  	enum drm_sched_priority s_priority;
> > > > > > > > >  	struct drm_sched_entity *entity;
> > > > > > > > >  	struct dma_fence_cb cb;
> > > > > > > > > +	int resubmit_flag;
> > > > > > > > >  };
> > > > > > > > >
> > > > > > > > >  static inline bool drm_sched_invalidate_job(struct drm_sched_job *s_job,
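For what it's worth, with the review comment above applied, the
amdgpu_fence_emit() hunk would collapse to something like the sketch below
(untested, just to show the shape with the seqno increment hoisted out of
the if/else):

	seq = ++ring->fence_drv.sync_seq;
	if (job != NULL && job->base.resubmit_flag == 1) {
		/* reinit seq for resubmitted jobs: the embedded hw fence is
		 * reused, so only the sequence number needs refreshing */
		fence->seqno = seq;
	} else {
		dma_fence_init(fence, &amdgpu_fence_ops,
			       &ring->fence_drv.lock,
			       adev->fence_context + ring->idx,
			       seq);
	}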