How you solve the missing pipe-sync bug I illustrated ? actually original logic Is loose, I found this corner case during massive TDR test couple weeks ago, and it is very hard to catch ... /Monk -----Original Message----- From: Christian König <ckoenig.leichtzumerken@xxxxxxxxx> Sent: Sunday, November 4, 2018 3:50 AM To: Liu, Monk <Monk.Liu@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx Subject: Re: [PATCH 2/3] drm/amdgpu: drop the sched_sync Am 03.11.18 um 06:33 schrieb Monk Liu: > Reasons to drop it: > > 1) simplify the code: just introduce field member "need_pipe_sync" > for job is good enough to tell if the explicit dependency fence need > followed by a pipeline sync. > > 2) after GPU_recover the explicit fence from sched_syn will not come > back so the required pipeline_sync following it is missed, consider > scenario below: >> now on ring buffer: > Job-A -> pipe_sync -> Job-B >> TDR occured on Job-A, and after GPU recover: >> now on ring buffer: > Job-A -> Job-B > > because the fence from sched_sync is used and freed after ib_schedule > in first time, it will never come back, with this patch this issue > could be avoided. NAK, that would result in a severe performance drop. We need the fence here to determine if we actually need to do the pipeline sync or not. E.g. the explicit requested fence could already be signaled. Christian. > > Signed-off-by: Monk Liu <Monk.Liu@xxxxxxx> > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c | 16 ++++++---------- > drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 14 +++----------- > drivers/gpu/drm/amd/amdgpu/amdgpu_job.h | 3 +-- > 3 files changed, 10 insertions(+), 23 deletions(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c > index c48207b3..ac7d2da 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c > @@ -122,7 +122,6 @@ int amdgpu_ib_schedule(struct amdgpu_ring *ring, unsigned num_ibs, > { > struct amdgpu_device *adev = ring->adev; > struct amdgpu_ib *ib = &ibs[0]; > - struct dma_fence *tmp = NULL; > bool skip_preamble, need_ctx_switch; > unsigned patch_offset = ~0; > struct amdgpu_vm *vm; > @@ -166,16 +165,13 @@ int amdgpu_ib_schedule(struct amdgpu_ring *ring, unsigned num_ibs, > } > > need_ctx_switch = ring->current_ctx != fence_ctx; > - if (ring->funcs->emit_pipeline_sync && job && > - ((tmp = amdgpu_sync_get_fence(&job->sched_sync, NULL)) || > - (amdgpu_sriov_vf(adev) && need_ctx_switch) || > - amdgpu_vm_need_pipeline_sync(ring, job))) { > - need_pipe_sync = true; > > - if (tmp) > - trace_amdgpu_ib_pipe_sync(job, tmp); > - > - dma_fence_put(tmp); > + if (ring->funcs->emit_pipeline_sync && job) { > + if ((need_ctx_switch && amdgpu_sriov_vf(adev)) || > + amdgpu_vm_need_pipeline_sync(ring, job)) > + need_pipe_sync = true; > + else if (job->need_pipe_sync) > + need_pipe_sync = true; > } > > if (ring->funcs->insert_start) > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c > b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c > index 1d71f8c..dae997d 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c > @@ -71,7 +71,6 @@ int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs, > (*job)->num_ibs = num_ibs; > > amdgpu_sync_create(&(*job)->sync); > - amdgpu_sync_create(&(*job)->sched_sync); > (*job)->vram_lost_counter = atomic_read(&adev->vram_lost_counter); > (*job)->vm_pd_addr = AMDGPU_BO_INVALID_OFFSET; > > @@ -117,7 +116,6 @@ static void amdgpu_job_free_cb(struct drm_sched_job *s_job) > amdgpu_ring_priority_put(ring, s_job->s_priority); > dma_fence_put(job->fence); > amdgpu_sync_free(&job->sync); > - amdgpu_sync_free(&job->sched_sync); > kfree(job); > } > > @@ -127,7 +125,6 @@ void amdgpu_job_free(struct amdgpu_job *job) > > dma_fence_put(job->fence); > amdgpu_sync_free(&job->sync); > - amdgpu_sync_free(&job->sched_sync); > kfree(job); > } > > @@ -182,14 +179,9 @@ static struct dma_fence *amdgpu_job_dependency(struct drm_sched_job *sched_job, > bool need_pipe_sync = false; > int r; > > - fence = amdgpu_sync_get_fence(&job->sync, &need_pipe_sync); > - if (fence && need_pipe_sync) { > - if (drm_sched_dependency_optimized(fence, s_entity)) { > - r = amdgpu_sync_fence(ring->adev, &job->sched_sync, > - fence, false); > - if (r) > - DRM_ERROR("Error adding fence (%d)\n", r); > - } > + if (fence && need_pipe_sync && drm_sched_dependency_optimized(fence, s_entity)) { > + trace_amdgpu_ib_pipe_sync(job, fence); > + job->need_pipe_sync = true; > } > > while (fence == NULL && vm && !job->vmid) { diff --git > a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h > b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h > index e1b46a6..c1d00f0 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h > @@ -41,7 +41,6 @@ struct amdgpu_job { > struct drm_sched_job base; > struct amdgpu_vm *vm; > struct amdgpu_sync sync; > - struct amdgpu_sync sched_sync; > struct amdgpu_ib *ibs; > struct dma_fence *fence; /* the hw fence */ > uint32_t preamble_status; > @@ -59,7 +58,7 @@ struct amdgpu_job { > /* user fence handling */ > uint64_t uf_addr; > uint64_t uf_sequence; > - > + bool need_pipe_sync; /* require a pipeline sync for this job */ > }; > > int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs, _______________________________________________ amd-gfx mailing list amd-gfx@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/amd-gfx