> David Zhou had an use case which saw a >10% performance drop the last time he tried it. I really don't believe that, because if you insert a WAIT_MEM on an already signaled fence, it only cost GPU couple clocks to move on, right ? no reason to slow down up to 10% ... with 3dmark vulkan version test, the performance is barely different ... with my patch applied ... > When a reset happens we flush the VMIDs when re-submitting the jobs to the rings and while doing so we also always do a pipeline sync. I will check that point in my branch, I didn't use drm-next, maybe there is gap in this part /Monk -----Original Message----- From: Koenig, Christian Sent: Monday, November 5, 2018 3:02 AM To: Liu, Monk <Monk.Liu@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx Subject: Re: [PATCH 2/3] drm/amdgpu: drop the sched_sync > Can you tell me which game/benchmark will have performance drop with this fix by your understanding ? When you sync between submission things like composing X windows are slowed down massively. David Zhou had an use case which saw a >10% performance drop the last time he tried it. > The problem I hit is during the massive stress test against > multi-process + quark , if the quark process hang the engine while there is another two job following the bad job, After the TDR these two job will lose the explicit and the pipeline-sync was also lost. Well that is really strange. This workaround is only for a very specific Vulkan CTS test which we are still not 100% sure is actually valid. When a reset happens we flush the VMIDs when re-submitting the jobs to the rings and while doing so we also always do a pipeline sync. So you should never ever run into any issues in quark with that, even when we completely disable this workaround. Regards, Christian. Am 04.11.18 um 01:48 schrieb Liu, Monk: >> NAK, that would result in a severe performance drop. >> We need the fence here to determine if we actually need to do the pipeline sync or not. >> E.g. the explicit requested fence could already be signaled. > For the performance issue, only insert a WAIT_REG_MEM on GFX/compute ring *doesn't* give the "severe" drop (it's mimic in fact) ... At least I didn't observe any performance drop with 3dmark benchmark (also tested vulkan CTS), Can you tell me which game/benchmark will have performance drop with this fix by your understanding ? let me check it . > > The problem I hit is during the massive stress test against > multi-process + quark , if the quark process hang the engine while there is another two job following the bad job, After the TDR these two job will lose the explicit and the pipeline-sync was also lost. > > > BTW: for original logic, the pipeline sync have another corner case: > Assume JobC depend on JobA with explicit flag, and there is jobB inserted in ring: > > jobA -> jobB -> (pipe sync)JobC > > if JobA really cost a lot of time to finish, in the > amdgpu_ib_schedule() stage you will insert a pipeline sync for JobC against its explicit dependency which is JobA, but there is a JobB between A and C and the pipeline sync of before JobC will wrongly wait on the JobB ... > > while it is not a big issue but obviously not necessary: C have no > relation with B > > /Monk > > > > -----Original Message----- > From: Christian König <ckoenig.leichtzumerken@xxxxxxxxx> > Sent: Sunday, November 4, 2018 3:50 AM > To: Liu, Monk <Monk.Liu@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx > Subject: Re: [PATCH 2/3] drm/amdgpu: drop the sched_sync > > Am 03.11.18 um 06:33 schrieb Monk Liu: >> Reasons to drop it: >> >> 1) simplify the code: just introduce field member "need_pipe_sync" >> for job is good enough to tell if the explicit dependency fence need >> followed by a pipeline sync. >> >> 2) after GPU_recover the explicit fence from sched_syn will not come >> back so the required pipeline_sync following it is missed, consider >> scenario below: >>> now on ring buffer: >> Job-A -> pipe_sync -> Job-B >>> TDR occured on Job-A, and after GPU recover: >>> now on ring buffer: >> Job-A -> Job-B >> >> because the fence from sched_sync is used and freed after ib_schedule >> in first time, it will never come back, with this patch this issue >> could be avoided. > NAK, that would result in a severe performance drop. > > We need the fence here to determine if we actually need to do the pipeline sync or not. > > E.g. the explicit requested fence could already be signaled. > > Christian. > >> Signed-off-by: Monk Liu <Monk.Liu@xxxxxxx> >> --- >> drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c | 16 ++++++---------- >> drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 14 +++----------- >> drivers/gpu/drm/amd/amdgpu/amdgpu_job.h | 3 +-- >> 3 files changed, 10 insertions(+), 23 deletions(-) >> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c >> index c48207b3..ac7d2da 100644 >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c >> @@ -122,7 +122,6 @@ int amdgpu_ib_schedule(struct amdgpu_ring *ring, unsigned num_ibs, >> { >> struct amdgpu_device *adev = ring->adev; >> struct amdgpu_ib *ib = &ibs[0]; >> - struct dma_fence *tmp = NULL; >> bool skip_preamble, need_ctx_switch; >> unsigned patch_offset = ~0; >> struct amdgpu_vm *vm; >> @@ -166,16 +165,13 @@ int amdgpu_ib_schedule(struct amdgpu_ring *ring, unsigned num_ibs, >> } >> >> need_ctx_switch = ring->current_ctx != fence_ctx; >> - if (ring->funcs->emit_pipeline_sync && job && >> - ((tmp = amdgpu_sync_get_fence(&job->sched_sync, NULL)) || >> - (amdgpu_sriov_vf(adev) && need_ctx_switch) || >> - amdgpu_vm_need_pipeline_sync(ring, job))) { >> - need_pipe_sync = true; >> >> - if (tmp) >> - trace_amdgpu_ib_pipe_sync(job, tmp); >> - >> - dma_fence_put(tmp); >> + if (ring->funcs->emit_pipeline_sync && job) { >> + if ((need_ctx_switch && amdgpu_sriov_vf(adev)) || >> + amdgpu_vm_need_pipeline_sync(ring, job)) >> + need_pipe_sync = true; >> + else if (job->need_pipe_sync) >> + need_pipe_sync = true; >> } >> >> if (ring->funcs->insert_start) >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c >> index 1d71f8c..dae997d 100644 >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c >> @@ -71,7 +71,6 @@ int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs, >> (*job)->num_ibs = num_ibs; >> >> amdgpu_sync_create(&(*job)->sync); >> - amdgpu_sync_create(&(*job)->sched_sync); >> (*job)->vram_lost_counter = atomic_read(&adev->vram_lost_counter); >> (*job)->vm_pd_addr = AMDGPU_BO_INVALID_OFFSET; >> >> @@ -117,7 +116,6 @@ static void amdgpu_job_free_cb(struct drm_sched_job *s_job) >> amdgpu_ring_priority_put(ring, s_job->s_priority); >> dma_fence_put(job->fence); >> amdgpu_sync_free(&job->sync); >> - amdgpu_sync_free(&job->sched_sync); >> kfree(job); >> } >> >> @@ -127,7 +125,6 @@ void amdgpu_job_free(struct amdgpu_job *job) >> >> dma_fence_put(job->fence); >> amdgpu_sync_free(&job->sync); >> - amdgpu_sync_free(&job->sched_sync); >> kfree(job); >> } >> >> @@ -182,14 +179,9 @@ static struct dma_fence *amdgpu_job_dependency(struct drm_sched_job *sched_job, >> bool need_pipe_sync = false; >> int r; >> >> - fence = amdgpu_sync_get_fence(&job->sync, &need_pipe_sync); >> - if (fence && need_pipe_sync) { >> - if (drm_sched_dependency_optimized(fence, s_entity)) { >> - r = amdgpu_sync_fence(ring->adev, &job->sched_sync, >> - fence, false); >> - if (r) >> - DRM_ERROR("Error adding fence (%d)\n", r); >> - } >> + if (fence && need_pipe_sync && drm_sched_dependency_optimized(fence, s_entity)) { >> + trace_amdgpu_ib_pipe_sync(job, fence); >> + job->need_pipe_sync = true; >> } >> >> while (fence == NULL && vm && !job->vmid) { diff --git >> a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h >> index e1b46a6..c1d00f0 100644 >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h >> @@ -41,7 +41,6 @@ struct amdgpu_job { >> struct drm_sched_job base; >> struct amdgpu_vm *vm; >> struct amdgpu_sync sync; >> - struct amdgpu_sync sched_sync; >> struct amdgpu_ib *ibs; >> struct dma_fence *fence; /* the hw fence */ >> uint32_t preamble_status; >> @@ -59,7 +58,7 @@ struct amdgpu_job { >> /* user fence handling */ >> uint64_t uf_addr; >> uint64_t uf_sequence; >> - >> + bool need_pipe_sync; /* require a pipeline sync for this job */ >> }; >> >> int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs, _______________________________________________ amd-gfx mailing list amd-gfx@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/amd-gfx