> -----Original Message-----
> From: Koenig, Christian
> Sent: Monday, November 05, 2018 3:48 PM
> To: Liu, Monk <Monk.Liu@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Zhou, David(ChunMing) <David1.Zhou@xxxxxxx>
> Subject: Re: [PATCH 2/3] drm/amdgpu: drop the sched_sync
>
> On 05.11.18 at 08:24, Liu, Monk wrote:
> >> David Zhou had a use case which saw a >10% performance drop the last time he tried it.
> > I really don't believe that, because if you insert a WAIT_MEM on an already signaled fence, it only costs the GPU a couple of clocks to move on, right? No reason for a slowdown of up to 10% ... with the 3dmark Vulkan test the performance is barely different with my patch applied ...
>
> Why do you think that we insert a WAIT_MEM on an already signaled fence?
> The pipeline sync always waits for the last fence value (because we can't handle wraparounds in PM4).
>
> So you have a pipeline sync when you don't need one, and that is really, really bad for things shared between processes, e.g. X/Wayland and its clients.
>
> I also expect that this doesn't affect 3dmark at all, but everything that runs in a window composed by X could be slowed down massively.
>
> David, do you remember which use case was affected when you tried to drop this optimization?

That was a long time ago. I remember Andrey also tried to remove sched_sync before, but he eventually kept it, right?

From Monk's patch it seems he doesn't change the main logic, he just moves the sched_sync handling to job->need_pipe_sync. But I can at least see one small difference: the job's dependency handling evaluates the fence into sched_sync, but that fence could already be signaled by the time amdgpu_ib_schedule runs, in which case no pipeline sync needs to be inserted.

Anyway, this is a sensitive path; we should change it carefully and give it wide testing.

Regards,
David Zhou

>
> >> When a reset happens we flush the VMIDs when re-submitting the jobs to the rings, and while doing so we also always do a pipeline sync.
> > I will check that point in my branch. I didn't use drm-next, maybe there is a gap in this part.
>
> We have had that logic for a very long time now, but we recently simplified it. Could be that there was a bug introduced doing so.
>
> Maybe we should add a specific flag to run_job() to note that we are re-running a job and then always add VM flushes/pipeline syncs?
>
> But my main question is why do you see any issues with quark? That is a workaround for an issue in Vulkan sync handling and should only surface when a specific test is run many, many times.
>
> Regards,
> Christian.
>
> > /Monk
> >
> > -----Original Message-----
> > From: Koenig, Christian
> > Sent: Monday, November 5, 2018 3:02 AM
> > To: Liu, Monk <Monk.Liu@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> > Subject: Re: [PATCH 2/3] drm/amdgpu: drop the sched_sync
> >
> >> Can you tell me which game/benchmark will have a performance drop with this fix, by your understanding?
> > When you sync between submissions, things like composing X windows are slowed down massively.
> >
> > David Zhou had a use case which saw a >10% performance drop the last time he tried it.
> >
> >> The problem I hit is during a massive stress test against multi-process + quark: if the quark process hangs the engine while there are another two jobs following the bad job, then after the TDR those two jobs will lose the explicit fence and the pipeline sync is also lost.
> > Well, that is really strange. This workaround is only for a very specific Vulkan CTS test which we are still not 100% sure is actually valid.
> >
> > When a reset happens we flush the VMIDs when re-submitting the jobs to the rings, and while doing so we also always do a pipeline sync.
> >
> > So you should never ever run into any issues in quark with that, even when we completely disable this workaround.
> >
> > Regards,
> > Christian.
> >
> > On 04.11.18 at 01:48, Liu, Monk wrote:
> >>> NAK, that would result in a severe performance drop.
> >>> We need the fence here to determine if we actually need to do the pipeline sync or not.
> >>> E.g. the explicitly requested fence could already be signaled.
> >> For the performance issue, only inserting a WAIT_REG_MEM on the GFX/compute ring *doesn't* give the "severe" drop (it's minimal in fact) ... At least I didn't observe any performance drop with the 3dmark benchmark (also tested Vulkan CTS). Can you tell me which game/benchmark will have a performance drop with this fix, by your understanding? Let me check it.
> >>
> >> The problem I hit is during a massive stress test against multi-process + quark: if the quark process hangs the engine while there are another two jobs following the bad job, then after the TDR those two jobs will lose the explicit fence and the pipeline sync is also lost.
> >>
> >> BTW: with the original logic, the pipeline sync has another corner case.
> >> Assume JobC depends on JobA with the explicit flag, and there is a JobB inserted in the ring:
> >>
> >> JobA -> JobB -> (pipe sync) JobC
> >>
> >> If JobA really takes a long time to finish, in the amdgpu_ib_schedule() stage you will insert a pipeline sync for JobC against its explicit dependency, which is JobA; but there is a JobB between A and C, and the pipeline sync before JobC will wrongly wait on JobB ...
> >>
> >> It is not a big issue, but obviously not necessary: C has no relation with B.
> >>
> >> /Monk
> >>
> >> -----Original Message-----
> >> From: Christian König <ckoenig.leichtzumerken@xxxxxxxxx>
> >> Sent: Sunday, November 4, 2018 3:50 AM
> >> To: Liu, Monk <Monk.Liu@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> >> Subject: Re: [PATCH 2/3] drm/amdgpu: drop the sched_sync
> >>
> >> On 03.11.18 at 06:33, Monk Liu wrote:
> >>> Reasons to drop it:
> >>>
> >>> 1) Simplify the code: introducing the field member "need_pipe_sync" for the job is good enough to tell whether the explicit dependency fence needs to be followed by a pipeline sync.
> >>>
> >>> 2) After GPU recover the explicit fence from sched_sync will not come back, so the required pipeline sync following it is missed. Consider the scenario below:
> >>>> now on ring buffer:
> >>> Job-A -> pipe_sync -> Job-B
> >>>> TDR occurred on Job-A, and after GPU recover:
> >>>> now on ring buffer:
> >>> Job-A -> Job-B
> >>>
> >>> Because the fence from sched_sync is used and freed after ib_schedule the first time, it will never come back; with this patch this issue can be avoided.
> >> NAK, that would result in a severe performance drop.
> >>
> >> We need the fence here to determine if we actually need to do the pipeline sync or not.
> >>
> >> E.g. the explicitly requested fence could already be signaled.
> >>
> >> Christian.
> >>
> >>> Signed-off-by: Monk Liu <Monk.Liu@xxxxxxx>
> >>> ---
> >>>  drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c  | 16 ++++++----------
> >>>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 14 +++-----------
> >>>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.h |  3 +--
> >>>  3 files changed, 10 insertions(+), 23 deletions(-)
> >>>
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
> >>> index c48207b3..ac7d2da 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
> >>> @@ -122,7 +122,6 @@ int amdgpu_ib_schedule(struct amdgpu_ring *ring, unsigned num_ibs,
> >>>  {
> >>>  	struct amdgpu_device *adev = ring->adev;
> >>>  	struct amdgpu_ib *ib = &ibs[0];
> >>> -	struct dma_fence *tmp = NULL;
> >>>  	bool skip_preamble, need_ctx_switch;
> >>>  	unsigned patch_offset = ~0;
> >>>  	struct amdgpu_vm *vm;
> >>> @@ -166,16 +165,13 @@ int amdgpu_ib_schedule(struct amdgpu_ring *ring, unsigned num_ibs,
> >>>  	}
> >>>
> >>>  	need_ctx_switch = ring->current_ctx != fence_ctx;
> >>> -	if (ring->funcs->emit_pipeline_sync && job &&
> >>> -	    ((tmp = amdgpu_sync_get_fence(&job->sched_sync, NULL)) ||
> >>> -	     (amdgpu_sriov_vf(adev) && need_ctx_switch) ||
> >>> -	     amdgpu_vm_need_pipeline_sync(ring, job))) {
> >>> -		need_pipe_sync = true;
> >>> -
> >>> -		if (tmp)
> >>> -			trace_amdgpu_ib_pipe_sync(job, tmp);
> >>> -
> >>> -		dma_fence_put(tmp);
> >>> +	if (ring->funcs->emit_pipeline_sync && job) {
> >>> +		if ((need_ctx_switch && amdgpu_sriov_vf(adev)) ||
> >>> +		    amdgpu_vm_need_pipeline_sync(ring, job))
> >>> +			need_pipe_sync = true;
> >>> +		else if (job->need_pipe_sync)
> >>> +			need_pipe_sync = true;
> >>>  	}
> >>>
> >>>  	if (ring->funcs->insert_start)
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> >>> index 1d71f8c..dae997d 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> >>> @@ -71,7 +71,6 @@ int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
> >>>  	(*job)->num_ibs = num_ibs;
> >>>
> >>>  	amdgpu_sync_create(&(*job)->sync);
> >>> -	amdgpu_sync_create(&(*job)->sched_sync);
> >>>  	(*job)->vram_lost_counter = atomic_read(&adev->vram_lost_counter);
> >>>  	(*job)->vm_pd_addr = AMDGPU_BO_INVALID_OFFSET;
> >>>
> >>> @@ -117,7 +116,6 @@ static void amdgpu_job_free_cb(struct drm_sched_job *s_job)
> >>>  	amdgpu_ring_priority_put(ring, s_job->s_priority);
> >>>  	dma_fence_put(job->fence);
> >>>  	amdgpu_sync_free(&job->sync);
> >>> -	amdgpu_sync_free(&job->sched_sync);
> >>>  	kfree(job);
> >>>  }
> >>>
> >>> @@ -127,7 +125,6 @@ void amdgpu_job_free(struct amdgpu_job *job)
> >>>
> >>>  	dma_fence_put(job->fence);
> >>>  	amdgpu_sync_free(&job->sync);
> >>> -	amdgpu_sync_free(&job->sched_sync);
> >>>  	kfree(job);
> >>>  }
> >>>
> >>> @@ -182,14 +179,9 @@ static struct dma_fence *amdgpu_job_dependency(struct drm_sched_job *sched_job,
> >>>  	bool need_pipe_sync = false;
> >>>  	int r;
> >>>
> >>> -	fence = amdgpu_sync_get_fence(&job->sync, &need_pipe_sync);
> >>> -	if (fence && need_pipe_sync) {
> >>> -		if (drm_sched_dependency_optimized(fence, s_entity)) {
> >>> -			r = amdgpu_sync_fence(ring->adev, &job->sched_sync,
> >>> -					      fence, false);
> >>> -			if (r)
> >>> -				DRM_ERROR("Error adding fence (%d)\n", r);
> >>> -		}
> >>> +	if (fence && need_pipe_sync && drm_sched_dependency_optimized(fence, s_entity)) {
> >>> +		trace_amdgpu_ib_pipe_sync(job, fence);
> >>> +		job->need_pipe_sync = true;
> >>>  	}
> >>>
> >>>  	while (fence == NULL && vm && !job->vmid) {
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
> >>> index e1b46a6..c1d00f0 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
> >>> @@ -41,7 +41,6 @@ struct amdgpu_job {
> >>>  	struct drm_sched_job	base;
> >>>  	struct amdgpu_vm	*vm;
> >>>  	struct amdgpu_sync	sync;
> >>> -	struct amdgpu_sync	sched_sync;
> >>>  	struct amdgpu_ib	*ibs;
> >>>  	struct dma_fence	*fence; /* the hw fence */
> >>>  	uint32_t		preamble_status;
> >>> @@ -59,7 +58,7 @@ struct amdgpu_job {
> >>>  	/* user fence handling */
> >>>  	uint64_t		uf_addr;
> >>>  	uint64_t		uf_sequence;
> >>> -
> >>> +	bool			need_pipe_sync; /* require a pipeline sync for this job */
> >>>  };
> >>>
> >>>  int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,

_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
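
Editor's aside, not part of the thread or of the patch above: the disagreement comes down to when the "do we still need a pipeline sync?" decision is made. The sketch below contrasts the two approaches under discussion; amdgpu_ib_wants_pipe_sync() is a hypothetical helper invented here for illustration, while dma_fence_is_signaled() is the existing kernel API the fence-based check would rely on.

```c
#include <linux/dma-fence.h>

/*
 * Illustrative sketch only (hypothetical helper, not an amdgpu function):
 * decide at submission time whether a pipeline sync is still worthwhile.
 */
static bool amdgpu_ib_wants_pipe_sync(bool need_pipe_sync_flag,
				      struct dma_fence *explicit_dep)
{
	/*
	 * Flag-based approach (the patch): the decision was frozen when the
	 * dependency was resolved, so the sync is emitted even if the
	 * dependency fence signaled long before the job hits the ring.
	 */
	if (!explicit_dep)
		return need_pipe_sync_flag;

	/*
	 * Fence-based approach (the existing sched_sync behaviour): skip the
	 * pipeline sync if the dependency has already signaled by the time
	 * the job is pushed to the ring, avoiding a needless stall of
	 * everything queued behind it, including other processes' work.
	 */
	return !dma_fence_is_signaled(explicit_dep);
}
```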