> The question is how much later that is done. My recollection is that we don't reset that for resubmission, but that could be wrong.

According to my code (drm-next) the VMID's "current_gpu_reset_count" is updated in the "vm_flush" routine, so it is always updated whether the job is a resubmission or not ... See the code in vm_flush():

	if (vm_flush_needed) {
		mutex_lock(&id_mgr->lock);
		dma_fence_put(id->last_flush);
		id->last_flush = dma_fence_get(fence);
		id->current_gpu_reset_count = atomic_read(&adev->gpu_reset_counter);
		mutex_unlock(&id_mgr->lock);
	}

/Monk

-----Original Message-----
From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf Of Liu, Monk
Sent: Monday, November 5, 2018 10:22 PM
To: Koenig, Christian <Christian.Koenig@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Zhou, David(ChunMing) <David1.Zhou@xxxxxxx>
Subject: RE: [PATCH 2/3] drm/amdgpu: drop the sched_sync

> Anyway I think the cleanest approach to always handle that correctly would be to always insert a vm flush before all jobs on resubmission. That is most likely better for VM flush handling as well.

Yeah, that's true and simpler

/Monk

-----Original Message-----
From: Koenig, Christian
Sent: Monday, November 5, 2018 9:59 PM
To: Liu, Monk <Monk.Liu@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Zhou, David(ChunMing) <David1.Zhou@xxxxxxx>
Subject: Re: [PATCH 2/3] drm/amdgpu: drop the sched_sync

> and later its VMID's "current_gpu_reset_count" is updated to "adev->gpu_reset_counter"

The question is how much later that is done. My recollection is that we don't reset that for resubmission, but that could be wrong.

Anyway I think the cleanest approach to always handle that correctly would be to always insert a vm flush before all jobs on resubmission. That is most likely better for VM flush handling as well.

Christian.
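For reference, the staleness check being debated here boils down to something like the following minimal sketch. The helper name is illustrative rather than the driver's own (the driver carries an equivalent test in its VMID management code); the field names follow the vm_flush() snippet quoted above. A VMID whose saved reset count no longer matches the device counter was last flushed before the most recent GPU reset, so its cached state must be treated as stale:

	static bool vmid_stale_after_reset(struct amdgpu_device *adev,
					   struct amdgpu_vmid *id)
	{
		/* current_gpu_reset_count is saved under id_mgr->lock in
		 * vm_flush(); if a reset has bumped adev->gpu_reset_counter
		 * since then, a vm flush must be emitted for this VMID on
		 * the next submission. */
		return id->current_gpu_reset_count !=
		       atomic_read(&adev->gpu_reset_counter);
	}

This is why forcing a vm flush before all jobs on resubmission, as proposed above, would also restore the pipeline sync that goes with the flush.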
Am 05.11.18 um 14:41 schrieb Liu, Monk:
> Hi Christian
>
> For this scenario: Bad Job (hang, vmid1) --> Job A (context 10, explicit dep for Job B, vmid2) --> Job B (context 10, vmid2) --> Job C (context 11, vmid3)
>
> Assume "job_hang_limit" is 0 and "sched_hw_submission" is 4. I gave the logic after GPU reset a second thought:
>
> 1) the bad job would be marked guilty and skipped by the scheduler,
> 2) the first re-submitted job (Job A) would be forced with a pipeline sync,
> 3) the first re-submitted job (Job A) would be forced with a vm flush, and later its VMID's "current_gpu_reset_count" is updated to "adev->gpu_reset_counter",
> 4) the second re-submitted job (Job B; assume it is from the same context as Job A, shares the same page table/process, and needs no vm_update) would be forced with neither a pipeline sync nor a vm flush ...
>
> Thus if Job B has an explicit dep on Job A, that explicit dep gets lost and there will be no pipeline sync inserted prior to Job B ...
>
> Do you think that's a possible corner case?
>
> /Monk
>
> -----Original Message-----
> From: Koenig, Christian
> Sent: Monday, November 5, 2018 3:48 PM
> To: Liu, Monk <Monk.Liu@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Zhou, David(ChunMing) <David1.Zhou@xxxxxxx>
> Subject: Re: [PATCH 2/3] drm/amdgpu: drop the sched_sync
>
> Am 05.11.18 um 08:24 schrieb Liu, Monk:
>>> David Zhou had a use case which saw a >10% performance drop the last time he tried it.
>> I really don't believe that, because if you insert a WAIT_MEM on an already signaled fence it only costs the GPU a couple of clocks to move on, right? No reason to slow down by up to 10% ... With the 3dmark Vulkan test the performance is barely different ... with my patch applied ...
>
> Why do you think that we insert a WAIT_MEM on an already signaled fence?
> The pipeline sync always waits for the last fence value (because we can't handle wraparounds in PM4).
>
> So you have a pipeline sync when you don't need one, and that is really really bad for things shared between processes, e.g. X/Wayland and its clients.
>
> I also expect that this doesn't affect 3dmark at all, but everything which runs in a window composed by X could be slowed down massively.
>
> David, do you remember which use case was affected when you tried to drop this optimization?
>
>>> When a reset happens we flush the VMIDs when re-submitting the jobs to the rings, and while doing so we also always do a pipeline sync.
>> I will check that point in my branch; I didn't use drm-next, maybe there is a gap in this part
>
> We have had that logic for a very long time now, but we recently simplified it. Could be that a bug was introduced doing so.
>
> Maybe we should add a specific flag to run_job() to note that we are re-running a job and then always add VM flushes/pipeline syncs?
>
> But my main question is why you see any issues with quark at all? That is a workaround for an issue in Vulkan sync handling and should only surface when a specific test is run many many times.
>
> Regards,
> Christian.
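For reference, a rough sketch of what "always waits for the last fence value" means at the ring level, paraphrased from the gfx ring code of that era (wait_reg_mem() here stands in for the ASIC-specific PM4 WAIT_REG_MEM emit helper; packet details vary per generation). The wait targets ring->fence_drv.sync_seq, the newest emitted fence value, not the specific dependency:

	static void emit_pipeline_sync(struct amdgpu_ring *ring)
	{
		/* GFX rings wait on the PFP, compute rings on the ME */
		int usepfp = (ring->funcs->type == AMDGPU_RING_TYPE_GFX);
		uint32_t seq = ring->fence_drv.sync_seq;   /* newest emitted fence value */
		uint64_t addr = ring->fence_drv.gpu_addr;  /* where fence values land */

		/* Poll the 32-bit value at addr until it reaches seq.  Only a
		 * masked 32-bit compare is available in PM4, so a
		 * wraparound-safe wait on an older, specific sequence number
		 * cannot be expressed -- hence waiting on the last one. */
		wait_reg_mem(ring, usepfp, lower_32_bits(addr),
			     upper_32_bits(addr), seq, 0xffffffff);
	}

The consequence is that even when the dependency fence has long signaled, a submission that carries this sync stalls until everything previously emitted on the ring has completed, which is what hurts cross-process cases like a compositor.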
>> /Monk
>> -----Original Message-----
>> From: Koenig, Christian
>> Sent: Monday, November 5, 2018 3:02 AM
>> To: Liu, Monk <Monk.Liu@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>> Subject: Re: [PATCH 2/3] drm/amdgpu: drop the sched_sync
>>
>>> Can you tell me which game/benchmark will see a performance drop with this fix, by your understanding?
>> When you sync between submissions, things like composing X windows are slowed down massively.
>>
>> David Zhou had a use case which saw a >10% performance drop the last time he tried it.
>>
>>> The problem I hit is during the massive stress test against multi-process + quark: if the quark process hangs the engine while there are another two jobs following the bad job, after the TDR these two jobs will lose the explicit dependency and the pipeline sync is also lost.
>> Well that is really strange. This workaround is only for a very specific Vulkan CTS test which we are still not 100% sure is actually valid.
>>
>> When a reset happens we flush the VMIDs when re-submitting the jobs to the rings, and while doing so we also always do a pipeline sync.
>>
>> So you should never ever run into any issues in quark with that, even when we completely disable this workaround.
>>
>> Regards,
>> Christian.
>>
>> Am 04.11.18 um 01:48 schrieb Liu, Monk:
>>>> NAK, that would result in a severe performance drop.
>>>> We need the fence here to determine if we actually need to do the pipeline sync or not.
>>>> E.g. the explicitly requested fence could already be signaled.
>>> For the performance issue, only inserting a WAIT_REG_MEM on the GFX/compute ring *doesn't* give a "severe" drop (it's minimal in fact) ... At least I didn't observe any performance drop with the 3dmark benchmark (I also tested the Vulkan CTS). Can you tell me which game/benchmark will see a performance drop with this fix, by your understanding? Let me check it.
>>>
>>> The problem I hit is during the massive stress test against multi-process + quark: if the quark process hangs the engine while there are another two jobs following the bad job, after the TDR these two jobs will lose the explicit dependency and the pipeline sync is also lost.
>>>
>>>
>>> BTW: with the original logic, the pipeline sync has another corner case:
>>> Assume JobC depends on JobA with the explicit flag, and there is a JobB inserted in the ring:
>>>
>>> jobA -> jobB -> (pipe sync)JobC
>>>
>>> If JobA really takes a long time to finish, in the amdgpu_ib_schedule() stage you will insert a pipeline sync for JobC against its explicit dependency, which is JobA, but there is a JobB between A and C, and the pipeline sync before JobC will wrongly wait on JobB ...
>>>
>>> While it is not a big issue, it is obviously not necessary: C has no relation with B.
>>>
>>> /Monk
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Christian König <ckoenig.leichtzumerken@xxxxxxxxx>
>>> Sent: Sunday, November 4, 2018 3:50 AM
>>> To: Liu, Monk <Monk.Liu@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>> Subject: Re: [PATCH 2/3] drm/amdgpu: drop the sched_sync
>>>
>>> Am 03.11.18 um 06:33 schrieb Monk Liu:
>>>> Reasons to drop it:
>>>>
>>>> 1) simplify the code: just introducing the field member "need_pipe_sync"
>>>> for the job is good enough to tell if the explicit dependency fence
>>>> needs to be followed by a pipeline sync.
>>>>
>>>> 2) after gpu_recover the explicit fence from sched_sync will not
>>>> come back, so the required pipeline sync following it is missed;
>>>> consider the scenario below:
>>>>> now on ring buffer:
>>>> Job-A -> pipe_sync -> Job-B
>>>>> TDR occurred on Job-A, and after GPU recover:
>>>>> now on ring buffer:
>>>> Job-A -> Job-B
>>>>
>>>> because the fence from sched_sync is used and freed after
>>>> ib_schedule the first time, it will never come back; with this
>>>> patch this issue can be avoided.
>>> NAK, that would result in a severe performance drop.
>>>
>>> We need the fence here to determine if we actually need to do the pipeline sync or not.
>>>
>>> E.g. the explicitly requested fence could already be signaled.
>>>
>>> Christian.
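For reference, the optimization the NAK defends boils down to a check like the following at IB-schedule time. This is a sketch under the assumption that the job still holds the explicit fence when the IB is scheduled; the helper name is ours, while dma_fence_is_signaled() is the standard kernel primitive:

	#include <linux/dma-fence.h>

	/* Emit the pipeline sync only while the explicitly requested fence
	 * is still pending; once it has signaled, the WAIT_REG_MEM would
	 * stall the ring on the newest fence value for nothing. */
	static bool pipeline_sync_still_needed(struct dma_fence *explicit_fence)
	{
		return explicit_fence && !dma_fence_is_signaled(explicit_fence);
	}

Dropping sched_sync removes the fence that makes this late check possible, which is the core of the objection.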
>>>
>>>> Signed-off-by: Monk Liu <Monk.Liu@xxxxxxx>
>>>> ---
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c  | 16 ++++++----------
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 14 +++-----------
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.h |  3 +--
>>>>   3 files changed, 10 insertions(+), 23 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
>>>> index c48207b3..ac7d2da 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ib.c
>>>> @@ -122,7 +122,6 @@ int amdgpu_ib_schedule(struct amdgpu_ring *ring, unsigned num_ibs,
>>>>  {
>>>>  	struct amdgpu_device *adev = ring->adev;
>>>>  	struct amdgpu_ib *ib = &ibs[0];
>>>> -	struct dma_fence *tmp = NULL;
>>>>  	bool skip_preamble, need_ctx_switch;
>>>>  	unsigned patch_offset = ~0;
>>>>  	struct amdgpu_vm *vm;
>>>> @@ -166,16 +165,13 @@ int amdgpu_ib_schedule(struct amdgpu_ring *ring, unsigned num_ibs,
>>>>  	}
>>>>
>>>>  	need_ctx_switch = ring->current_ctx != fence_ctx;
>>>> -	if (ring->funcs->emit_pipeline_sync && job &&
>>>> -	    ((tmp = amdgpu_sync_get_fence(&job->sched_sync, NULL)) ||
>>>> -	     (amdgpu_sriov_vf(adev) && need_ctx_switch) ||
>>>> -	     amdgpu_vm_need_pipeline_sync(ring, job))) {
>>>> -		need_pipe_sync = true;
>>>>
>>>> -		if (tmp)
>>>> -			trace_amdgpu_ib_pipe_sync(job, tmp);
>>>> -
>>>> -		dma_fence_put(tmp);
>>>> +	if (ring->funcs->emit_pipeline_sync && job) {
>>>> +		if ((need_ctx_switch && amdgpu_sriov_vf(adev)) ||
>>>> +		    amdgpu_vm_need_pipeline_sync(ring, job))
>>>> +			need_pipe_sync = true;
>>>> +		else if (job->need_pipe_sync)
>>>> +			need_pipe_sync = true;
>>>>  	}
>>>>
>>>>  	if (ring->funcs->insert_start)
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>> index 1d71f8c..dae997d 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>> @@ -71,7 +71,6 @@ int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
>>>>  	(*job)->num_ibs = num_ibs;
>>>>
>>>>  	amdgpu_sync_create(&(*job)->sync);
>>>> -	amdgpu_sync_create(&(*job)->sched_sync);
>>>>  	(*job)->vram_lost_counter = atomic_read(&adev->vram_lost_counter);
>>>>  	(*job)->vm_pd_addr = AMDGPU_BO_INVALID_OFFSET;
>>>>
>>>> @@ -117,7 +116,6 @@ static void amdgpu_job_free_cb(struct drm_sched_job *s_job)
>>>>  	amdgpu_ring_priority_put(ring, s_job->s_priority);
>>>>  	dma_fence_put(job->fence);
>>>>  	amdgpu_sync_free(&job->sync);
>>>> -	amdgpu_sync_free(&job->sched_sync);
>>>>  	kfree(job);
>>>>  }
>>>>
>>>> @@ -127,7 +125,6 @@ void amdgpu_job_free(struct amdgpu_job *job)
>>>>
>>>>  	dma_fence_put(job->fence);
>>>>  	amdgpu_sync_free(&job->sync);
>>>> -	amdgpu_sync_free(&job->sched_sync);
>>>>  	kfree(job);
>>>>  }
>>>>
>>>> @@ -182,14 +179,9 @@ static struct dma_fence *amdgpu_job_dependency(struct drm_sched_job *sched_job,
>>>>  	bool need_pipe_sync = false;
>>>>  	int r;
>>>>
>>>> -	fence = amdgpu_sync_get_fence(&job->sync, &need_pipe_sync);
>>>> -	if (fence && need_pipe_sync) {
>>>> -		if (drm_sched_dependency_optimized(fence, s_entity)) {
>>>> -			r = amdgpu_sync_fence(ring->adev, &job->sched_sync,
>>>> -					      fence, false);
>>>> -			if (r)
>>>> -				DRM_ERROR("Error adding fence (%d)\n", r);
>>>> -		}
>>>> +	if (fence && need_pipe_sync && drm_sched_dependency_optimized(fence, s_entity)) {
>>>> +		trace_amdgpu_ib_pipe_sync(job, fence);
>>>> +		job->need_pipe_sync = true;
>>>>  	}
>>>>
>>>>  	while (fence == NULL && vm && !job->vmid) {
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
>>>> index e1b46a6..c1d00f0 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
>>>> @@ -41,7 +41,6 @@ struct amdgpu_job {
>>>>  	struct drm_sched_job	base;
>>>>  	struct amdgpu_vm	*vm;
>>>>  	struct amdgpu_sync	sync;
>>>> -	struct amdgpu_sync	sched_sync;
>>>>  	struct amdgpu_ib	*ibs;
>>>>  	struct dma_fence	*fence; /* the hw fence */
>>>>  	uint32_t		preamble_status;
>>>> @@ -59,7 +58,7 @@ struct amdgpu_job {
>>>>  	/* user fence handling */
>>>>  	uint64_t		uf_addr;
>>>>  	uint64_t		uf_sequence;
>>>> -
>>>> +	bool			need_pipe_sync; /* require a pipeline sync for this job */
>>>>  };
>>>>
>>>>  int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/amd-gfx