Hi Christian,

Thanks for your good idea, will try this.

Best Wishes,
Emily Deng

> -----Original Message-----
> From: Koenig, Christian
> Sent: Thursday, March 29, 2018 5:05 PM
> To: Liu, Monk <Monk.Liu at amd.com>; Deng, Emily <Emily.Deng at amd.com>;
> amd-gfx at lists.freedesktop.org
> Subject: Re: [PATCH] drm/gpu-sched: fix force APP kill hang
>
> I think the core of the problem is that we don't abort all entities of
> the process at the same time.
>
> How about splitting drm_sched_entity_fini() into two functions?
>
> The first one does the waiting, removes the entity from the runqueue
> and returns an error when the process was killed.
>
> During destruction this one is called first for all contexts as well as
> the entity for VM updates.
>
> The second one then goes over the entity and signals all jobs with an
> error code.
>
> This way no VM updates should be executed any longer after the process
> is killed (something which doesn't make sense anyway and just costs us
> time).
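>
> A completely untested rough sketch of what I mean; the two function
> names and the error code are only placeholders to illustrate the idea,
> not a finished implementation:
>
> /* Step one: stop feeding the hardware. Wait for the jobs which are
>  * already scheduled; wait_event_killable() aborts the wait when the
>  * process received a SIGKILL, which is exactly what we want during
>  * process teardown.
>  */
> int drm_sched_entity_do_release(struct drm_gpu_scheduler *sched,
>                                 struct drm_sched_entity *entity)
> {
>         int ret;
>
>         /* returns -ERESTARTSYS when the process was killed */
>         ret = wait_event_killable(sched->job_scheduled,
>                                   drm_sched_entity_is_idle(entity));
>
>         /* either way, stop scheduling new jobs from this entity */
>         drm_sched_rq_remove_entity(entity->rq, entity);
>         return ret;
> }
>
> /* Step two: called for every entity of the process (all contexts and
>  * the VM update entity) only after step one ran for each of them.
>  * Flushes the not yet scheduled jobs out with an error code, so
>  * nothing of the killed process is executed any more.
>  */
> void drm_sched_entity_cleanup(struct drm_gpu_scheduler *sched,
>                               struct drm_sched_entity *entity)
> {
>         struct drm_sched_job *job;
>
>         while ((job = to_drm_sched_job(spsc_queue_pop(&entity->job_queue)))) {
>                 struct drm_sched_fence *s_fence = job->s_fence;
>
>                 dma_fence_set_error(&s_fence->finished, -ESRCH);
>                 drm_sched_fence_scheduled(s_fence);
>                 drm_sched_fence_finished(s_fence);
>                 dma_fence_put(&s_fence->finished);
>                 sched->ops->free_job(job);
>         }
> }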
>
> Regards,
> Christian.
>
> On 29.03.2018 06:14, Liu, Monk wrote:
> >> 2) if a fence is signaled and tries to clear some entity's
> >> dependency, we should set this entity guilty to prevent its job from
> >> really running, since the dependency is fake signaled.
> > Well, that is a clear NAK. It would mean that things like the X
> > server or Wayland queue are marked guilty because some client is
> > killed.
> >
> > And since unmapping/clear jobs don't have a guilty pointer it should
> > actually not have any effect on the issue.
> >
> >
> > [ML], yeah that's a good point. How about this way: if a fence is
> > fake signaled and tries to clear another entity's dependency, we only
> > allow the entity to be marked as guilty if it shares the same ctx (or
> > even process?) as the entity of the job of that fake signaled fence?
> > This way a fake signaled GFX fence of a certain process won't be able
> > to wake another VCE/SDMA job.
> >
> > /Monk
> >
> > -----Original Message-----
> > From: Christian König [mailto:ckoenig.leichtzumerken at gmail.com]
> > Sent: March 28, 2018 19:57
> > To: Deng, Emily <Emily.Deng at amd.com>; amd-gfx at lists.freedesktop.org
> > Cc: Liu, Monk <Monk.Liu at amd.com>
> > Subject: Re: [PATCH] drm/gpu-sched: fix force APP kill hang
> >
> > On 28.03.2018 10:07, Emily Deng wrote:
> >> issue:
> >> there are VMC page faults occurring if the APP is force killed
> >> during a 3dmark test. The cause is that in entity_fini() we manually
> >> signal all those jobs in the entity's queue, which confuses the
> >> sync/dep mechanism:
> >>
> >> 1) page fault occurred in sdma's clear job which operates on the
> >> shadow buffer, and the shadow buffer's GART table was cleaned by
> >> ttm_bo_release since the fence in its reservation was fake signaled
> >> by entity_fini() under the case of SIGKILL received.
> >>
> >> 2) page fault occurred in gfx's job because during the lifetime of
> >> the gfx job we manually fake signal all jobs from its entity in
> >> entity_fini(), thus the unmapping/clear PTE jobs depending on those
> >> result fences are satisfied, and sdma starts clearing the PTEs,
> >> leading to a GFX page fault.
> > Nice catch, but the fixes won't work like this.
> >
> >> fix:
> >> 1) should at least wait in entity_fini() for all already scheduled
> >> jobs to complete if SIGKILL is the case.
> > Well that is not a good idea because when we kill a process we
> > actually want to tear down the task as fast as possible and not wait
> > for anything. That is actually the reason why we have this handling.
> >
> >> 2) if a fence is signaled and tries to clear some entity's
> >> dependency, we should set this entity guilty to prevent its job from
> >> really running, since the dependency is fake signaled.
> > Well, that is a clear NAK. It would mean that things like the X
> > server or Wayland queue are marked guilty because some client is
> > killed.
> >
> > And since unmapping/clear jobs don't have a guilty pointer it should
> > actually not have any effect on the issue.
> >
> > Regards,
> > Christian.
> >
> >
> >> related issue ticket:
> >> http://ontrack-internal.amd.com/browse/SWDEV-147564?filter=-1
> >>
> >> Signed-off-by: Monk Liu <Monk.Liu at amd.com>
> >> ---
> >>   drivers/gpu/drm/scheduler/gpu_scheduler.c | 36 +++++++++++++++++++++++++++++++
> >>   1 file changed, 36 insertions(+)
> >>
> >> diff --git a/drivers/gpu/drm/scheduler/gpu_scheduler.c b/drivers/gpu/drm/scheduler/gpu_scheduler.c
> >> index 2bd69c4..9b306d3 100644
> >> --- a/drivers/gpu/drm/scheduler/gpu_scheduler.c
> >> +++ b/drivers/gpu/drm/scheduler/gpu_scheduler.c
> >> @@ -198,6 +198,28 @@ static bool drm_sched_entity_is_ready(struct drm_sched_entity *entity)
> >>   	return true;
> >>   }
> >>
> >> +static void drm_sched_entity_wait_otf_signal(struct drm_gpu_scheduler *sched,
> >> +					     struct drm_sched_entity *entity)
> >> +{
> >> +	struct drm_sched_job *last;
> >> +	signed long r;
> >> +
> >> +	spin_lock(&sched->job_list_lock);
> >> +	list_for_each_entry_reverse(last, &sched->ring_mirror_list, node)
> >> +		if (last->s_fence->scheduled.context == entity->fence_context) {
> >> +			dma_fence_get(&last->s_fence->finished);
> >> +			break;
> >> +		}
> >> +	spin_unlock(&sched->job_list_lock);
> >> +
> >> +	if (&last->node != &sched->ring_mirror_list) {
> >> +		r = dma_fence_wait_timeout(&last->s_fence->finished, false, msecs_to_jiffies(500));
> >> +		if (r == 0)
> >> +			DRM_WARN("wait on the fly job timeout\n");
> >> +		dma_fence_put(&last->s_fence->finished);
> >> +	}
> >> +}
> >> +
> >>   /**
> >>    * Destroy a context entity
> >>    *
> >> @@ -238,6 +260,12 @@ void drm_sched_entity_fini(struct drm_gpu_scheduler *sched,
> >>   		entity->dependency = NULL;
> >>   	}
> >>
> >> +	/* Wait till all jobs from this entity have really finished,
> >> +	 * otherwise the fake signaling below would kickstart sdma's clear
> >> +	 * PTE jobs and lead to a vm fault
> >> +	 */
> >> +	drm_sched_entity_wait_otf_signal(sched, entity);
> >> +
> >>   	while ((job = to_drm_sched_job(spsc_queue_pop(&entity->job_queue)))) {
> >>   		struct drm_sched_fence *s_fence = job->s_fence;
> >>   		drm_sched_fence_scheduled(s_fence);
> >> @@ -255,6 +283,14 @@ static void drm_sched_entity_wakeup(struct dma_fence *f, struct dma_fence_cb *cb
> >>   {
> >>   	struct drm_sched_entity *entity =
> >>   		container_of(cb, struct drm_sched_entity, cb);
> >> +
> >> +	/* Set the entity guilty since its dependency is
> >> +	 * not really cleared but fake signaled (by SIGKILL
> >> +	 * or GPU recover)
> >> +	 */
> >> +	if (f->error && entity->guilty)
> >> +		atomic_set(entity->guilty, 1);
> >> +
> >>   	entity->dependency = NULL;
> >>   	dma_fence_put(f);
> >>   	drm_sched_wakeup(entity->sched);
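
PS: For the same-ctx restriction Monk suggested above, I imagine it would
look roughly like this in the wakeup callback. This is only a sketch:
drm_sched_same_ctx() is a made-up helper, since drm_sched_entity currently
has no backpointer to the ctx/process that created it, so a new field set
at entity init time would be needed:

/* Hypothetical helper: true when the fake signaled dependency and the
 * woken entity belong to the same ctx (or process). Needs a new field
 * in drm_sched_entity; left undefined here on purpose.
 */
static bool drm_sched_same_ctx(struct drm_sched_fence *s_fence,
			       struct drm_sched_entity *entity);

static void drm_sched_entity_wakeup(struct dma_fence *f,
				    struct dma_fence_cb *cb)
{
	struct drm_sched_entity *entity =
		container_of(cb, struct drm_sched_entity, cb);
	/* NULL when the dependency isn't a scheduler fence at all */
	struct drm_sched_fence *s_fence = to_drm_sched_fence(f);

	/* Restrict the guilty marking to dependencies which were fake
	 * signaled for an entity of the same ctx, so a killed client
	 * can't mark e.g. the X server's queue guilty.
	 */
	if (f->error && entity->guilty && s_fence &&
	    drm_sched_same_ctx(s_fence, entity))
		atomic_set(entity->guilty, 1);

	entity->dependency = NULL;
	dma_fence_put(f);
	drm_sched_wakeup(entity->sched);
}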