On 28.03.2018 10:07, Emily Deng wrote:
> issue:
> A VMC page fault occurs if the app is force-killed during a 3dmark
> test. The cause is that in entity_fini() we manually signal all the
> jobs left in the entity's queue, which confuses the sync/dependency
> mechanism:
>
> 1) A page fault occurs in SDMA's clear job, which operates on a
> shadow buffer, because the shadow buffer's GART table was already
> cleaned up by ttm_bo_release(): the fence in its reservation object
> was fake-signaled by entity_fini() after SIGKILL was received.
>
> 2) A page fault occurs in a GFX job because, during the lifetime of
> that GFX job, we manually fake-signal all jobs from its entity in
> entity_fini(). The unmapping/clear-PTE job that depends on those
> result fences is therefore considered satisfied, SDMA starts clearing
> the PTEs, and this leads to a GFX page fault.

Nice catch, but the fixes won't work like this.

> fix:
> 1) In the SIGKILL case, entity_fini() should at least wait for all
> already-scheduled jobs to complete.

Well, that is not a good idea, because when we kill a process we
actually want to tear down the task as fast as possible and not wait
for anything. That is the whole reason why we have this handling in
the first place.

> 2) If a fence signals and clears some entity's dependency, mark that
> entity guilty to prevent its jobs from really running, since the
> dependency was only fake-signaled.

Well, that is a clear NAK. It would mean that things like the X server
or Wayland queue get marked guilty just because some client was killed.

And since the unmapping/clear jobs don't have a guilty pointer, it
should actually not have any effect on the issue anyway.

Regards,
Christian.

>
> related issue ticket:
> http://ontrack-internal.amd.com/browse/SWDEV-147564?filter=-1
>
> Signed-off-by: Monk Liu <Monk.Liu at amd.com>
> ---
>  drivers/gpu/drm/scheduler/gpu_scheduler.c | 36 +++++++++++++++++++++++++++++++
>  1 file changed, 36 insertions(+)
>
> diff --git a/drivers/gpu/drm/scheduler/gpu_scheduler.c b/drivers/gpu/drm/scheduler/gpu_scheduler.c
> index 2bd69c4..9b306d3 100644
> --- a/drivers/gpu/drm/scheduler/gpu_scheduler.c
> +++ b/drivers/gpu/drm/scheduler/gpu_scheduler.c
> @@ -198,6 +198,28 @@ static bool drm_sched_entity_is_ready(struct drm_sched_entity *entity)
>          return true;
>  }
>
> +static void drm_sched_entity_wait_otf_signal(struct drm_gpu_scheduler *sched,
> +                                             struct drm_sched_entity *entity)
> +{
> +        struct drm_sched_job *last;
> +        signed long r;
> +
> +        spin_lock(&sched->job_list_lock);
> +        list_for_each_entry_reverse(last, &sched->ring_mirror_list, node)
> +                if (last->s_fence->scheduled.context == entity->fence_context) {
> +                        dma_fence_get(&last->s_fence->finished);
> +                        break;
> +                }
> +        spin_unlock(&sched->job_list_lock);
> +
> +        if (&last->node != &sched->ring_mirror_list) {
> +                r = dma_fence_wait_timeout(&last->s_fence->finished, false, msecs_to_jiffies(500));
> +                if (r == 0)
> +                        DRM_WARN("wait on the fly job timeout\n");
> +                dma_fence_put(&last->s_fence->finished);
> +        }
> +}
> +
>  /**
>   * Destroy a context entity
>   *
> @@ -238,6 +260,12 @@ void drm_sched_entity_fini(struct drm_gpu_scheduler *sched,
>                  entity->dependency = NULL;
>          }
>
> +        /* Wait until all jobs from this entity have really finished,
> +         * otherwise the fake signaling below would kick off SDMA's
> +         * clear-PTE jobs and lead to a VM fault.
> +         */
> +        drm_sched_entity_wait_otf_signal(sched, entity);
> +
>          while ((job = to_drm_sched_job(spsc_queue_pop(&entity->job_queue)))) {
>                  struct drm_sched_fence *s_fence = job->s_fence;
>                  drm_sched_fence_scheduled(s_fence);
> @@ -255,6 +283,14 @@ static void drm_sched_entity_wakeup(struct dma_fence *f,
>                                      struct dma_fence_cb *cb)
>  {
>          struct drm_sched_entity *entity =
>                  container_of(cb, struct drm_sched_entity, cb);
> +
> +        /* Mark the entity guilty since its dependency was
> +         * not really cleared but fake-signaled (by SIGKILL
> +         * or GPU recovery).
> +         */
> +        if (f->error && entity->guilty)
> +                atomic_set(entity->guilty, 1);
> +
>          entity->dependency = NULL;
>          dma_fence_put(f);
>          drm_sched_wakeup(entity->sched);
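
For reference, entity->guilty is consumed when the scheduler pops the next
job from the entity queue. The snippet below is only a simplified sketch of
that path (roughly drm_sched_entity_pop_job() in gpu_scheduler.c, with the
dependency handling abbreviated), not the exact upstream code; it illustrates
why a guilty entity effectively loses all of its remaining work:

/* Simplified sketch: once entity->guilty is set, every further job popped
 * from this entity gets its finished fence flagged with -ECANCELED, which
 * the driver backend uses to skip actually executing the job.
 */
static struct drm_sched_job *
drm_sched_entity_pop_job(struct drm_sched_entity *entity)
{
        struct drm_sched_job *sched_job =
                to_drm_sched_job(spsc_queue_peek(&entity->job_queue));

        if (!sched_job)
                return NULL;

        /* ... dependency handling omitted ... */

        /* skip jobs from an entity that was marked guilty */
        if (entity->guilty && atomic_read(entity->guilty))
                dma_fence_set_error(&sched_job->s_fence->finished, -ECANCELED);

        /* ... */
        return sched_job;
}

Setting the flag from the dependency callback would therefore silently cancel
all future submissions of long-living entities like a display server, while
the kernel-internal clear/unmap jobs, which have no guilty pointer, would keep
running unaffected.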