issue:
A VMC page fault occurs if the application is force-killed during a
3DMark test. The cause is that in entity_fini() we manually signal all
jobs still in the entity's queue, which confuses the sync/dependency
mechanism:

1) A page fault occurs in SDMA's clear job, which operates on the shadow
   buffer: the shadow buffer's GART table is torn down by
   ttm_bo_release() because the fence in its reservation object was
   fake-signaled by entity_fini() in the SIGKILL case.

2) A page fault occurs in a GFX job because, during the lifetime of that
   GFX job, all jobs from its entity are fake-signaled in entity_fini().
   The unmap/clear-PTE job that depends on those result fences is thus
   considered satisfied, so SDMA starts clearing the PTEs and the GFX
   job faults.

fix:
1) In entity_fini(), at least wait for all already-scheduled jobs to
   complete, in the SIGKILL case.

2) If a signaled fence clears some entity's dependency while carrying an
   error, mark that entity guilty to prevent its jobs from really
   running, since the dependency was only fake-signaled.

related issue ticket:
http://ontrack-internal.amd.com/browse/SWDEV-147564?filter=-1

Signed-off-by: Monk Liu <Monk.Liu at amd.com>
---
 drivers/gpu/drm/scheduler/gpu_scheduler.c | 36 +++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

diff --git a/drivers/gpu/drm/scheduler/gpu_scheduler.c b/drivers/gpu/drm/scheduler/gpu_scheduler.c
index 2bd69c4..9b306d3 100644
--- a/drivers/gpu/drm/scheduler/gpu_scheduler.c
+++ b/drivers/gpu/drm/scheduler/gpu_scheduler.c
@@ -198,6 +198,28 @@ static bool drm_sched_entity_is_ready(struct drm_sched_entity *entity)
 	return true;
 }
 
+static void drm_sched_entity_wait_otf_signal(struct drm_gpu_scheduler *sched,
+					     struct drm_sched_entity *entity)
+{
+	struct drm_sched_job *last;
+	signed long r;
+
+	spin_lock(&sched->job_list_lock);
+	list_for_each_entry_reverse(last, &sched->ring_mirror_list, node)
+		if (last->s_fence->scheduled.context == entity->fence_context) {
+			dma_fence_get(&last->s_fence->finished);
+			break;
+		}
+	spin_unlock(&sched->job_list_lock);
+
+	if (&last->node != &sched->ring_mirror_list) {
+		r = dma_fence_wait_timeout(&last->s_fence->finished, false, msecs_to_jiffies(500));
+		if (r == 0)
+			DRM_WARN("wait on the fly job timeout\n");
+		dma_fence_put(&last->s_fence->finished);
+	}
+}
+
 /**
  * Destroy a context entity
  *
@@ -238,6 +260,12 @@ void drm_sched_entity_fini(struct drm_gpu_scheduler *sched,
 		entity->dependency = NULL;
 	}
 
+	/* Wait until all jobs from this entity have really finished;
+	 * otherwise the fake signaling below would kick off SDMA's
+	 * clear-PTE jobs and lead to a VM fault.
+	 */
+	drm_sched_entity_wait_otf_signal(sched, entity);
+
 	while ((job = to_drm_sched_job(spsc_queue_pop(&entity->job_queue)))) {
 		struct drm_sched_fence *s_fence = job->s_fence;
 		drm_sched_fence_scheduled(s_fence);
@@ -255,6 +283,14 @@ static void drm_sched_entity_wakeup(struct dma_fence *f, struct dma_fence_cb *cb
 {
 	struct drm_sched_entity *entity =
 		container_of(cb, struct drm_sched_entity, cb);
+
+	/* Mark the entity guilty, since its dependency was not
+	 * really cleared but fake-signaled (by SIGKILL or GPU
+	 * recovery).
+	 */
+	if (f->error && entity->guilty)
+		atomic_set(entity->guilty, 1);
+
 	entity->dependency = NULL;
 	dma_fence_put(f);
 	drm_sched_wakeup(entity->sched);
-- 
2.7.4