Attached. If the general idea in the patch is OK, I can come up with a test (and maybe add it to the libdrm amdgpu tests) to actually simulate this scenario: two forked concurrent processes working on the same entity's job queue, one dying while the other keeps pushing to that queue - roughly along the lines of the sketch below. For now I have only tested it with a normal boot and by running multiple glxgears instances concurrently, which doesn't really exercise this code path since I think each of them works on its own FD.
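Something like this rough user-space sketch is what I have in mind. It is only a sketch: submit_nop_job() stands in for whatever helper the amdgpu test suite would use to push a trivial IB on a context created before fork() (it is not a real libdrm call), and actually hitting the window would probably need many iterations or an artificial delay in drm_sched_entity_flush:

/*
 * Sketch only: one DRM fd and one context are set up before fork(), so
 * both children submit to the same drm_sched_entity. Child A pushes a
 * few jobs and then dies by SIGKILL, which makes its exit path run
 * drm_sched_entity_flush() while child B keeps pushing.
 */
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: push one trivial IB on the shared context. */
extern int submit_nop_job(void);

static void dying_submitter(void)
{
	int i;

	/* push a few jobs, then die by SIGKILL so the exit path runs
	 * drm_sched_entity_flush() with exit_code == SIGKILL */
	for (i = 0; i < 16; i++)
		submit_nop_job();
	raise(SIGKILL);
}

static void busy_submitter(void)
{
	int i;

	/* keep pushing to the same entity while the sibling is dying */
	for (i = 0; i < 100000; i++)
		if (submit_nop_job())
			_exit(1);	/* lost or stuck submission: test fails */
	_exit(0);
}

int main(void)
{
	int status = 0;
	pid_t a, b;

	/* amdgpu device + context setup would go here, before fork(), so
	 * both children submit through the same scheduler entity */

	a = fork();
	if (a == 0) {
		dying_submitter();
		_exit(0);	/* not reached, SIGKILL terminates us */
	}

	b = fork();
	if (b == 0) {
		busy_submitter();
		_exit(0);	/* not reached, busy_submitter() exits itself */
	}

	if (a < 0 || b < 0)
		return EXIT_FAILURE;

	waitpid(a, &status, 0);
	waitpid(b, &status, 0);	/* status now holds B's exit status */

	/* pass only if B's submissions all completed despite A's SIGKILL */
	return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ?
		EXIT_SUCCESS : EXIT_FAILURE;
}

The point is just that one forked process exits via SIGKILL, so its exit path runs drm_sched_entity_flush while the sibling is in the middle of drm_sched_entity_push_job on the same entity.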
Andrey

On 08/10/2018 09:27 AM, Christian König wrote:
>From c581a410e1b3f82de2bb14746d8484db2162f82c Mon Sep 17 00:00:00 2001
From: Andrey Grodzovsky <andrey.grodzovsky@xxxxxxx>
Date: Mon, 13 Aug 2018 12:33:29 -0400
Subject: drm/scheduler: Fix possible race condition.

Problem:
The possible race scenario is as follows -
Process A is preempted right after doing drm_sched_entity_flush->cmpxchg(...).
Now process B, forked from A and working on the same entity, is inside
drm_sched_entity_push_job: it writes its PID to entity->last_user and also
executes drm_sched_rq_add_entity. Process A then runs again and executes
drm_sched_rq_remove_entity, inadvertently removing process B's entity from
its scheduler rq.

Fix:
Take entity->rq_lock around both the entity->last_user accesses and the
adds/removals of the entity to/from the rq.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@xxxxxxx>
---
 drivers/gpu/drm/scheduler/gpu_scheduler.c | 22 +++++++++++++++-------
 1 file changed, 15 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/gpu_scheduler.c b/drivers/gpu/drm/scheduler/gpu_scheduler.c
index f566405..5649c3d 100644
--- a/drivers/gpu/drm/scheduler/gpu_scheduler.c
+++ b/drivers/gpu/drm/scheduler/gpu_scheduler.c
@@ -311,10 +311,12 @@ long drm_sched_entity_flush(struct drm_sched_entity *entity, long timeout)
 
 
 	/* For killed process disable any more IBs enqueue right now */
+	spin_lock(&entity->rq_lock);
 	last_user = cmpxchg(&entity->last_user, current->group_leader, NULL);
 	if ((!last_user || last_user == current->group_leader) &&
 	    (current->flags & PF_EXITING) && (current->exit_code == SIGKILL))
 		drm_sched_rq_remove_entity(entity->rq, entity);
+	spin_unlock(&entity->rq_lock);
 
 	return ret;
 }
@@ -596,17 +598,23 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job,
 	trace_drm_sched_job(sched_job, entity);
 
 	atomic_inc(&entity->rq->sched->num_jobs);
-	WRITE_ONCE(entity->last_user, current->group_leader);
 	first = spsc_queue_push(&entity->job_queue, &sched_job->queue_node);
 
-	/* first job wakes up scheduler */
-	if (first) {
-		/* Add the entity to the run queue */
-		spin_lock(&entity->rq_lock);
+	/*
+	 * entity might not be attached to rq because it's the first time we
+	 * use it or because another process removed the entity from the rq
+	 * during its termination in drm_sched_entity_flush. We then need to
+	 * add the entity back to the rq.
+	 */
+	spin_lock(&entity->rq_lock);
+	entity->last_user = current->group_leader;
+	if (list_empty(&entity->list))
 		drm_sched_rq_add_entity(entity->rq, entity);
-		spin_unlock(&entity->rq_lock);
+	spin_unlock(&entity->rq_lock);
+
+	/* first job wakes up scheduler */
+	if (first)
 		drm_sched_wakeup(entity->rq->sched);
-	}
 }
 
 EXPORT_SYMBOL(drm_sched_entity_push_job);
-- 
2.7.4
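To spell out the interleaving the patch closes (reconstructed from the pre-patch code paths the diff touches, not a verbatim copy of the file):

	/* Process A, exiting after SIGKILL, in drm_sched_entity_flush(): */
	last_user = cmpxchg(&entity->last_user, current->group_leader, NULL);
	/* (1) A is preempted here, holding the old last_user value */

	/* Process B, forked from A, in drm_sched_entity_push_job(): */
	WRITE_ONCE(entity->last_user, current->group_leader);	/* (2) */
	first = spsc_queue_push(&entity->job_queue, &sched_job->queue_node);
	if (first) {
		spin_lock(&entity->rq_lock);
		drm_sched_rq_add_entity(entity->rq, entity);	/* (3) */
		spin_unlock(&entity->rq_lock);
		drm_sched_wakeup(entity->rq->sched);
	}

	/* Process A resumes; last_user from (1) is still its own group leader: */
	if ((!last_user || last_user == current->group_leader) &&
	    (current->flags & PF_EXITING) && (current->exit_code == SIGKILL))
		drm_sched_rq_remove_entity(entity->rq, entity);	/* (4) undoes (3) */

After (4) the entity is off the rq even though B just queued work on it, so B's jobs are never scheduled. With the patch, (1)+(4) and (2)+(3) each run under entity->rq_lock, and push_job re-adds the entity via the list_empty() check if a racing flush has already removed it.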
_______________________________________________
dri-devel mailing list
dri-devel@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/dri-devel