Re: [PATCH] drm/scheduler: Remove entity->rq NULL check

Andrey Grodzovsky <Andrey.Grodzovsky@xxxxxxx> · Mon, 13 Aug 2018 12:43:25 -0400

Attached. 
    If the general idea in the patch is OK I can think of a test (and
      maybe add to libdrm amdgpu tests) to actually simulate this
      scenario with 2 forked
    concurrent processes working on same entity's job queue when one
      is dying while the other keeps pushing to the same queue. For now
      I only tested it
    with normal boot and ruining multiple glxgears concurrently -
      which doesn't really test this code path since i think each of
      them works on it's own FD.

    
    Andrey

    
    

    On 08/10/2018 09:27 AM, Christian König
      wrote:

    
    
      
      Crap, yeah indeed that needs to be
        protected by some lock.

        

        Going to prepare a patch for that,

        Christian.

        

        Am 09.08.2018 um 21:49 schrieb Andrey Grodzovsky:

      
      
        Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@xxxxxxx>
        

        
        But I still  have questions about entity->last_user
          (didn't notice this before) - 

        
        Looks to me there is a race condition with it's current
          usage, let's say process A was preempted after doing
          drm_sched_entity_flush->cmpxchg(...)
        now process B working on same entity (forked) is inside
          drm_sched_entity_push_job, he writes his PID to
          entity->last_user and also
        executes drm_sched_rq_add_entity. Now process A runs again
          and execute drm_sched_rq_remove_entity inadvertently causing
          process B removal
        from it's scheduler rq.
        Looks to me like instead we should lock together
          entity->last_user accesses and adds/removals of entity to
          the rq.
        Andrey

        
        

        On 08/06/2018 10:18 AM, Nayan
          Deshmukh wrote:

        
        
          
            
              I forgot about this since we started discussing
                possible scenarios of processes and threads.

                

              
              In any case, this check is redundant. Acked-by: Nayan
              Deshmukh <nayan26deshmukh@xxxxxxxxx>

              

            
            Nayan

          
          

          
            On Mon, Aug 6, 2018 at 7:43 PM Christian
              König <ckoenig.leichtzumerken@xxxxxxxxx>
              wrote:

            
            Ping.
              Any objections to that?

              

              Christian.

              

              Am 03.08.2018 um 13:08 schrieb Christian König:

              > That is superflous now.

              >

              > Signed-off-by: Christian König <christian.koenig@xxxxxxx>

              > ---

              >   drivers/gpu/drm/scheduler/gpu_scheduler.c | 5 -----

              >   1 file changed, 5 deletions(-)

              >

              > diff --git
              a/drivers/gpu/drm/scheduler/gpu_scheduler.c
              b/drivers/gpu/drm/scheduler/gpu_scheduler.c

              > index 85908c7f913e..65078dd3c82c 100644

              > --- a/drivers/gpu/drm/scheduler/gpu_scheduler.c

              > +++ b/drivers/gpu/drm/scheduler/gpu_scheduler.c

              > @@ -590,11 +590,6 @@ void
              drm_sched_entity_push_job(struct drm_sched_job *sched_job,

              >       if (first) {

              >               /* Add the entity to the run queue */

              >               spin_lock(&entity->rq_lock);

              > -             if (!entity->rq) {

              > -                     DRM_ERROR("Trying to push to a
              killed entity\n");

              > -                   
               spin_unlock(&entity->rq_lock);

              > -                     return;

              > -             }

              >               drm_sched_rq_add_entity(entity->rq,
              entity);

              >               spin_unlock(&entity->rq_lock);

              >             
               drm_sched_wakeup(entity->rq->sched);

              

            
          
        
        

      
      

    
    

  

>From c581a410e1b3f82de2bb14746d8484db2162f82c Mon Sep 17 00:00:00 2001
From: Andrey Grodzovsky <andrey.grodzovsky@xxxxxxx>
Date: Mon, 13 Aug 2018 12:33:29 -0400
Subject: drm/scheduler: Fix possible race condition.

Problem:
The possbile race scenario is as follwing -
Process A was preempted after doing drm_sched_entity_flush->cmpxchg(...)
now process B working on same entity (forked) is inside
drm_sched_entity_push_job, he writes his PID to entity->last_user and also
executes drm_sched_rq_add_entity. Now process A runs again and
execute drm_sched_rq_remove_entity inadvertently causing process B removal
from it's scheduler rq.

Fix:
Lock together entity->last_user accesses and adds/removals
of entity to the rq.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@xxxxxxx>
---
 drivers/gpu/drm/scheduler/gpu_scheduler.c | 22 +++++++++++++++-------
 1 file changed, 15 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/gpu_scheduler.c b/drivers/gpu/drm/scheduler/gpu_scheduler.c
index f566405..5649c3d 100644
--- a/drivers/gpu/drm/scheduler/gpu_scheduler.c
+++ b/drivers/gpu/drm/scheduler/gpu_scheduler.c
@@ -311,10 +311,12 @@ long drm_sched_entity_flush(struct drm_sched_entity *entity, long timeout)
 
 
 	/* For killed process disable any more IBs enqueue right now */
+	spin_lock(&entity->rq_lock);
 	last_user = cmpxchg(&entity->last_user, current->group_leader, NULL);
 	if ((!last_user || last_user == current->group_leader) &&
 	    (current->flags & PF_EXITING) && (current->exit_code == SIGKILL))
 		drm_sched_rq_remove_entity(entity->rq, entity);
+	spin_unlock(&entity->rq_lock);
 
 	return ret;
 }
@@ -596,17 +598,23 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job,
 
 	trace_drm_sched_job(sched_job, entity);
 	atomic_inc(&entity->rq->sched->num_jobs);
-	WRITE_ONCE(entity->last_user, current->group_leader);
 	first = spsc_queue_push(&entity->job_queue, &sched_job->queue_node);
 
-	/* first job wakes up scheduler */
-	if (first) {
-		/* Add the entity to the run queue */
-		spin_lock(&entity->rq_lock);
+	/*
+	 * entity might not be attached to rq because it's the first time we use it
+	 * or because another process removed the entity from the rq during
+	 * it's termination in drm_sched_entity_flush. We then need to add
+	 * back the entity to rq
+	 */
+	spin_lock(&entity->rq_lock);
+	entity->last_user = current->group_leader;
+	if (list_empty(&entity->list))
 		drm_sched_rq_add_entity(entity->rq, entity);
-		spin_unlock(&entity->rq_lock);
+	spin_unlock(&entity->rq_lock);
+
+	/* first job wakes up scheduler */
+	if (first)
 		drm_sched_wakeup(entity->rq->sched);
-	}
 }
 EXPORT_SYMBOL(drm_sched_entity_push_job);
 
-- 
2.7.4

_______________________________________________
dri-devel mailing list
dri-devel@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/dri-devel