I assume that this is the only code change and that no locks are taken
in drm_sched_entity_push_job -

What happens if process A runs drm_sched_entity_push_job after this
code was executed from the (dying) process B and there are still jobs
in the queue (the wait_event terminated prematurely)? The entity has
already been removed from the rq, but 'first' in
drm_sched_entity_push_job will be false, so the entity will not be
reinserted into the rq entity list and no wakeup will be triggered for
process A pushing a new job.
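To make that concrete, here is how I read the push side (a simplified
sketch of the relevant part of drm_sched_entity_push_job, not a patch):

	WRITE_ONCE(entity->last_user, current->group_leader);
	/* 'first' is true only when the queue goes from empty to non-empty */
	first = spsc_queue_push(&entity->job_queue, &sched_job->queue_node);
	if (first) {
		/* only the first job re-adds the entity and wakes the scheduler */
		spin_lock(&entity->rq_lock);
		drm_sched_rq_add_entity(entity->rq, entity);
		spin_unlock(&entity->rq_lock);
		drm_sched_wakeup(entity->rq->sched);
	}
	/* if B left jobs queued, A's push sees a non-empty queue, 'first'
	 * is false, and the entity is never put back on the rq */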
Another issue below -
Andrey
On 08/14/2018 03:05 AM, Christian König wrote:
I would rather like to avoid taking the lock in the hot path.
How about this:
	/* For killed process disable any more IBs enqueue right now */
	last_user = cmpxchg(&entity->last_user, current->group_leader, NULL);
	if ((!last_user || last_user == current->group_leader) &&
	    (current->flags & PF_EXITING) && (current->exit_code == SIGKILL)) {
		grab_lock();
		drm_sched_rq_remove_entity(entity->rq, entity);
		if (READ_ONCE(entity->last_user) != NULL)
This condition is true because just now process A did
drm_sched_entity_push_job->WRITE_ONCE(entity->last_user,
current->group_leader); and so the line below executes and the entity
is reinserted into the rq. Let's also say that the entity's job queue
is empty at this point: then for process A 'first' will be true and
hence drm_sched_entity_push_job->drm_sched_rq_add_entity(entity->rq,
entity) will take place as well, causing double insertion of the
entity into the rq list.
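Spelled out as an interleaving (illustrative only; A and B stand for
the group leaders of the two processes):

	/* B (dying), in drm_sched_entity_flush():                       */
	cmpxchg(&entity->last_user, B, NULL);   /* last_user: B -> NULL  */
	/* A, in drm_sched_entity_push_job():                            */
	WRITE_ONCE(entity->last_user, A);       /* last_user: NULL -> A  */
	first = spsc_queue_push(...);           /* queue empty => true   */
	/* B continues, inside grab_lock()/drop_lock():                  */
	drm_sched_rq_remove_entity(entity->rq, entity);
	if (READ_ONCE(entity->last_user) != NULL)    /* sees A => true   */
		drm_sched_rq_add_entity(entity->rq, entity);  /* 1st add */
	/* A continues, under its own rq_lock, because first == true:    */
	drm_sched_rq_add_entity(entity->rq, entity);          /* 2nd add */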
Andrey
			drm_sched_rq_add_entity(entity->rq, entity);
		drop_lock();
	}
Christian.
On 13.08.2018 18:43, Andrey Grodzovsky wrote:
Attached.
If the general idea in the patch is OK, I can think of a test (and
maybe add it to the libdrm amdgpu tests) to actually simulate this
scenario with 2 forked concurrent processes working on the same
entity's job queue, where one is dying while the other keeps pushing
to the same queue. For now I only tested it with normal boot and
running multiple glxgears instances concurrently - which doesn't
really test this code path, since I think each of them works on its
own FD.
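The test I have in mind would look roughly like this (hypothetical
skeleton only; submit_jobs() is a stand-in for pushing IBs on a
context shared with the child, not actual libdrm test code):

	#include <signal.h>
	#include <sys/wait.h>
	#include <unistd.h>

	void submit_jobs(void);	/* hypothetical: pushes jobs to the shared entity */

	static void test_push_vs_dying_process(void)
	{
		pid_t child = fork();

		if (child == 0)			/* child submits until killed */
			for (;;)
				submit_jobs();

		kill(child, SIGKILL);		/* may leave jobs queued on the entity */
		submit_jobs();			/* parent keeps pushing to the same queue */
		waitpid(child, NULL, 0);
	}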
Andrey
On 08/10/2018 09:27 AM, Christian König wrote:
Crap, yeah indeed that needs to be protected by some lock.
Going to prepare a patch for that,
Christian.
On 09.08.2018 21:49, Andrey Grodzovsky wrote:
Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@xxxxxxx>
But I still have questions about entity->last_user (didn't notice
this before) -
Looks to me like there is a race condition with its current usage.
Let's say process A was preempted after doing
drm_sched_entity_flush->cmpxchg(...); now process B, working on the
same entity (forked), is inside drm_sched_entity_push_job: it writes
its PID to entity->last_user and also executes
drm_sched_rq_add_entity. Now process A runs again and executes
drm_sched_rq_remove_entity, inadvertently causing process B's removal
from its scheduler rq.
Looks to me like instead we should lock together the
entity->last_user accesses and the adds/removals of the entity
to/from the rq.
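Something like this is what I mean (untested sketch reusing the
existing rq_lock, not a real patch):

	/* flush side (dying process): last_user check and rq removal
	 * become one atomic step under the lock */
	spin_lock(&entity->rq_lock);
	if (entity->last_user == current->group_leader &&
	    (current->flags & PF_EXITING) && (current->exit_code == SIGKILL))
		drm_sched_rq_remove_entity(entity->rq, entity);
	spin_unlock(&entity->rq_lock);

	/* push side: last_user update and re-add become one atomic step
	 * under the same lock */
	spin_lock(&entity->rq_lock);
	entity->last_user = current->group_leader;
	if (first)
		drm_sched_rq_add_entity(entity->rq, entity);
	spin_unlock(&entity->rq_lock);
	if (first)
		drm_sched_wakeup(entity->rq->sched);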
Andrey
On 08/06/2018 10:18 AM, Nayan Deshmukh wrote:
I forgot about this since we started discussing possible scenarios
of processes and threads.

In any case, this check is redundant. Acked-by: Nayan Deshmukh
<nayan26deshmukh@xxxxxxxxx>
Nayan
Ping.
Any objections to that?
Christian.
On 03.08.2018 13:08, Christian König wrote:
> That is superfluous now.
>
> Signed-off-by: Christian König <christian.koenig@xxxxxxx>
> ---
>  drivers/gpu/drm/scheduler/gpu_scheduler.c | 5 -----
>  1 file changed, 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/scheduler/gpu_scheduler.c b/drivers/gpu/drm/scheduler/gpu_scheduler.c
> index 85908c7f913e..65078dd3c82c 100644
> --- a/drivers/gpu/drm/scheduler/gpu_scheduler.c
> +++ b/drivers/gpu/drm/scheduler/gpu_scheduler.c
> @@ -590,11 +590,6 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job,
>  	if (first) {
>  		/* Add the entity to the run queue */
>  		spin_lock(&entity->rq_lock);
> -		if (!entity->rq) {
> -			DRM_ERROR("Trying to push to a killed entity\n");
> -			spin_unlock(&entity->rq_lock);
> -			return;
> -		}
>  		drm_sched_rq_add_entity(entity->rq, entity);
>  		spin_unlock(&entity->rq_lock);
>  		drm_sched_wakeup(entity->rq->sched);
_______________________________________________
dri-devel mailing list
dri-devel@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/dri-devel