Re: [PATCH] drm/scheduler: Remove entity->rq NULL check

Christian König <christian.koenig@xxxxxxx> · Tue, 14 Aug 2018 17:26:56 +0200



    On 14.08.2018 at 17:17, Andrey Grodzovsky wrote:

    
      I assume that this is the only code change and no locks are
        taken in drm_sched_entity_push_job - 

      
    What are you talking about? You surely take locks now in
    drm_sched_entity_push_job():

    +    spin_lock(&entity->rq_lock);
    +    entity->last_user = current->group_leader;
    +    if (list_empty(&entity->list))

    
      What happens if process A runs drm_sched_entity_push_job after
      this code was executed from the (dying) process B while there are
      still jobs in the queue (the wait_event terminated prematurely)?
      The entity has already been removed from the rq, but bool 'first'
      in drm_sched_entity_push_job will be false, so the entity will not
      be reinserted into the rq entity list and no wake-up will be
      triggered for process A pushing a new job.
    
    
    Thought about this as well, but in this case I would say: Shit
    happens!

    
    The dying process did some command submission and because of this
    the entity was killed as well when the process died and that is
    legitimate.
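
The scenario in this exchange can be traced with a small userspace model. This is only a sketch, not the scheduler code: `model_entity`, `model_push_job` and `model_kill_remove` are hypothetical stand-ins, and it assumes that 'first' is true only when the job queue was empty before the push.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the parts of an entity that matter here. */
struct model_entity {
    int queued_jobs;  /* jobs waiting in the entity's queue */
    bool on_rq;       /* whether the scheduler can see the entity */
    int wakeups;      /* scheduler wakeups triggered so far */
};

/* Push path: only a push into an empty queue re-adds the entity to the
 * run queue and wakes the scheduler (assumption, see lead-in). */
static bool model_push_job(struct model_entity *e)
{
    bool first = (e->queued_jobs == 0);

    e->queued_jobs++;
    if (first) {
        e->on_rq = true;
        e->wakeups++;
    }
    return first;
}

/* Dying process: removes the entity although jobs are still queued,
 * as when the wait terminates prematurely. */
static void model_kill_remove(struct model_entity *e)
{
    e->on_rq = false;
}

/* Returns true if process A's job ends up queued on an entity that is
 * no longer on the run queue and for which no new wakeup happened. */
bool push_after_kill_is_stranded(void)
{
    struct model_entity e = { .queued_jobs = 0, .on_rq = false, .wakeups = 0 };

    model_push_job(&e);               /* process B submits, first == true */
    assert(e.on_rq);                  /* entity is visible to the scheduler */
    model_kill_remove(&e);            /* B dies, its job is still queued */
    bool first = model_push_job(&e);  /* process A submits afterwards */

    return !first && e.queued_jobs == 2 && !e.on_rq && e.wakeups == 1;
}
```

In this model the second push is simply never scheduled, which matches the outcome described above: the entity is treated as killed together with the dying process.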

    
      Another issue below - 

      
      Andrey

      
      On 08/14/2018 03:05 AM, Christian König wrote:

      
        I would rather avoid taking the lock in the hot path.

          
          How about this:

          
              /* For a killed process, disable any further IB enqueue right now */
              last_user = cmpxchg(&entity->last_user, current->group_leader, NULL);
              if ((!last_user || last_user == current->group_leader) &&
                  (current->flags & PF_EXITING) &&
                  (current->exit_code == SIGKILL)) {
                  grab_lock();
                  drm_sched_rq_remove_entity(entity->rq, entity);
                  if (READ_ONCE(entity->last_user) != NULL)

        
      This condition is true because process A has just done
      drm_sched_entity_push_job->WRITE_ONCE(entity->last_user,
      current->group_leader);
      so the line below is executed and the entity is reinserted into
      the rq. Let's also say that the entity job queue is empty now. For
      process A bool 'first' will be true, and hence
      drm_sched_entity_push_job->drm_sched_rq_add_entity(entity->rq,
      entity) will also take place, causing double insertion of the
      entity into the rq list.

    
    Calling drm_sched_rq_add_entity() is harmless; it is protected
    against double insertion.
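
A minimal userspace sketch of such a guard, assuming it is a list_empty() test on the entity's list node like the `if (list_empty(&entity->list))` line quoted near the top of this mail. The list helpers and the `model_*` names are hypothetical stand-ins for the kernel's `list_head` API and the scheduler structures, and locking is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal circular doubly linked list, modelled on the kernel's list_head. */
struct list_head {
    struct list_head *next, *prev;
};

static void init_list_head(struct list_head *h)
{
    h->next = h;
    h->prev = h;
}

/* A self-linked node is "empty", i.e. not on any list. */
static bool list_is_empty(const struct list_head *h)
{
    return h->next == h;
}

static void list_add_tail_node(struct list_head *node, struct list_head *head)
{
    node->prev = head->prev;
    node->next = head;
    head->prev->next = node;
    head->prev = node;
}

static void list_del_init_node(struct list_head *node)
{
    node->prev->next = node->next;
    node->next->prev = node->prev;
    init_list_head(node);
}

/* Hypothetical stand-ins for the run queue and the entity. */
struct model_rq { struct list_head entities; };
struct model_entity { struct list_head list; };

/* Adding is a no-op when the entity is already linked into a run queue. */
static void model_rq_add_entity(struct model_rq *rq, struct model_entity *e)
{
    if (!list_is_empty(&e->list))
        return;
    list_add_tail_node(&e->list, &rq->entities);
}

static void model_rq_remove_entity(struct model_entity *e)
{
    if (list_is_empty(&e->list))
        return;
    list_del_init_node(&e->list);
}

static size_t model_rq_count(const struct model_rq *rq)
{
    size_t n = 0;

    for (const struct list_head *p = rq->entities.next; p != &rq->entities; p = p->next)
        n++;
    return n;
}

/* Adds the same entity twice and returns how many entries the rq holds. */
size_t double_add_demo(void)
{
    struct model_rq rq;
    struct model_entity e;

    init_list_head(&rq.entities);
    init_list_head(&e.list);
    model_rq_add_entity(&rq, &e);
    model_rq_add_entity(&rq, &e);      /* second call hits the guard */
    size_t after_add = model_rq_count(&rq);

    model_rq_remove_entity(&e);
    assert(model_rq_count(&rq) == 0);  /* a single removal empties the rq */
    return after_add;
}
```

The guard only works if removal re-initialises the node (as `list_del_init_node` does here), so that a removed entity looks "empty" again and can be re-added later.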

    
    But thinking more about it, your idea of adding a killed or finished
    flag becomes more and more appealing as a way to get consistent
    handling here.

    
    Christian.
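
One possible shape of the flag idea, as a userspace sketch only. The thread does not specify a design, so the `stopped` field and both function names are hypothetical, and the sketch assumes that the flag and the run queue membership would be changed under one lock, which is omitted here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of an entity with a "killed or finished" flag. */
struct model_entity {
    bool stopped;     /* hypothetical flag, set once and never cleared */
    bool on_rq;       /* whether the entity is on the run queue */
    int queued_jobs;
};

/* Kill path: mark the entity and take it off the run queue together
 * (assumed to happen under the same lock in real code). */
static void model_entity_kill(struct model_entity *e)
{
    e->stopped = true;
    e->on_rq = false;
}

/* Push path: refuses new jobs once the entity is stopped, so a late
 * pusher gets a consistent failure instead of a silently stuck job. */
static bool model_push_job(struct model_entity *e)
{
    if (e->stopped)
        return false;

    bool first = (e->queued_jobs == 0);

    e->queued_jobs++;
    if (first)
        e->on_rq = true;
    return true;
}

/* Returns 0 if a push after the kill is rejected and changes nothing. */
int push_after_kill_demo(void)
{
    struct model_entity e = {0};
    bool ok = model_push_job(&e);   /* accepted while the entity is alive */

    assert(ok && e.on_rq);
    model_entity_kill(&e);
    ok = model_push_job(&e);        /* rejected after the kill */

    return (!ok && !e.on_rq && e.queued_jobs == 1) ? 0 : 1;
}
```

With an explicit flag the push path no longer has to infer the entity state from last_user or from the queue being empty.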

    
      Andrey

      
                      drm_sched_rq_add_entity(entity->rq, entity);
                  drop_lock();
              }

           
          Christian.

          
          On 13.08.2018 at 18:43, Andrey Grodzovsky wrote:

        
          Attached.
          If the general idea in the patch is OK, I can think of a test
          (and maybe add it to the libdrm amdgpu tests) to actually
          simulate this scenario with 2 forked concurrent processes
          working on the same entity's job queue, where one is dying
          while the other keeps pushing to the same queue. For now I
          only tested it with a normal boot and running multiple
          glxgears concurrently, which doesn't really test this code
          path since I think each of them works on its own FD.

          
          Andrey

          
          On 08/10/2018 09:27 AM, Christian König wrote:

          
            Crap, yeah indeed that needs to
              be protected by some lock.

              
              Going to prepare a patch for that,

              Christian.

              
              On 09.08.2018 at 21:49, Andrey Grodzovsky wrote:

            
              Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@xxxxxxx>
              

              But I still  have questions about entity->last_user
                (didn't notice this before) - 

              
              It looks to me like there is a race condition in its
              current usage. Let's say process A was preempted after
              doing drm_sched_entity_flush->cmpxchg(...). Now process B,
              working on the same entity (forked), is inside
              drm_sched_entity_push_job; it records itself in
              entity->last_user and also executes
              drm_sched_rq_add_entity. Now process A runs again and
              executes drm_sched_rq_remove_entity, inadvertently
              removing process B's entity from its scheduler rq.
              It looks to me like we should instead lock
              entity->last_user accesses together with adds/removals of
              the entity to/from the rq.
              Andrey
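
The interleaving described above can be replayed deterministically in a single thread. The sketch below is a userspace model with hypothetical names (`flush_cmpxchg`, `flush_remove`, `push_job`, small integer IDs instead of task pointers); it is not the scheduler code. The serialized variant models what holding one lock across both flush steps would achieve, namely that the push can only run entirely before or entirely after them.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

enum { NOBODY = 0, PROC_A = 1, PROC_B = 2 };

/* Hypothetical model of the entity state involved in the race. */
struct model_entity {
    atomic_int last_user;  /* stands in for entity->last_user */
    bool on_rq;            /* stands in for membership in the rq list */
    int queued_jobs;
};

/* First flush step: cmpxchg(last_user, self, NULL); returns the old value. */
static int flush_cmpxchg(struct model_entity *e, int self)
{
    int expected = self;

    atomic_compare_exchange_strong(&e->last_user, &expected, NOBODY);
    return expected;  /* unchanged on success, the observed value on failure */
}

/* Second flush step: the killed process removes the entity from the rq. */
static void flush_remove(struct model_entity *e, int old_user, int self, bool killed)
{
    if ((old_user == NOBODY || old_user == self) && killed)
        e->on_rq = false;
}

/* Push path: record the pusher and add the entity if the queue was empty. */
static void push_job(struct model_entity *e, int self)
{
    bool first = (e->queued_jobs == 0);

    atomic_store(&e->last_user, self);
    e->queued_jobs++;
    if (first)
        e->on_rq = true;
}

/* A job is stalled if it is queued on an entity that is not on the rq. */
static bool stalled(const struct model_entity *e)
{
    return e->queued_jobs > 0 && !e->on_rq;
}

/* Unlocked: B's push lands between A's two flush steps. */
bool replay_racy_interleaving(void)
{
    struct model_entity e = { .on_rq = false, .queued_jobs = 0 };

    atomic_init(&e.last_user, PROC_A);
    int old = flush_cmpxchg(&e, PROC_A);  /* A runs, then is preempted */
    assert(old == PROC_A);                /* the cmpxchg succeeded before B ran */
    push_job(&e, PROC_B);                 /* B adds the entity to the rq */
    flush_remove(&e, old, PROC_A, true);  /* A resumes and removes it again */
    return stalled(&e);
}

/* Serialized: both flush steps run back to back, in either order with the push. */
bool replay_serialized(bool flush_first)
{
    struct model_entity e = { .on_rq = false, .queued_jobs = 0 };

    atomic_init(&e.last_user, PROC_A);
    if (flush_first) {
        flush_remove(&e, flush_cmpxchg(&e, PROC_A), PROC_A, true);
        push_job(&e, PROC_B);
    } else {
        push_job(&e, PROC_B);
        flush_remove(&e, flush_cmpxchg(&e, PROC_A), PROC_A, true);
    }
    return stalled(&e);
}
```

In the racy replay process B's job is left on an entity that is off the rq; in both serialized orders the entity stays on the rq, because either the push re-adds it after the removal or the cmpxchg fails and no removal happens.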

              
              On 08/06/2018 10:18 AM, Nayan Deshmukh wrote:

              
                    I forgot about this since we started discussing
                      possible scenarios of processes and threads.

                      
                    In any case, this check is redundant. Acked-by:
                    Nayan Deshmukh <nayan26deshmukh@xxxxxxxxx>

                    
                  Nayan

                
                  On Mon, Aug 6, 2018 at 7:43 PM Christian König <ckoenig.leichtzumerken@xxxxxxxxx> wrote:

                  
                  Ping.
                    Any objections to that?

                    
                    Christian.

                    
                    On 03.08.2018 at 13:08, Christian König wrote:

                    > That is superfluous now.
                    >
                    > Signed-off-by: Christian König <christian.koenig@xxxxxxx>
                    > ---
                    >   drivers/gpu/drm/scheduler/gpu_scheduler.c | 5 -----
                    >   1 file changed, 5 deletions(-)
                    >
                    > diff --git a/drivers/gpu/drm/scheduler/gpu_scheduler.c b/drivers/gpu/drm/scheduler/gpu_scheduler.c
                    > index 85908c7f913e..65078dd3c82c 100644
                    > --- a/drivers/gpu/drm/scheduler/gpu_scheduler.c
                    > +++ b/drivers/gpu/drm/scheduler/gpu_scheduler.c
                    > @@ -590,11 +590,6 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job,
                    >       if (first) {
                    >               /* Add the entity to the run queue */
                    >               spin_lock(&entity->rq_lock);
                    > -             if (!entity->rq) {
                    > -                     DRM_ERROR("Trying to push to a killed entity\n");
                    > -                     spin_unlock(&entity->rq_lock);
                    > -                     return;
                    > -             }
                    >               drm_sched_rq_add_entity(entity->rq, entity);
                    >               spin_unlock(&entity->rq_lock);
                    >               drm_sched_wakeup(entity->rq->sched);

                    
          _______________________________________________
dri-devel mailing list
dri-devel@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/dri-devel

        