On 08/14/2018 11:26 AM, Christian König wrote:
On 14.08.2018 17:17, Andrey Grodzovsky wrote:
I assume that this is the only code change and no locks are
taken in drm_sched_entity_push_job -

What are you talking about? You surely now take locks in
drm_sched_entity_push_job():
+ spin_lock(&entity->rq_lock);
+ entity->last_user = current->group_leader;
+ if (list_empty(&entity->list))
Oh, so your code in drm_sched_entity_flush still relies on my code
in drm_sched_entity_push_job, OK.
What happens if process A runs drm_sched_entity_push_job
after this code was executed by the (dying) process B while
there are still jobs in the queue (the wait_event terminated
prematurely)? The entity was already removed from the rq,
but bool 'first' in drm_sched_entity_push_job will be false,
so the entity will not be reinserted into the rq entity list
and no wakeup will be triggered for the new job process A is
pushing.
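To make the interleaving concrete - a sketch abridged from the
4.18-era scheduler sources, not quoted verbatim:

    /* Process B (dying), in drm_sched_entity_flush(): */
    wait_event_timeout(sched->job_scheduled,
                       drm_sched_entity_is_idle(entity),
                       timeout);                /* returns early, jobs remain */
    drm_sched_rq_remove_entity(entity->rq, entity); /* entity off the rq */

    /* Process A, afterwards, in drm_sched_entity_push_job(): */
    first = spsc_queue_push(&entity->job_queue, &sched_job->queue_node);
    if (first) {    /* false - B's leftover jobs mean the queue was not empty */
            drm_sched_rq_add_entity(entity->rq, entity);
            drm_sched_wakeup(entity->rq->sched); /* never reached: no wakeup */
    }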
Thought about this as well, but in this case I would say: Shit
happens!
The dying process did some command submission and because of this
the entity was killed as well when the process died and that is
legitimate.
Another issue below -
Andrey
On 08/14/2018 03:05 AM, Christian König wrote:
I would rather like to avoid taking the lock in the hot path.

How about this:

    /* For killed process disable any more IBs enqueue right now */
    last_user = cmpxchg(&entity->last_user, current->group_leader, NULL);
    if ((!last_user || last_user == current->group_leader) &&
        (current->flags & PF_EXITING) &&
        (current->exit_code == SIGKILL)) {
            grab_lock();
            drm_sched_rq_remove_entity(entity->rq, entity);
            if (READ_ONCE(entity->last_user) != NULL)
This condition is true because just now process A did
drm_sched_entity_push_job->WRITE_ONCE(entity->last_user,
current->group_leader); and so the line below executed and the
entity was reinserted into the rq. Let's say also that the
entity's job queue is empty now. For process A bool 'first' will
be true, and hence
drm_sched_entity_push_job->drm_sched_rq_add_entity(entity->rq,
entity) will also take place, causing a double insertion of the
entity queue into the rq list.
Calling drm_sched_rq_add_entity() is harmless, it is protected
against double insertion.
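For reference, the guard looks roughly like this in the 4.18-era
sources (abridged):

    static void drm_sched_rq_add_entity(struct drm_sched_rq *rq,
                                        struct drm_sched_entity *entity)
    {
            if (!list_empty(&entity->list))
                    return; /* already on the rq, a second insert is a no-op */
            spin_lock(&rq->lock);
            list_add_tail(&entity->list, &rq->entities);
            spin_unlock(&rq->lock);
    }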
Missed that one, right...
But thinking more about it, your idea of adding a killed or
finished flag becomes more and more appealing as a way to get
consistent handling here.
Christian.
So to be clear - you would like something like: remove
entity->last_user, add a 'stopped' flag to drm_sched_entity that
is set in drm_sched_entity_flush, check for 'if (entity->stopped)'
in drm_sched_entity_push_job, and when it is true just return some
error back to the user instead of pushing the job?
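As a rough sketch only - the locking shown and the -ENOENT error
code are placeholders for illustration, not a patch:

    /* in drm_sched_entity_flush(), when the owner was killed: */
    spin_lock(&entity->rq_lock);
    entity->stopped = true;             /* new flag in drm_sched_entity */
    drm_sched_rq_remove_entity(entity->rq, entity);
    spin_unlock(&entity->rq_lock);

    /* in drm_sched_entity_push_job(): */
    spin_lock(&entity->rq_lock);
    if (entity->stopped) {
            spin_unlock(&entity->rq_lock);
            return -ENOENT;             /* reject instead of queueing */
    }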
Andrey
                    drm_sched_rq_add_entity(entity->rq, entity);
            drop_lock();
    }
Christian.
On 13.08.2018 18:43, Andrey Grodzovsky wrote:
Attached.
If the general idea in the patch is OK I can think of a test
(and maybe add it to the libdrm amdgpu tests) to actually
simulate this scenario: two forked concurrent processes working
on the same entity's job queue, one dying while the other keeps
pushing to the same queue. For now I only tested it with a
normal boot and running multiple glxgears concurrently - which
doesn't really exercise this code path, since I think each of
them works on its own FD.
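The skeleton of such a test could look like this - push_job() is
a hypothetical stand-in for the real amdgpu CS submission on a
context created before the fork, not a libdrm function:

    #include <signal.h>
    #include <sys/wait.h>
    #include <unistd.h>

    extern void push_job(void); /* hypothetical: submit to the shared entity */

    int main(void)
    {
            /* open the DRM fd and create the context here, before
             * fork(), so both processes push to the same entity */
            pid_t child = fork();

            if (child == 0) {
                    for (int i = 0; i < 16; i++)
                            push_job();
                    raise(SIGKILL); /* die mid-stream, triggering
                                     * drm_sched_entity_flush() */
            }

            for (int i = 0; i < 1024; i++)
                    push_job();     /* keep pushing while the child dies */

            waitpid(child, NULL, 0);
            return 0;
    }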
Andrey
On 08/10/2018 09:27 AM, Christian König wrote:
Crap, yeah indeed that needs
to be protected by some lock.
Going to prepare a patch for that,
Christian.
On 09.08.2018 21:49, Andrey Grodzovsky wrote:
Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@xxxxxxx>
But I still have questions about entity->last_user (didn't
notice this before) -

Looks to me like there is a race condition in its current
usage. Let's say process A was preempted after doing
drm_sched_entity_flush->cmpxchg(...); now process B, working on
the same entity (forked), is inside drm_sched_entity_push_job.
It writes its PID to entity->last_user and also executes
drm_sched_rq_add_entity. Now process A runs again and executes
drm_sched_rq_remove_entity, inadvertently removing process B's
entity from its scheduler rq.

Looks to me like instead we should lock entity->last_user
accesses together with additions/removals of the entity to/from
the rq.
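That is, something along these lines in drm_sched_entity_push_job()
- a sketch only, reusing entity->rq_lock (essentially the hunk
quoted at the top of this thread):

    spin_lock(&entity->rq_lock);
    entity->last_user = current->group_leader; /* atomic with rq state */
    if (list_empty(&entity->list))
            drm_sched_rq_add_entity(entity->rq, entity);
    spin_unlock(&entity->rq_lock);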
Andrey
On 08/06/2018 10:18 AM, Nayan Deshmukh wrote:
I forgot about this since we started
discussing possible scenarios of processes and
threads.
In any case, this check is redundant. Acked-by:
Nayan Deshmukh <nayan26deshmukh@xxxxxxxxx>
Nayan
Ping.
Any objections to that?
Christian.
On 03.08.2018 13:08, Christian König wrote:
> That is superfluous now.
>
> Signed-off-by: Christian König <christian.koenig@xxxxxxx>
> ---
>  drivers/gpu/drm/scheduler/gpu_scheduler.c | 5 -----
>  1 file changed, 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/scheduler/gpu_scheduler.c b/drivers/gpu/drm/scheduler/gpu_scheduler.c
> index 85908c7f913e..65078dd3c82c 100644
> --- a/drivers/gpu/drm/scheduler/gpu_scheduler.c
> +++ b/drivers/gpu/drm/scheduler/gpu_scheduler.c
> @@ -590,11 +590,6 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job,
>  	if (first) {
>  		/* Add the entity to the run queue */
>  		spin_lock(&entity->rq_lock);
> -		if (!entity->rq) {
> -			DRM_ERROR("Trying to push to a killed entity\n");
> -			spin_unlock(&entity->rq_lock);
> -			return;
> -		}
>  		drm_sched_rq_add_entity(entity->rq, entity);
>  		spin_unlock(&entity->rq_lock);
>  		drm_sched_wakeup(entity->rq->sched);
_______________________________________________
dri-devel mailing list
dri-devel@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/dri-devel