Re: [RFC] a new approach to detect which ring is the real black sheep upon TDR reported

Christian König <christian.koenig@xxxxxxx> · Fri, 26 Feb 2021 08:57:52 +0100



    Hi Monk,

    
    in general an interesting idea, but I see two major problems with
    that:

    
    1. It would make the reset take much longer.

    
    2. Things often get stuck because of timing issues, so a guilty job
    might pass perfectly when run a second time.
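
    To put a rough number on the first point, here is a back-of-the-envelope
    sketch. It assumes (my assumption, not something stated in the proposal)
    that a hung head job is only detected when the normal job timeout expires
    again, and it uses the 2 second timeout and the two rings (gfx and
    compute1) from the quoted scenario. The function name is made up for
    illustration.

```python
def worst_case_extra_reset_time(num_rings, job_timeout_s):
    """Upper bound on the delay added to a reset if each ring's head job is
    re-run serially and a hang is only noticed when the timeout expires.
    Assumption: one full timeout period per ring in the worst case."""
    return num_rings * job_timeout_s


# Two rings and a 2 s timeout, as in the quoted scenario.
extra = worst_case_extra_reset_time(num_rings=2, job_timeout_s=2.0)
```

    With more rings involved, the bound grows linearly under this assumption.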

    
    Apart from that, the whole ring mirror list turned out to be a really
    bad idea. E.g. we still struggle with object lifetime because the
    concept doesn't fit into the object model of the GPU scheduler under
    Linux.

    
    We should probably work on this separately and straighten up the job
    destruction once more and keep the recovery information in the fence
    instead.

    
    Regards,

    Christian.

    
    On 26.02.21 at 06:58, Liu, Monk wrote:

    
      [AMD Public Use]
      

        Hi all
         
        The NAVI2X project has hit an issue that is really hard to solve,
        and it turns out to be a general headache of our TDR mechanism.
        Consider the scenario below:
         
        
          1. There is a job1 running on the compute1 ring at timestamp
          2. There is a job2 running on the gfx ring at timestamp
          3. Job1 is the guilty one, and job1/job2 were scheduled to their
             rings at almost the same timestamp.
          4. After 2 seconds we receive two TDR reports, one from the GFX
             ring and one from the compute ring.
          5. The current scheme is that in the drm scheduler the head jobs
             of both rings are considered “bad jobs” and are taken away
             from the mirror list.
          6. The result is that both the real guilty job (job1) and the
             innocent job (job2) are deleted from the mirror list, and their
             corresponding contexts are also treated as guilty (so there is
             no guarantee that the innocent process keeps running).
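
        The outcome of this scenario can be illustrated with a toy model
        (plain Python, not kernel code). The `Job` class, the `mirror_lists`
        dictionary, the context labels and the function name are all
        hypothetical names made up for this sketch; only the behaviour
        (every ring that timed out loses its head job, and that job's
        context is marked guilty) is taken from the description above.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    context: str  # owning process/context (hypothetical label)


def current_tdr_handling(mirror_lists, timed_out_rings):
    """Toy model of the current scheme: every ring that reported a timeout
    loses its head job, and that job's context is marked guilty."""
    guilty_contexts = set()
    for ring in timed_out_rings:
        pending = mirror_lists[ring]
        if pending:
            head = pending.pop(0)  # head job is removed from the mirror list
            guilty_contexts.add(head.context)
    return guilty_contexts


mirror_lists = {
    "compute1": [Job("job1", "ctx_a")],  # job1 is the real offender
    "gfx": [Job("job2", "ctx_b")],       # job2 is innocent
}
guilty = current_tdr_handling(mirror_lists, ["compute1", "gfx"])
```

        In this model both contexts end up in `guilty`, even though only
        job1 caused the hang.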
        
         
        Ideally, however, the TDR mechanism would detect which ring is the
        guilty one, and the innocent ring would resubmit all of its pending
        jobs:
        
          1. Job1 is deleted from the compute1 ring’s mirror list.
          2. Job2 is kept and resubmitted later, and the process/context it
             belongs to is not even aware of this TDR.
            
        
        Here is a proposal that aims to achieve the above goal. The rough
        procedure is:
        
          1. Once any ring reports a TDR, the head job is *not* treated as
             a “bad job”, and it is *not* deleted from the mirror list in
             the drm sched functions.
          2. In the vendor’s function (our amdgpu driver here):
             - Reset the GPU.
             - Repeat the actions below on each ring, *one by one*:
               1. Take the head job and submit it on this ring.
               2. See if it completes; if not, this job is the real
                  “bad job”.
               3. Take it away from the mirror list if this head job is
                  a “bad job”.
             - After the above iteration over all rings, all bad job(s)
               have been cleared.
          3. Resubmit all jobs from each mirror list to their corresponding
             rings (this is the existing logic).
        
         
        The idea is to re-run and re-check the head job of each ring in a
        “serial” way, in order to single out the real black sheep and its
        guilty context.
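
        The serial re-check can be sketched as a toy model (plain Python,
        not driver code). The function name `find_bad_jobs` and the
        `runs_ok_alone` callback are hypothetical; the callback stands in
        for "submit the head job on this ring after the reset and see if it
        completes", which in a real driver would involve the hardware. The
        extra job3 is invented for the example.

```python
def find_bad_jobs(mirror_lists, runs_ok_alone):
    """Toy model of the proposed serial re-check.

    mirror_lists:  dict mapping ring name -> list of pending job names
    runs_ok_alone: hypothetical callback (ring, job) -> bool, True if the
                   job completes when re-run on its own
    """
    bad_jobs = []
    for ring, pending in mirror_lists.items():  # one ring at a time
        if not pending:
            continue
        head = pending[0]
        if not runs_ok_alone(ring, head):
            bad_jobs.append(pending.pop(0))  # drop only the real bad job
    return bad_jobs


mirror_lists = {"compute1": ["job1", "job3"], "gfx": ["job2"]}
# Assumption for the example: only job1 hangs when re-run on its own.
bad = find_bad_jobs(mirror_lists, lambda ring, job: job != "job1")
# Whatever remains in each mirror list would then be resubmitted as before.
```

        In this model job2 stays in the gfx mirror list, so its context is
        never marked guilty.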
         
        P.S.: we can use this approach only when a GFX/KCQ ring reports a
        TDR, since those rings mutually affect each other. An SDMA ring
        timeout definitely proves that the head job on the SDMA ring is
        really guilty.
         
        Thanks 
         
        ------------------------------------------
        Monk Liu | Cloud-GPU Core team
        ------------------------------------------
         
      
_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/amd-gfx