Re: 回复: [RFC] a new approach to detect which ring is the real black sheep upon TDR reported

Andrey Grodzovsky <andrey.grodzovsky@xxxxxxx> · Sat, 27 Feb 2021 19:55:18 -0500



    On 2021-02-26 10:56 p.m., Liu, Monk
      wrote:

    
        [AMD Official Use Only - Internal Distribution Only]

      
          H
              Andrey
           
          The
              scenario I hit here is not the one you mentioned, let me
              explain it with more details by another much easier
              understood example:
           
          Consider
              ring you have a job1 on KCQ, but the timeout of KCQ is 60
              seconds (just for example)
          You
              also have a job2 on GFX ring, and the timeout of GFX is 2
              seconds
           
          We
              submit job1 first, and assume job1 have bug and it will
              cause shader hang very very soon
              
          After
              10 seconds we submit job2, since KCQ have 60 seconds to
              report TDR thus SW know nothing about the engine already
              hang
        
      
    So gfx job hangs because it has a dependency on buggy compute job
      which already is hanging ?
    

          After
              2 seconds we got TDR report from job2 on GFX ring,
              sched_job_timeout() think the leading job of GFX ring is
              the black sheep so it is deleted from the mirror list
          But
              in fact this job1 is innocent, and we should insert it
              back after recovery , and due to it was already deleted
              this innocent job’s context/process is really harmed
        
      
    I am still missing something - we don't ever delete bad jobs or
      any jobs until they are signaled, we reinsert the bad  job back
      into mirror list in drm_sched_stop 

      (here -
https://elixir.bootlin.com/linux/v5.11.1/source/drivers/gpu/drm/scheduler/sched_main.c#L385)
      after sched thread is stopped and continue with the reset
      procedure.
    Andrey
    

          Hope
              above example helps
           
          Thanks
              
           
              发件人: Grodzovsky,
                  Andrey <Andrey.Grodzovsky@xxxxxxx>
                  

                发送时间: 2021年2月27日 0:50

                  收件人: Liu, Monk <Monk.Liu@xxxxxxx>;
                    Koenig, Christian <Christian.Koenig@xxxxxxx>;
                    amd-gfx@xxxxxxxxxxxxxxxxxxxxx

                  抄送: Zhang, Andy
                    <Andy.Zhang@xxxxxxx>; Chen, Horace
                    <Horace.Chen@xxxxxxx>; Zhang, Jack (Jian)
                    <Jack.Zhang1@xxxxxxx>

                  主题: Re: [RFC] a new approach to detect
                    which ring is the real black sheep upon TDR reported
            
          
            On 2021-02-26 6:54
                a.m., Liu, Monk wrote:
          
          
            [AMD Official Use Only - Internal
                Distribution Only]
             
            See in line
             
            
              Thanks 
               
              ------------------------------------------
              Monk Liu |
                  Cloud-GPU Core team
              ------------------------------------------
            
             
                From: Koenig, Christian
                    <Christian.Koenig@xxxxxxx>
                    

                    Sent: Friday, February 26, 2021 3:58 PM

                    To: Liu, Monk <Monk.Liu@xxxxxxx>;
                    
                      amd-gfx@xxxxxxxxxxxxxxxxxxxxx

                    Cc: Zhang, Andy <Andy.Zhang@xxxxxxx>;
                    Chen, Horace
                    <Horace.Chen@xxxxxxx>;
                    Zhang, Jack (Jian)
                    <Jack.Zhang1@xxxxxxx>

                    Subject: Re: [RFC] a new approach to detect
                    which ring is the real black sheep upon TDR reported
              
            
            Hi Monk,

                
                in general an interesting idea, but I see two major
                problems with that:

                
                1. It would make the reset take much longer.

                
                2. Things get often stuck because of timing issues, so a
                guilty job might pass perfectly when run a second time.
            [ML] but the innocent ring already reported
                a TDR, and the drm sched logic already deleted this
                “sched_job” in its mirror list, thus you don’t have
                chance to re-submit it again after reset, that’s the
                major problem here.
          
           
          Just to confirm I understand correctly,
              Monk reports a scenario where the second TDR that was
              reported by the innocent job is bailing out BEFORE having
              a chance to run  drm_sched_stop for that scheduler which
              should have reinserted the job back into mirror list
              (because the first TDR run is still in progress and hence
              amdgpu_device_lock_adev fails for the second TDR) and so
              the innocent job which was extracted from mirror list in
              drm_sched_job_timedout is now lost.

              If so and as a possible quick fix until we overhaul the
              entire design as suggested in this thread - maybe we can
              modify drm_sched_backend_ops.timedout_job callback to
              report back premature termination BEFORE drm_sched_stop
              had a chance to run and then reinsert back the job into
              mirror list from within  drm_sched_job_timedout? There is
              no problem of racing against concurrent
              drm_sched_get_cleanup_job once we reinsert there as we
              don't reference the job pointer anymore after this point
              and so if it's already signaled and freed right away -
              it's ok. 
          Andrey
           
          
                Apart from that the whole ring mirror list turned out to
                be a really bad idea. E.g. we still struggle with object
                life time because the concept doesn't fit into the
                object model of the GPU scheduler under Linux.

                
                We should probably work on this separately and
                straighten up the job destruction once more and keep the
                recovery information in the fence instead.
            [ML] we claim to our customer that no
                innocent process will be dropped or cancelled, and our
                current logic works for the most time, but only when
                there are different process running on gfx/computes
                rings then we would run into the tricky situation I
                stated here, and the proposal is the only way I can
                figure out so far, do you have a better solution or idea
                we review it as another candidate RFC ? Be note that we
                raised this proposal is because we do hit our trouble
                and we do need to resolve it …. So even a not perfect
                solution is still better than just cancel the innocent
                job (and their context/process)
            Thanks ! 
                
            
                Regards,

                Christian.
            
              Am 26.02.21 um
                  06:58 schrieb Liu, Monk:
            
            
              [AMD Public Use]
               
              Hi all
               
              NAVI2X  project
                  hit a really hard to solve issue now, and it is turned
                  out to be a general headache of our TDR mechanism ,
                  check below scenario:
               
              
                There is a job1 running on compute1
                    ring at timestamp
                    
                There is a job2 running on gfx ring at
                    timestamp
                Job1 is the guilty one, and job1/job2
                    were scheduled to their rings at almost the same
                    timestamp
                    
                After 2 seconds we receive two TDR
                    reporting from both GFX ring and compute ring
                Current scheme is that in drm
                      scheduler all the head jobs of those two rings are
                      considered “bad job” and taken away from the
                      mirror list
                    
                The result is both the real guilty job
                    (job1) and the innocent job (job2) were all deleted
                    from mirror list, and their corresponding contexts
                    were also treated as guilty (so the innocent
                      process remains running is not secured)
              
               
              But by our wish
                  the ideal case is TDR mechanism can detect which ring
                  is the guilty ring and the innocent ring can resubmits
                  all its pending jobs:
              
                Job1 to be deleted from compute1 ring’s
                    mirror list
                Job2 is kept and resubmitted later and
                    its belonging process/context are even not aware of
                    this TDR at all
                    
              
              Here I have a
                  proposal tend to achieve above goal and it rough
                  procedure is :
              
                Once any ring reports a TDR, the head
                    job is *not* treated as “bad job”, and it is
                    *not* deleted from the mirror list in drm
                    sched functions
                In vendor’s function (our amdgpu driver
                    here):
              
              
                  reset GPU
                  repeat below actions on each RINGS *
                      one by one *:
                
              
                1.
                    take the head job and submit it on this
                  ring
              
                2.
                    see if it completes, if not then this job
                  is the real “bad job”
              
                3.
                     take it away from mirror list if this
                  head job is “bad job”
              
                
                  After above iteration on all RINGS,
                      we already clears all the bad job(s)
                
              
                Resubmit all jobs from each mirror list
                    to their corresponding rings (this is the existed
                    logic)
              
               
              The idea of this
                  is to use “serial” way to re-run and re-check each
                  head job of each RING, in order to take out the real
                  black sheep and its guilty context.
               
              P.S.: we can use
                  this approaches only on GFX/KCQ ring reports TDR ,
                  since those rings are intermutually affected to each
                  other. For SDMA ring timeout it definitely proves the
                  head job on SDMA ring is really guilty.
               
              Thanks 
               
              ------------------------------------------
              Monk Liu |
                  Cloud-GPU Core team
              ------------------------------------------
               
            
            _______________________________________________
            amd-gfx mailing list
            amd-gfx@xxxxxxxxxxxxxxxxxxxxx
            https://lists.freedesktop.org/mailman/listinfo/amd-gfx
          
        
_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/amd-gfx