On 28.05.20 at 21:35, Marek Olšák wrote:
On 28.05.20 at 18:06, Marek Olšák wrote:
On 28.05.20 at 12:06, Michel Dänzer wrote:
> On 2020-05-28 11:11 a.m., Christian König wrote:
>> Well we still need implicit sync [...]
> Yeah, this isn't about "we don't want implicit sync", it's about
> "amdgpu doesn't ensure later jobs fully see the effects of previous
> implicitly synced jobs", requiring userspace to do pessimistic
> flushing.
Yes, exactly that.
For the background: we also do this flushing for explicit syncs, and
when this was implemented 2-3 years ago we first did the flushing for
implicit sync as well.
That was immediately reverted and then implemented differently because
it caused severe performance problems in some use cases.
I'm not sure about the root cause of these performance problems. My
assumption was always that we then insert too many pipeline syncs, but
Marek doesn't seem to think it could be that.
On the one hand I'm rather keen to remove the extra handling and just
always use the explicit handling for everything, because it simplifies
the kernel code quite a bit. On the other hand I don't want to run into
this performance problem again.
In addition to that, what the kernel does is a "full" pipeline sync,
i.e. we busy-wait for the full hardware pipeline to drain. That might
be overkill if you just want to do some flushing so that the next
shader sees the stuff written, but I'm not an expert on that.
Do we busy-wait on the CPU or in WAIT_REG_MEM? WAIT_REG_MEM is what
UMDs do and should be faster.
We use WAIT_REG_MEM to wait for an EOP fence value to reach memory. We
use this for a couple of things, especially to make sure that the
hardware is idle before changing the VMID-to-page-table associations.
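For illustration, a rough sketch of what such a wait looks like at the
packet level, modelled loosely on amdgpu's wait_reg_mem helpers. The
emit_dw() helper and the exact field encodings are assumptions for
illustration, not the actual kernel code:

#include <stdint.h>

/* PM4 type-3 header and the WAIT_REG_MEM opcode as in amdgpu's
 * soc15d.h; emit_dw() is a hypothetical stand-in for
 * amdgpu_ring_write(). */
#define PACKET3(op, n)  ((3u << 30) | (((op) & 0xff) << 8) | (((n) & 0x3fff) << 16))
#define PACKET3_WAIT_REG_MEM 0x3c

#define WRM_FUNC_EQUAL 3          /* wait until (*addr & mask) == ref */
#define WRM_MEM_SPACE  (1u << 4)  /* poll memory, not a register */

void emit_dw(uint32_t dw);

/* Make the CP poll the EOP fence location until it contains seq. */
static void emit_wait_eop_fence(uint64_t fence_addr, uint32_t seq)
{
        emit_dw(PACKET3(PACKET3_WAIT_REG_MEM, 5));
        emit_dw(WRM_MEM_SPACE | WRM_FUNC_EQUAL);
        emit_dw((uint32_t)fence_addr);          /* address low */
        emit_dw((uint32_t)(fence_addr >> 32));  /* address high */
        emit_dw(seq);                           /* reference value */
        emit_dw(0xffffffff);                    /* mask */
        emit_dw(4);                             /* poll interval */
}

Since the value is written by an end-of-pipe event, the wait completing
implies that everything earlier in the pipeline has drained, which is
what makes this a "full" pipeline sync.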
What about your idea of having an extra dw in the shared BOs indicating
that they are flushed?
As far as I understand it, an EOS or other event might be sufficient
for the caches as well. And you could insert the WAIT_REG_MEM directly
before the first draw using the texture and not before the whole IB.
Could be that we can optimize this even more than what we do in the
kernel.
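As a sketch of that placement idea (the cmdbuf type and the emit
helpers here are hypothetical, not an actual UMD API):

#include <stdint.h>

struct cmdbuf;

/* Hypothetical UMD-side emit helpers, for illustration only. */
void emit_draw(struct cmdbuf *cs, int draw);
void emit_wait_reg_mem(struct cmdbuf *cs, uint64_t addr, uint32_t seq);

void record_ib(struct cmdbuf *cs, uint64_t shared_fence_addr, uint32_t seq)
{
        /* Draws that don't touch the shared texture run without any wait. */
        emit_draw(cs, 0);
        emit_draw(cs, 1);

        /* The wait sits directly in front of the first draw that samples
         * the shared texture, instead of in front of the whole IB. */
        emit_wait_reg_mem(cs, shared_fence_addr, seq);
        emit_draw(cs, 2);
}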
Christian.
Adding fences into BOs would be bad, because all UMDs would have to
handle them.
Yeah, already assumed that this is the biggest blocker.
Is it possible to do this in the ring buffer:
if (fence_signalled) {
        /* dependency already satisfied, no wait needed */
        indirect_buffer(dependent_IB);
        indirect_buffer(other_IB);
} else {
        /* start the independent work first, then wait */
        indirect_buffer(other_IB);
        wait_reg_mem(fence);
        indirect_buffer(dependent_IB);
}
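A GPU-side branch like this would presumably build on something like
the CP's COND_EXEC packet, which skips a number of dwords depending on
a value in memory. A minimal sketch, assuming the layout amdgpu uses
for preemption; emit_dw() is again a hypothetical ring-write helper:

#include <stdint.h>

#define PACKET3(op, n)  ((3u << 30) | (((op) & 0xff) << 8) | (((n) & 0x3fff) << 16))
#define PACKET3_COND_EXEC 0x22

void emit_dw(uint32_t dw);

/* If the 32-bit value at cond_addr reads as zero, the CP skips the
 * next skip_dws dwords; otherwise it executes them. Exact field
 * semantics are assumed here. */
static void emit_cond_exec(uint64_t cond_addr, uint32_t skip_dws)
{
        emit_dw(PACKET3(PACKET3_COND_EXEC, 3));
        emit_dw((uint32_t)cond_addr);         /* address low */
        emit_dw((uint32_t)(cond_addr >> 32)); /* address high */
        emit_dw(0);                           /* reserved/control */
        emit_dw(skip_dws);                    /* dwords to skip if zero */
}

Note that skip-if-zero only gives one arm of the branch; expressing the
full if/else above would take more than a single packet.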
That's maybe possible, but at least not easily implementable.
Or we might have to wait for a hw scheduler.
I'm still fine doing the pipeline sync for implicit sync as well; I
just need somebody to confirm to me that this doesn't backfire in some
case.
Does the kernel sync when the driver fd is different, or when the
context is different?
Only when the driver fd is different.
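An illustrative sketch of that rule, not the actual amdgpu_sync code
(the names here are made up):

#include <stdbool.h>

/* Hypothetical stand-ins: each submission carries an "owner" cookie
 * identifying the drm fd it came from; kernel-internal fences (moves,
 * evictions) have no fd owner. */
#define OWNER_KERNEL ((void *)0)

static bool need_implicit_sync(void *fence_owner, void *submit_fd)
{
        /* Always wait for kernel-internal operations on the BO. */
        if (fence_owner == OWNER_KERNEL)
                return true;

        /* Same fd: userspace is expected to order its own submissions,
         * so no kernel-side pipeline sync is inserted. */
        return fence_owner != submit_fd;
}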
Christian.