Re: [PATCH v2] drm/amdgpu: always use legacy tlb flush on cyan_skilfish

Felix Kuehling <felix.kuehling@xxxxxxx> · Fri, 15 Sep 2023 10:49:41 -0400



    On 2023-09-15 6:19, Christian König
      wrote:

    
    Am
      15.09.23 um 10:53 schrieb Lang Yu:
      

      On 09/14/ , Felix Kuehling wrote:
        

        On 2023-09-14 10:02, Christian König
          wrote:
          

        Do we still need to use legacy flush to emulate heavyweight
        flush
        

        if we don't use SVM? And can I push this now?
        

      Felix needs to decide that. From what I understand the KFD needs
      heavyweight flushes for secure SVM operation.
      

    Yes. We need to be able to guarantee to the kernel, that the GPU
      will not access unmapped memory. There are two strategies in the
      driver to do this:
    
      Preempt GPU queues (which implies a heavy-weight TLB flush in
        the scheduler firmware)
      Invalidate page table entries and flush TLBs

      
    #1 happens during MMU notifiers with XNACK off. #2 happens in MMU
      notifiers with XNACK on (not supported on GFX10.x) and when
      unified memory us munmapped. It's that last part I'm worried
      about. When memory is munmapped and given back to the OS, we need
      to be able to guarantee that the GPU won't access it any more. The
      same is true when GTT BOs and userptr BOs are freed. After
      unmapping them from the GPU page tables, we need a heavy-weight
      flush. I believe the same should apply to the graphics driver, but
      maybe that's implied through the CS and fence mechanisms that keep
      memory allocated while the GPU is accessing it.

    
    A legacy flush has a slim chance of not being sufficient because
      memory accesses using old addresses can still be in flight in the
      GPU.

    
      If heavyweight flushes are buggy papering over that by using
      legacy flushes is only a mediocre workaround.
      

    I agree. I'd like to avoid half-baked workarounds that will cause
      more headaches later on. I started an internal email thread with
      Tony to understand the requirements for heavy-weight flushes on
      the affected GPUs and find a better workaround.
    Regards,

        Felix

    
      Regards,
      

      Christian.
      

        Regards,
        

        Lang
        

          Am 14.09.23 um 15:59 schrieb Felix
            Kuehling:
            

            On 2023-09-14 9:39, Christian König
              wrote:
              

              Is a single legacy flush
                sufficient to emulate an heavyweight
                

                flush as well?
                

                On previous generations we needed to issue at least two
                legacy
                

                flushes for this.
                

              I assume you are referring to the Vega20 XGMI workaround.
              That is a
              

              very different issue. Because PTEs would be cached in L2,
              we had to
              

              always use a heavy-weight flush that would also flush the
              L2 cache
              

              as well, and follow that with another legacy flush to deal
              with race
              

              conditions where stale PTEs could be re-fetched from L2
              before the
              

              L2 flush was complete.
              

            No, we also have another (badly documented) workaround which
            issues a
            

            legacy flush before each heavy weight on some hw
            generations. See the my
            

            TLB flush cleanup patches.
            

            A heavy-weight flush guarantees that
              there are no more possible
              

              memory accesses using the old PTEs. With physically
              addressed caches
              

              on GFXv9 that includes a cache flush because the address
              translation
              

              happened before putting data into the cache. I think the
              address
              

              translation and cache architecture works differently on
              GFXv10. So
              

              maybe the cache-flush is not required here.
              

              But even then a legacy flush probably allows for in-flight
              memory
              

              accesses with old physical addresses to complete after the
              TLB
              

              flush. So there is a small risk of memory corruption that
              was
              

              assumed to not be accessed by the GPU any more. Or when
              using IOMMU
              

              device isolation it would result in IOMMU faults if the
              DMA mappings
              

              are invalidated slightly too early.
              

            Mhm, that's quite bad. Any idea how to avoid that?
            

          A few ideas
          

            * Add an arbitrary delay and hope that it is longer than the
          FIFOs in
          

              the HW
          

            * Execute an atomic operation to memory on some GPU engine
          that could
          

              act as a fence, maybe just a RELEASE_MEM on the CP to some
          writeback
          

              location would do the job
          

            * If needed, RELEASE_MEM could also perform a cache flush
          

          Regards,
          

             Felix
          

          Regards,
            

            Christian.
            

            Regards,
              

                 Felix
              

              And please don't push before
                getting an rb from Felix as well.
                

                Regards,
                

                Christian.
                

                Am 14.09.23 um 11:23 schrieb Lang Yu:
                

                cyan_skilfish has problems with
                  other flush types.
                  

                  v2: fix incorrect ternary conditional operator
                  usage.(Yifan)
                  

                  Signed-off-by: Lang Yu <Lang.Yu@xxxxxxx>
                  

                  Cc: <stable@xxxxxxxxxxxxxxx> # v5.15+
                  

                  ---
                  

                     drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c | 7 ++++++-
                  

                     1 file changed, 6 insertions(+), 1 deletion(-)
                  

                  diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
                  

                  b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
                  

                  index d3da13f4c80e..c6d11047169a 100644
                  

                  --- a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
                  

                  +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
                  

                  @@ -236,7 +236,8 @@ static void
                  

                  gmc_v10_0_flush_vm_hub(struct amdgpu_device *adev,
                  uint32_t
                  

                  vmid,
                  

                     {
                  

                         bool use_semaphore =
                  

                  gmc_v10_0_use_invalidate_semaphore(adev, vmhub);
                  

                         struct amdgpu_vmhub *hub =
                  &adev->vmhub[vmhub];
                  

                  -    u32 inv_req =
                  

                  hub->vmhub_funcs->get_invalidate_req(vmid,
                  flush_type);
                  

                  +    u32 inv_req =
                  hub->vmhub_funcs->get_invalidate_req(vmid,
                  

                  +              (adev->asic_type !=
                  CHIP_CYAN_SKILLFISH) ?
                  

                  flush_type : 0);
                  

                         u32 tmp;
                  

                         /* Use register 17 for GART */
                  

                         const unsigned int eng = 17;
                  

                  @@ -331,6 +332,8 @@ static void
                  

                  gmc_v10_0_flush_gpu_tlb(struct amdgpu_device *adev,
                  uint32_t
                  

                  vmid,
                  

                           int r;
                  

                     +    flush_type = (adev->asic_type !=
                  CHIP_CYAN_SKILLFISH)
                  

                  ? flush_type : 0;
                  

                  +
                  

                         /* flush hdp cache */
                  

                         adev->hdp.funcs->flush_hdp(adev, NULL);
                  

                     @@ -426,6 +429,8 @@ static int
                  

                  gmc_v10_0_flush_gpu_tlb_pasid(struct amdgpu_device
                  *adev,
                  

                         struct amdgpu_ring *ring =
                  &adev->gfx.kiq[0].ring;
                  

                         struct amdgpu_kiq *kiq =
                  &adev->gfx.kiq[0];
                  

                     +    flush_type = (adev->asic_type !=
                  CHIP_CYAN_SKILLFISH)
                  

                  ? flush_type : 0;
                  

                  +
                  

                         if (amdgpu_emu_mode == 0 &&
                  ring->sched.ready) {
                  

                  spin_lock(&adev->gfx.kiq[0].ring_lock);
                  

                             /* 2 dwords flush + 8 dwords fence */