Re: [PATCH v2] drm/amdgpu: Fix the looply call svm_range_restore_pages issue

Philip Yang <yangp@xxxxxxx> · Wed, 8 Jan 2025 20:11:39 -0500



    On 2025-01-07 22:08, Deng, Emily wrote:

    
        [AMD Official Use Only - AMD Internal Distribution Only]

      
          Hi
              Philip,
          It
              still has the deadlock, maybe the best way is trying to
              remove the delayed free pt work.
          [Wed
              Jan  8 10:35:44 2025 <    0.000000>] INFO: task
              kfdtest:5827 blocked for more than 122 seconds.
          [Wed
              Jan  8 10:35:44 2025 <    0.000290>]       Tainted:
              G           OE K   5.10.134-17.2.al8.x86_64 #1
          [Wed
              Jan  8 10:35:44 2025 <    0.000243>] "echo 0 >
              /proc/sys/kernel/hung_task_timeout_secs" disables this
              message.
          [Wed
              Jan  8 10:35:44 2025 <    0.000317>]
              task:kfdtest         state:D stack:    0 pid: 5827 ppid: 
              5756 flags:0x00004080
          [Wed
              Jan  8 10:35:44 2025 <    0.000002>] Call Trace:
          [Wed
              Jan  8 10:35:44 2025 <    0.000006>] 
              __schedule+0x1ba/0x490
          [Wed
              Jan  8 10:35:44 2025 <    0.000002>]  ?
              usleep_range+0x90/0x90
          [Wed
              Jan  8 10:35:44 2025 <    0.000002>] 
              schedule+0x46/0xb0
          [Wed
              Jan  8 10:35:44 2025 <    0.000001>] 
              schedule_timeout+0x12a/0x140
          [Wed
              Jan  8 10:35:44 2025 <    0.000003>]  ?
              __prepare_to_swait+0x4f/0x70
          [Wed
              Jan  8 10:35:44 2025 <    0.000002>] 
              __wait_for_common+0xb1/0x160
          [Wed
              Jan  8 10:35:44 2025 <    0.000004>] 
              flush_workqueue+0x12f/0x410
          [Wed
              Jan  8 10:35:44 2025 <    0.000126>] 
              svm_range_map_to_gpu+0x1b8/0x730 [amdgpu]
          [Wed
              Jan  8 10:35:44 2025 <    0.000078>] 
              svm_range_validate_and_map+0x978/0xd30 [amdgpu]
          [Wed
              Jan  8 10:35:44 2025 <    0.000065>] 
              svm_range_set_attr+0x55f/0xb20 [amdgpu]
          [Wed
              Jan  8 10:35:44 2025 <    0.000060>] 
              kfd_ioctl+0x208/0x540 [amdgpu]
          [Wed
              Jan  8 10:35:44 2025 <    0.000058>]  ?
              kfd_ioctl_set_xnack_mode+0xd0/0xd0 [amdgpu]
          [Wed
              Jan  8 10:35:44 2025 <    0.000004>]  ?
              vm_mmap_pgoff+0xf2/0x120
          [Wed
              Jan  8 10:35:44 2025 <    0.000002>] 
              __x64_sys_ioctl+0x88/0xc0
          [Wed
              Jan  8 10:35:44 2025 <    0.000003>] 
              do_syscall_64+0x2e/0x50
          [Wed
              Jan  8 10:35:44 2025 <    0.000002>] 
              entry_SYSCALL_64_after_hwframe+0x62/0xc7
          [Wed
              Jan  8 10:35:44 2025 <    0.000008>] RIP:
              0033:0x7f8c472617db
          [Wed
              Jan  8 10:35:44 2025 <    0.000001>] RSP:
              002b:00007ffd2908a688 EFLAGS: 00000246 ORIG_RAX:
              0000000000000010
          [Wed
              Jan  8 10:35:44 2025 <    0.000002>] RAX:
              ffffffffffffffda RBX: 00007ffd2908a6fc RCX:
              00007f8c472617db
          [Wed
              Jan  8 10:35:44 2025 <    0.000002>] RDX:
              00007ffd2908a6c0 RSI: 00000000c0384b20 RDI:
              0000000000000003
          [Wed
              Jan  8 10:35:44 2025 <    0.000000>] RBP:
              00007ffd2908a6c0 R08: 0000000000000000 R09:
              0000000000000000
          [Wed
              Jan  8 10:35:44 2025 <    0.000001>] R10:
              00007f70f467f000 R11: 0000000000000246 R12:
              00000000c0384b20
          [Wed
              Jan  8 10:35:44 2025 <    0.000000>] R13:
              0000000000000003 R14: 0000000000200000 R15:
              00007ffd2908a770
          [Wed
              Jan  8 10:35:44 2025 <    0.000003>] INFO: task
              kworker/u129:7:5942 blocked for more than 122 seconds.
          [Wed
              Jan  8 10:35:44 2025 <    0.001897>]       Tainted:
              G           OE K   5.10.134-17.2.al8.x86_64 #1
          [Wed
              Jan  8 10:35:44 2025 <    0.000247>] "echo 0 >
              /proc/sys/kernel/hung_task_timeout_secs" disables this
              message.
          [Wed
              Jan  8 10:35:44 2025 <    0.000315>]
              task:kworker/u129:7  state:D stack:    0 pid: 5942
              ppid:     2 flags:0x00004080
          [Wed
              Jan  8 10:35:44 2025 <    0.000067>] Workqueue:
              amdgpu_recycle amdgpu_vm_pt_free_work [amdgpu]
          [Wed
              Jan  8 10:35:44 2025 <    0.000002>] Call Trace:
          [Wed
              Jan  8 10:35:44 2025 <    0.000003>] 
              __schedule+0x1ba/0x490
          [Wed
              Jan  8 10:35:44 2025 <    0.000002>]  ?
              newidle_balance+0x16a/0x3b0
          [Wed
              Jan  8 10:35:44 2025 <    0.000001>] 
              schedule+0x46/0xb0
          [Wed
              Jan  8 10:35:44 2025 <    0.000002>] 
              schedule_preempt_disabled+0xa/0x10
          [Wed
              Jan  8 10:35:44 2025 <    0.000001>] 
              __ww_mutex_lock.constprop.0+0x390/0x6e0
          [Wed
              Jan  8 10:35:44 2025 <    0.000045>] 
              amdgpu_vm_pt_free_work+0x97/0x160 [amdgpu]
          [Wed
              Jan  8 10:35:44 2025 <    0.000003>] 
              process_one_work+0x1ad/0x380
          [Wed
              Jan  8 10:35:44 2025 <    0.000001>] 
              worker_thread+0x49/0x310
          [Wed
              Jan  8 10:35:44 2025 <    0.000001>]  ?
              process_one_work+0x380/0x380
          [Wed
              Jan  8 10:35:44 2025 <    0.000001>] 
              kthread+0x118/0x140
          [Wed
              Jan  8 10:35:44 2025 <    0.000002>]  ?
              __kthread_bind_mask+0x60/0x60
          [Wed
              Jan  8 10:35:44 2025 <    0.000002>] 
              ret_from_fork+0x1f/0x30
           
           
    Move 
        flush_workqueue to the beginning of svm_range_validate_and_map
        should fix the deadlock, deadlock is because it is after
        svm_range_reserve_bos. Also there is no concurrent unmap mmu
        notifier callback to free pt bo as mmap read lock is taken
        outside svm_range_validate_and_map.

      
    Ideally
        it is enough to flush work amdgpu_vm_pt_free_work
        (not flush queue system_wq), but svm_range_validate_and_map
          cannot get the correct vm to flush.

        
    adev->wq
        is shared by all processes and all xcp partitions, maybe better
        to add wq to KFD process info, but right now
        amdgpu_vm_update_range cannot access KFD process info.
    Regards,
    Philip

      
          Emily
              Deng
          Best
              Wishes
          
            
                From:
                    amd-gfx
                    <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx>
                    On Behalf Of Deng, Emily

                    Sent: Wednesday, January 8, 2025 8:34 AM

                    To: Yang, Philip <Philip.Yang@xxxxxxx>;
                    Kuehling, Felix <Felix.Kuehling@xxxxxxx>;
                    amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Koenig, Christian
                    <Christian.Koenig@xxxxxxx>

                    Subject: RE: [PATCH v2] drm/amdgpu: Fix the
                    looply call svm_range_restore_pages issue
              
            
            [AMD
                Official Use Only - AMD Internal Distribution Only]
             
            
              [AMD
                  Official Use Only - AMD Internal Distribution Only]
               
              
                      From:
                          Yang, Philip <Philip.Yang@xxxxxxx>
                          

                          Sent: Tuesday, January 7, 2025 11:19 PM

                          To: Deng, Emily <Emily.Deng@xxxxxxx>;
                          Kuehling, Felix <Felix.Kuehling@xxxxxxx>;
                          amd-gfx@xxxxxxxxxxxxxxxxxxxxx;
                          Yang, Philip <Philip.Yang@xxxxxxx>;
                          Koenig, Christian <Christian.Koenig@xxxxxxx>

                          Subject: Re: [PATCH v2] drm/amdgpu: Fix
                          the looply call svm_range_restore_pages issue
                    
                  
                    On 2025-01-07 07:30, Deng,
                      Emily wrote:
                  
                  
                    [AMD Official Use Only - AMD Internal Distribution Only]
                     
                    Hi Felix,
                        You are right, it is easily to hit deadlock, don't know why LOCKDEP doesn't catch this. Need to find another solution.
                     
                    Hi Philip,
                         Do you have a solution for this delay free pt?
                  
                  Thanks for debugging this case, I had a patch to
                    not free PTB bo when unmapping from GPU, but it will
                    waste VRAM memory. My test case also passed with the
                    tlb flush fence fix, I don't see the no-retry fault
                    generated any more.
                  The deadlock is probably from
                    svm_range_unmap_from_gpu ->
                    flush_workqueue(adev->wq), this is from mmu
                    notifier callback, actually we only need flush
                    pt_free_work before mapping to gpu, please remove
                    the flush_workqueue in unmap from gpu. If deadlock
                    still happens, please post the backtrace.
                  [Emily]Yes,
                      you are right, will try to remove
                    flush_workqueue in unmap from gpu to have a
                    try. Will send a v3.
                  I think you don't need add new adev->wq, use
                    default system_wq and flush_work.
                  [Emily]No,
                      it doesn’t allow to flush a system_wq in driver,
                      it will trigger a kernel warning, as lots of other
                      work will be put in system_wq. I have tried this.
                  Regards,
                  Philip
                  
                     
                    Emily Deng
                    Best Wishes
                     
                    
                      -----Original Message-----
                      From: Deng, Emily <Emily.Deng@xxxxxxx>
                      Sent: Tuesday, January 7, 2025 10:34 AM
                      To: Deng, Emily <Emily.Deng@xxxxxxx>; Kuehling, Felix
                      <Felix.Kuehling@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Yang, Philip
                      <Philip.Yang@xxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx>
                      Subject: RE: [PATCH v2] drm/amdgpu: Fix the looply call svm_range_restore_pages
                      issue
                       
                      [AMD Official Use Only - AMD Internal Distribution Only]
                       
                      Ping....
                      How about this? Currently it is easily to reproduce the issue on our environment. We
                      need this change to fix it.
                       
                      Emily Deng
                      Best Wishes
                       
                       
                        -----Original Message-----
                        From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf Of
                        Deng, Emily
                        Sent: Monday, January 6, 2025 9:52 AM
                        To: Kuehling, Felix <Felix.Kuehling@xxxxxxx>;
                        amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Yang, Philip <Philip.Yang@xxxxxxx>;
                        Koenig, Christian <Christian.Koenig@xxxxxxx>
                        Subject: RE: [PATCH v2] drm/amdgpu: Fix the looply call
                        svm_range_restore_pages issue
                         
                        [AMD Official Use Only - AMD Internal Distribution Only]
                         
                        [AMD Official Use Only - AMD Internal Distribution Only]
                         
                        
                          -----Original Message-----
                          From: Kuehling, Felix <Felix.Kuehling@xxxxxxx>
                          Sent: Saturday, January 4, 2025 7:18 AM
                          To: Deng, Emily <Emily.Deng@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx;
                          Yang, Philip <Philip.Yang@xxxxxxx>; Koenig, Christian
                          <Christian.Koenig@xxxxxxx>
                          Subject: Re: [PATCH v2] drm/amdgpu: Fix the looply call
                          svm_range_restore_pages issue
                           
                           
                          On 2025-01-02 21:26, Emily Deng wrote:
                          
                            As the delayed free pt, the wanted freed bo has been reused which
                            will cause unexpected page fault, and then call svm_range_restore_pages.
                             
                            Detail as below:
                            1.It wants to free the pt in follow code, but it is not freed
                            immediately and used “schedule_work(&vm->pt_free_work);”.
                             
                            [   92.276838] Call Trace:
                            [   92.276841]  dump_stack+0x63/0xa0
                            [   92.276887]  amdgpu_vm_pt_free_list+0xfb/0x120 [amdgpu]
                            [   92.276932]  amdgpu_vm_update_range+0x69c/0x8e0 [amdgpu]
                            [   92.276990]  svm_range_unmap_from_gpus+0x112/0x310 [amdgpu]
                            [   92.277046]  svm_range_cpu_invalidate_pagetables+0x725/0x780 [amdgpu]
                            [   92.277050]  ? __alloc_pages_nodemask+0x19f/0x3e0
                            [   92.277051]  mn_itree_invalidate+0x72/0xc0
                            [   92.277052]  __mmu_notifier_invalidate_range_start+0x48/0x60
                            [   92.277054]  migrate_vma_collect+0xf6/0x100
                            [   92.277055]  migrate_vma_setup+0xcf/0x120
                            [   92.277109]  svm_migrate_ram_to_vram+0x256/0x6b0 [amdgpu]
                             
                            2.Call svm_range_map_to_gpu->amdgpu_vm_update_range to update the
                            page table, at this time it will use the same entry bo which is the
                            want free bo in step1.
                             
                            3.Then it executes the pt_free_work to free the bo. At this time,
                            the GPU access memory will cause page fault as pt bo has been freed.
                            And then it will call svm_range_restore_pages again.
                             
                            How to fix?
                            Add a workqueue, and flush the workqueue each time before updating page
                          
                        
                      table.
                      
                        
                          I think this is kind of a known issue in the GPUVM code. Philip was
                          looking at it before.
                           
                          Just flushing a workqueue may seem like a simple and elegant solution,
                          but I'm afraid it introduces lock dependency issues. By flushing the
                          workqueue, you're effectively creating a lock dependency of the MMU
                          notifier on any locks held inside the worker function. You now get a
                          circular lock dependency with any of those locks and memory reclaim. I
                          think LOCKDEP would be able to catch that if you compile your kernel
                          with that
                        
                        feature enabled.
                        
                           
                          The proper solution is to prevent delayed freeing of page tables if
                          they happened to get reused, or prevent reuse of page tables if they
                          are flagged for
                        
                        delayed freeing.
                        
                           
                          Regards,
                            Felix
                           
                        
                        Thanks, already enabled LOCKDEP while compiling the kernel. The delay
                        work seems for other reasons, I am not sure whether it could be deleted completely.
                         
                        Emily Deng
                        Best Wishes
                        
                           
                            Signed-off-by: Emily Deng <Emily.Deng@xxxxxxx>
                            ---
                              drivers/gpu/drm/amd/amdgpu/amdgpu.h              | 1 +
                              drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 1 +
                              drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c           | 7 +++++--
                              drivers/gpu/drm/amd/amdgpu/amdgpu_vm_pt.c        | 6 +++++-
                              drivers/gpu/drm/amd/amdkfd/kfd_svm.c             | 3 +++
                              5 files changed, 15 insertions(+), 3 deletions(-)
                             
                            diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
                            b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
                            index 93c352b08969..cbf68ad1c8d0 100644
                            --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
                            +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
                            @@ -1188,6 +1188,7 @@ struct amdgpu_device {
                                 struct mutex                    enforce_isolation_mutex;
                             
                                 struct amdgpu_init_level *init_lvl;
                            +    struct workqueue_struct *wq;
                              };
                             
                              static inline uint32_t amdgpu_ip_version(const struct
                            amdgpu_device *adev, diff --git
                            a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
                            b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
                            index f30548f4c3b3..5b4835bc81b3 100644
                            --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
                            +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
                            @@ -2069,6 +2069,7 @@ int amdgpu_amdkfd_gpuvm_map_memory_to_gpu(
                                         if (ret)
                                                 goto out;
                                 }
                            +    flush_workqueue(adev->wq);
                             
                                 ret = reserve_bo_and_vm(mem, avm, &ctx);
                                 if (unlikely(ret))
                            diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
                            b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
                            index 9d6ffe38b48a..500d97cd9114 100644
                            --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
                            +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
                            @@ -2607,7 +2607,7 @@ void amdgpu_vm_fini(struct amdgpu_device
                            *adev,
                          
                          struct amdgpu_vm *vm)
                          
                                 amdgpu_amdkfd_gpuvm_destroy_cb(adev, vm);
                             
                                 flush_work(&vm->pt_free_work);
                            -
                            +    cancel_work_sync(&vm->pt_free_work);
                                 root = amdgpu_bo_ref(vm->root.bo);
                                 amdgpu_bo_reserve(root, true);
                                 amdgpu_vm_put_task_info(vm->task_info);
                            @@ -2708,6 +2708,8 @@ void amdgpu_vm_manager_init(struct
                            amdgpu_device
                          
                          *adev)
                          
                              #endif
                             
                                 xa_init_flags(&adev->vm_manager.pasids, XA_FLAGS_LOCK_IRQ);
                            +    adev->wq = alloc_workqueue("amdgpu_recycle",
                            +                               WQ_MEM_RECLAIM | WQ_HIGHPRI |
                          
                          WQ_UNBOUND, 16);
                          
                              }
                             
                              /**
                            @@ -2721,7 +2723,8 @@ void amdgpu_vm_manager_fini(struct
                            amdgpu_device
                          
                          *adev)
                          
                              {
                                 WARN_ON(!xa_empty(&adev->vm_manager.pasids));
                                 xa_destroy(&adev->vm_manager.pasids);
                            -
                            +    flush_workqueue(adev->wq);
                            +    destroy_workqueue(adev->wq);
                                 amdgpu_vmid_mgr_fini(adev);
                              }
                             
                            diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_pt.c
                            b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_pt.c
                            index f78a0434a48f..1204406215ee 100644
                            --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_pt.c
                            +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_pt.c
                            @@ -554,15 +554,19 @@ void amdgpu_vm_pt_free_work(struct work_struct
                            *work)
                             
                                 vm = container_of(work, struct amdgpu_vm, pt_free_work);
                             
                            +    printk("Emily:%s\n", __func__);
                                 spin_lock(&vm->status_lock);
                                 list_splice_init(&vm->pt_freed, &pt_freed);
                                 spin_unlock(&vm->status_lock);
                            +    printk("Emily:%s 1\n", __func__);
                             
                                 /* flush_work in amdgpu_vm_fini ensure vm->root.bo is valid. */
                                 amdgpu_bo_reserve(vm->root.bo, true);
                            +    printk("Emily:%s 2\n", __func__);
                             
                                 list_for_each_entry_safe(entry, next, &pt_freed, vm_status)
                                         amdgpu_vm_pt_free(entry);
                            +    printk("Emily:%s 3\n", __func__);
                             
                                 amdgpu_bo_unreserve(vm->root.bo);
                              }
                            @@ -589,7 +593,7 @@ void amdgpu_vm_pt_free_list(struct amdgpu_device
                          
                          *adev,
                          
                                         spin_lock(&vm->status_lock);
                                         list_splice_init(&params->tlb_flush_waitlist, &vm->pt_freed);
                                         spin_unlock(&vm->status_lock);
                            -            schedule_work(&vm->pt_free_work);
                            +            queue_work(adev->wq, &vm->pt_free_work);
                                         return;
                                 }
                             
                            diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
                            b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
                            index 3e2911895c74..55edf96d5a95 100644
                            --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
                            +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
                            @@ -1314,6 +1314,7 @@ svm_range_unmap_from_gpu(struct amdgpu_device
                          
                          *adev, struct amdgpu_vm *vm,
                          
                                 uint64_t init_pte_value = 0;
                             
                                 pr_debug("[0x%llx 0x%llx]\n", start, last);
                            +    flush_workqueue(adev->wq);
                             
                                 return amdgpu_vm_update_range(adev, vm, false, true, true,
                            false, NULL,
                          
                          start,
                          
                                                               last, init_pte_value, 0, 0, NULL,
                            NULL, @@ -1422,6
                          
                          +1423,8
                          
                            @@ svm_range_map_to_gpu(struct kfd_process_device *pdd, struct
                            svm_range
                          
                          *prange,
                          
                                          * different memory partition based on fpfn/lpfn, we should use
                                          * same vm_manager.vram_base_offset regardless memory partition.
                                          */
                            +            flush_workqueue(adev->wq);
                            +
                                         r = amdgpu_vm_update_range(adev, vm, false, false, flush_tlb, true,
                                                                    NULL, last_start, prange->start + i,
                                                                    pte_flags,