> On 2021-04-28 9:53 p.m., Philip Yang wrote:
>> If migration vma setup, but failed before start sdma memory copy,
>> e.g. process is killed, don't wait for sdma fence done.
>
> I think you could describe this more generally as "Handle errors
> returned by svm_migrate_copy_to_vram/ram".
>
>> Signed-off-by: Philip Yang <Philip.Yang@xxxxxxx>
>> ---
>>  drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 20 ++++++++++++--------
>>  1 file changed, 12 insertions(+), 8 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
>> index 6b810863f6ba..19b08247ba8a 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
>> @@ -460,10 +460,12 @@ svm_migrate_vma_to_vram(struct amdgpu_device *adev, struct svm_range *prange,
>>  	}
>>
>>  	if (migrate.cpages) {
>> -		svm_migrate_copy_to_vram(adev, prange, &migrate, &mfence,
>> -					 scratch);
>> -		migrate_vma_pages(&migrate);
>> -		svm_migrate_copy_done(adev, mfence);
>> +		r = svm_migrate_copy_to_vram(adev, prange, &migrate, &mfence,
>> +					     scratch);
>> +		if (!r) {
>> +			migrate_vma_pages(&migrate);
>> +			svm_migrate_copy_done(adev, mfence);
>
> I think there are failure cases where svm_migrate_copy_to_vram
> successfully copies some pages but fails somewhere in the middle. I
> think in those cases you still want to call migrate_vma_pages and
> svm_migrate_copy_done. If the copy never started for some reason,
> there should be no mfence and svm_migrate_copy_done should be a
> no-op.
>
> I probably don't understand the failure scenario you encountered.
> Can you explain that in more detail?
I had the backtrace below, but cannot reproduce it; I used ctrl-c to
kill the process while it was handling a GPU retry fault. I will send
a new patch to fix the WARNING. The "amdgpu: qcm fence wait loop
timeout expired" message and the hang-issue log are something else,
not caused by svm_migrate_copy_done waiting for the fence.
[ 58.822450] VRAM BO missing during validation
[ 58.822488] WARNING: CPU: 3 PID: 2544 at
/home/yangp/git/compute_staging/kernel/drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_svm.c:1376
svm_range_validate_and_map+0xeea/0xf30 [amdgpu]
[ 58.822820] Modules linked in: xt_multiport iptable_filter
ip6table_filter ip6_tables fuse i2c_piix4 k10temp ip_tables
x_tables amdgpu iommu_v2 gpu_sched ast drm_vram_helper
drm_ttm_helper ttm
[ 58.822902] CPU: 3 PID: 2544 Comm: kworker/3:2 Not tainted
5.11.0-kfd-yangp #1420
[ 58.822912] Hardware name: GIGABYTE MZ01-CE0-00/MZ01-CE0-00,
BIOS F12 08/05/2019
[ 58.822918] Workqueue: events amdgpu_irq_handle_ih1 [amdgpu]
[ 58.823197] RIP: 0010:svm_range_validate_and_map+0xeea/0xf30
[amdgpu]
[ 58.823504] Code: 8c b7 41 ec 41 be ea ff ff ff e9 20 fc ff ff
be 01 00 00 00 e8 57 27 3f ec e9 20 fe ff ff 48 c7 c7 40 7f 61 c0
e8 d6 54 d7 eb <0f> 0b 41 be ea ff ff ff e9 81 f3 ff ff 89
c2 48 c7 c6 c8 81 61 c0
[ 58.823513] RSP: 0018:ffffb2f740677850 EFLAGS: 00010286
[ 58.823524] RAX: 0000000000000000 RBX: ffff89a2902aa800 RCX:
0000000000000027
[ 58.823531] RDX: 0000000000000000 RSI: ffff89a96cc980b0 RDI:
ffff89a96cc980b8
[ 58.823536] RBP: ffff89a286f9f500 R08: 0000000000000001 R09:
0000000000000001
[ 58.823542] R10: ffffb2f740677ab8 R11: ffffb2f740677660 R12:
0000000555558e00
[ 58.823548] R13: ffff89a2902aaca0 R14: ffff89a289209000 R15:
ffff89a289209000
[ 58.823554] FS: 0000000000000000(0000)
GS:ffff89a96cc80000(0000) knlGS:0000000000000000
[ 58.823561] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 58.823567] CR2: 00007ffff7d91000 CR3: 000000013930e000 CR4:
00000000003506e0
[ 58.823573] Call Trace:
[ 58.823587] ? __lock_acquire+0x351/0x1a70
[ 58.823599] ? __lock_acquire+0x351/0x1a70
[ 58.823614] ? __lock_acquire+0x351/0x1a70
[ 58.823634] ? __lock_acquire+0x351/0x1a70
[ 58.823641] ? __lock_acquire+0x351/0x1a70
[ 58.823663] ? lock_acquire+0x242/0x390
[ 58.823670] ? free_one_page+0x3c/0x4b0
[ 58.823687] ? get_object+0x50/0x50
[ 58.823708] ? mark_held_locks+0x49/0x70
[ 58.823715] ? mark_held_locks+0x49/0x70
[ 58.823725] ? lockdep_hardirqs_on_prepare+0xd4/0x170
[ 58.823733] ? __free_pages_ok+0x360/0x480
[ 58.823753] ? svm_migrate_ram_to_vram+0x30f/0xa40 [amdgpu]
[ 58.824072] ? mark_held_locks+0x49/0x70
[ 58.824096] svm_range_restore_pages+0x608/0x950 [amdgpu]
[ 58.824410] amdgpu_vm_handle_fault+0xa9/0x3c0 [amdgpu]
[ 58.824673] gmc_v9_0_process_interrupt+0xa8/0x410 [amdgpu]
[ 58.824945] ? amdgpu_device_skip_hw_access+0x6b/0x70 [amdgpu]
[ 58.825191] ? amdgpu_irq_dispatch+0xc2/0x250 [amdgpu]
[ 58.825462] amdgpu_irq_dispatch+0xc2/0x250 [amdgpu]
[ 58.825743] amdgpu_ih_process+0x7b/0xe0 [amdgpu]
[ 58.826106] process_one_work+0x2a2/0x620
[ 58.826146] ? process_one_work+0x620/0x620
[ 58.826165] worker_thread+0x39/0x3f0
[ 58.826188] ? process_one_work+0x620/0x620
[ 58.826205] kthread+0x131/0x150
[ 58.826223] ? kthread_park+0x90/0x90
[ 58.826245] ret_from_fork+0x1f/0x30
[ 58.826292] irq event stamp: 2358517
[ 58.826301] hardirqs last enabled at (2358523):
[<ffffffffac100657>] console_unlock+0x487/0x580
[ 58.826313] hardirqs last disabled at (2358528):
[<ffffffffac1005b3>] console_unlock+0x3e3/0x580
[ 58.826326] softirqs last enabled at (2358470):
[<ffffffffad000306>] __do_softirq+0x306/0x429
[ 58.826341] softirqs last disabled at (2358449):
[<fffffffface00f8f>] asm_call_irq_on_stack+0xf/0x20
[ 58.826355] ---[ end trace ddec9ce1cb4ea7fc ]---
[ 67.807478] amdgpu: qcm fence wait loop timeout expired
[ 242.302930] INFO: task khugepaged:514 blocked for more than 120
seconds.
[ 242.303237] Tainted: G W 5.11.0-kfd-yangp
#1420
[ 242.303248] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 242.303256] task:khugepaged state:D stack: 0 pid: 514
ppid: 2 flags:0x00004000
[ 242.303270] Call Trace:
[ 242.303281] __schedule+0x31a/0x9f0
[ 242.303300] ? wait_for_completion+0x87/0x120
[ 242.303310] schedule+0x51/0xc0
[ 242.303318] schedule_timeout+0x193/0x360
[ 242.303331] ? mark_held_locks+0x49/0x70
[ 242.303339] ? mark_held_locks+0x49/0x70
[ 242.303347] ? wait_for_completion+0x87/0x120
[ 242.303354] ? lockdep_hardirqs_on_prepare+0xd4/0x170
[ 242.303364] ? wait_for_completion+0x87/0x120
[ 242.303372] wait_for_completion+0xba/0x120
[ 242.303385] __flush_work+0x273/0x480
[ 242.303398] ? flush_workqueue_prep_pwqs+0x140/0x140
[ 242.303423] ? lru_add_drain+0x110/0x110
[ 242.303434] lru_add_drain_all+0x172/0x1e0
[ 242.303447] khugepaged+0x68/0x2d10
[ 242.303481] ? wait_woken+0xa0/0xa0
[ 242.303496] ? collapse_pte_mapped_thp+0x3f0/0x3f0
[ 242.303503] kthread+0x131/0x150
[ 242.303512] ? kthread_park+0x90/0x90
[ 242.303523] ret_from_fork+0x1f/0x30
[ 242.303665]
Showing all locks held in the system:
[ 242.303679] 1 lock held by khungtaskd/508:
[ 242.303684] #0: ffffffffad94f220 (rcu_read_lock){....}-{1:2},
at: debug_show_all_locks+0xe/0x1b0
[ 242.303713] 1 lock held by khugepaged/514:
[ 242.303718] #0: ffffffffad977c08 (lock#5){+.+.}-{3:3}, at:
lru_add_drain_all+0x37/0x1e0
[ 242.303756] 6 locks held by kworker/3:2/2544:
[ 242.303764] 1 lock held by in:imklog/2733:
[ 242.303769] #0: ffff89a2928e58f0
(&f->f_pos_lock){+.+.}-{3:3}, at: __fdget_pos+0x45/0x50
[ 242.303838] 1 lock held by dmesg/4262:
[ 242.303843] #0: ffff89a3079980d0
(&user->lock){+.+.}-{3:3}, at: devkmsg_read+0x4a/0x2d0
[ 242.303875] =============================================
[ 311.585542] loop0: detected capacity change from 8 to 0
[ 363.135280] INFO: task khugepaged:514 blocked for more than 241
seconds.
[ 363.135304] Tainted: G W 5.11.0-kfd-yangp
#1420
[ 363.135313] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 363.135321] task:khugepaged state:D stack: 0 pid: 514
ppid: 2 flags:0x00004000
[ 363.135336] Call Trace:
[ 363.135347] __schedule+0x31a/0x9f0
[ 363.135365] ? wait_for_completion+0x87/0x120
[ 363.135375] schedule+0x51/0xc0
[ 363.135383] schedule_timeout+0x193/0x360
[ 363.135395] ? mark_held_locks+0x49/0x70
[ 363.135403] ? mark_held_locks+0x49/0x70
[ 363.135412] ? wait_for_completion+0x87/0x120
[ 363.135419] ? lockdep_hardirqs_on_prepare+0xd4/0x170
[ 363.135428] ? wait_for_completion+0x87/0x120
[ 363.135436] wait_for_completion+0xba/0x120
[ 363.135448] __flush_work+0x273/0x480
[ 363.135462] ? flush_workqueue_prep_pwqs+0x140/0x140
[ 363.135486] ? lru_add_drain+0x110/0x110
[ 363.135498] lru_add_drain_all+0x172/0x1e0
[ 363.135511] khugepaged+0x68/0x2d10
[ 363.135544] ? wait_woken+0xa0/0xa0
[ 363.135558] ? collapse_pte_mapped_thp+0x3f0/0x3f0
[ 363.135566] kthread+0x131/0x150
[ 363.135575] ? kthread_park+0x90/0x90
[ 363.135586] ret_from_fork+0x1f/0x30
[ 363.135718]
Showing all locks held in the system:
[ 363.135731] 1 lock held by khungtaskd/508:
[ 363.135737] #0: ffffffffad94f220 (rcu_read_lock){....}-{1:2},
at: debug_show_all_locks+0xe/0x1b0
[ 363.135765] 1 lock held by khugepaged/514:
[ 363.135771] #0: ffffffffad977c08 (lock#5){+.+.}-{3:3}, at:
lru_add_drain_all+0x37/0x1e0
[ 363.135810] 5 locks held by kworker/3:2/2544:
[ 363.135818] 1 lock held by in:imklog/2733:
[ 363.135823] #0: ffff89a2928e58f0
(&f->f_pos_lock){+.+.}-{3:3}, at: __fdget_pos+0x45/0x50
[ 363.135887] 1 lock held by dmesg/4262:
[ 363.135892] #0: ffff89a3079980d0
(&user->lock){+.+.}-{3:3}, at: devkmsg_read+0x4a/0x2d0
> Thanks,
>   Felix
>
>> +		}
>>  		migrate_vma_finalize(&migrate);
>>  	}
>>
>> @@ -663,10 +665,12 @@ svm_migrate_vma_to_ram(struct amdgpu_device *adev, struct svm_range *prange,
>>  	pr_debug("cpages %ld\n", migrate.cpages);
>>  	if (migrate.cpages) {
>> -		svm_migrate_copy_to_ram(adev, prange, &migrate, &mfence,
>> -					scratch);
>> -		migrate_vma_pages(&migrate);
>> -		svm_migrate_copy_done(adev, mfence);
>> +		r = svm_migrate_copy_to_ram(adev, prange, &migrate, &mfence,
>> +					    scratch);
>> +		if (!r) {
>> +			migrate_vma_pages(&migrate);
>> +			svm_migrate_copy_done(adev, mfence);
>> +		}
>>  		migrate_vma_finalize(&migrate);
>>  	} else {
>>  		pr_debug("failed collect migrate device pages [0x%lx 0x%lx]\n",
_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/amd-gfx