> On 2021-04-28 9:53 p.m., Philip Yang wrote:
>> If migration vma setup, but failed before start sdma memory copy,
>> e.g. process is killed, don't wait for sdma fence done.
>
> I think you could describe this more generally as "Handle errors
> returned by svm_migrate_copy_to_vram/ram".
>
>> Signed-off-by: Philip Yang <Philip.Yang@xxxxxxx>
>> ---
>>  drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 20 ++++++++++++--------
>>  1 file changed, 12 insertions(+), 8 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
>> index 6b810863f6ba..19b08247ba8a 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
>> @@ -460,10 +460,12 @@ svm_migrate_vma_to_vram(struct amdgpu_device *adev, struct svm_range *prange,
>>  	}
>>
>>  	if (migrate.cpages) {
>> -		svm_migrate_copy_to_vram(adev, prange, &migrate, &mfence,
>> -					 scratch);
>> -		migrate_vma_pages(&migrate);
>> -		svm_migrate_copy_done(adev, mfence);
>> +		r = svm_migrate_copy_to_vram(adev, prange, &migrate, &mfence,
>> +					     scratch);
>> +		if (!r) {
>> +			migrate_vma_pages(&migrate);
>> +			svm_migrate_copy_done(adev, mfence);
>
> I think there are failure cases where svm_migrate_copy_to_vram
> successfully copies some pages but fails somewhere in the middle. I
> think in those cases you still want to call migrate_vma_pages and
> svm_migrate_copy_done. If the copy never started for some reason,
> there should be no mfence and svm_migrate_copy_done should be a
> no-op.
>
> I probably don't understand the failure scenario you encountered.
> Can you explain that in more detail?
I had the backtrace below, but cannot reproduce it; I used ctrl-c to
kill the process while it was handling a GPU retry fault. I will send
a new patch to fix the WARNING. The "amdgpu: qcm fence wait loop
timeout expired" message and the hang-issue log are something else,
not caused by svm_migrate_copy_done waiting for the fence.
[ 58.822450] VRAM BO missing during validation
[ 58.822488] WARNING: CPU: 3 PID: 2544 at
/home/yangp/git/compute_staging/kernel/drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_svm.c:1376
svm_range_validate_and_map+0xeea/0xf30 [amdgpu]
[ 58.822820] Modules linked in: xt_multiport iptable_filter
ip6table_filter ip6_tables fuse i2c_piix4 k10temp ip_tables
x_tables amdgpu iommu_v2 gpu_sched ast drm_vram_helper
drm_ttm_helper ttm
[ 58.822902] CPU: 3 PID: 2544 Comm: kworker/3:2 Not tainted
5.11.0-kfd-yangp #1420
[ 58.822912] Hardware name: GIGABYTE MZ01-CE0-00/MZ01-CE0-00,
BIOS F12 08/05/2019
[ 58.822918] Workqueue: events amdgpu_irq_handle_ih1 [amdgpu]
[ 58.823197] RIP: 0010:svm_range_validate_and_map+0xeea/0xf30
[amdgpu]
[ 58.823504] Code: 8c b7 41 ec 41 be ea ff ff ff e9 20 fc ff ff
be 01 00 00 00 e8 57 27 3f ec e9 20 fe ff ff 48 c7 c7 40 7f 61 c0
e8 d6 54 d7 eb <0f> 0b 41 be ea ff ff ff e9 81 f3 ff ff 89
c2 48 c7 c6 c8 81 61 c0
[ 58.823513] RSP: 0018:ffffb2f740677850 EFLAGS: 00010286
[ 58.823524] RAX: 0000000000000000 RBX: ffff89a2902aa800 RCX:
0000000000000027
[ 58.823531] RDX: 0000000000000000 RSI: ffff89a96cc980b0 RDI:
ffff89a96cc980b8
[ 58.823536] RBP: ffff89a286f9f500 R08: 0000000000000001 R09:
0000000000000001
[ 58.823542] R10: ffffb2f740677ab8 R11: ffffb2f740677660 R12:
0000000555558e00
[ 58.823548] R13: ffff89a2902aaca0 R14: ffff89a289209000 R15:
ffff89a289209000
[ 58.823554] FS: 0000000000000000(0000)
GS:ffff89a96cc80000(0000) knlGS:0000000000000000
[ 58.823561] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 58.823567] CR2: 00007ffff7d91000 CR3: 000000013930e000 CR4:
00000000003506e0
[ 58.823573] Call Trace:
[ 58.823587] ? __lock_acquire+0x351/0x1a70
[ 58.823599] ? __lock_acquire+0x351/0x1a70
[ 58.823614] ? __lock_acquire+0x351/0x1a70
[ 58.823634] ? __lock_acquire+0x351/0x1a70
[ 58.823641] ? __lock_acquire+0x351/0x1a70
[ 58.823663] ? lock_acquire+0x242/0x390
[ 58.823670] ? free_one_page+0x3c/0x4b0
[ 58.823687] ? get_object+0x50/0x50
[ 58.823708] ? mark_held_locks+0x49/0x70
[ 58.823715] ? mark_held_locks+0x49/0x70
[ 58.823725] ? lockdep_hardirqs_on_prepare+0xd4/0x170
[ 58.823733] ? __free_pages_ok+0x360/0x480
[ 58.823753] ? svm_migrate_ram_to_vram+0x30f/0xa40 [amdgpu]
[ 58.824072] ? mark_held_locks+0x49/0x70
[ 58.824096] svm_range_restore_pages+0x608/0x950 [amdgpu]
[ 58.824410] amdgpu_vm_handle_fault+0xa9/0x3c0 [amdgpu]
[ 58.824673] gmc_v9_0_process_interrupt+0xa8/0x410 [amdgpu]
[ 58.824945] ? amdgpu_device_skip_hw_access+0x6b/0x70 [amdgpu]
[ 58.825191] ? amdgpu_irq_dispatch+0xc2/0x250 [amdgpu]
[ 58.825462] amdgpu_irq_dispatch+0xc2/0x250 [amdgpu]
[ 58.825743] amdgpu_ih_process+0x7b/0xe0 [amdgpu]
[ 58.826106] process_one_work+0x2a2/0x620
[ 58.826146] ? process_one_work+0x620/0x620
[ 58.826165] worker_thread+0x39/0x3f0
[ 58.826188] ? process_one_work+0x620/0x620
[ 58.826205] kthread+0x131/0x150
[ 58.826223] ? kthread_park+0x90/0x90
[ 58.826245] ret_from_fork+0x1f/0x30
[ 58.826292] irq event stamp: 2358517
[ 58.826301] hardirqs last enabled at (2358523):
[<ffffffffac100657>] console_unlock+0x487/0x580
[ 58.826313] hardirqs last disabled at (2358528):
[<ffffffffac1005b3>] console_unlock+0x3e3/0x580
[ 58.826326] softirqs last enabled at (2358470):
[<ffffffffad000306>] __do_softirq+0x306/0x429
[ 58.826341] softirqs last disabled at (2358449):
[<fffffffface00f8f>] asm_call_irq_on_stack+0xf/0x20
[ 58.826355] ---[ end trace ddec9ce1cb4ea7fc ]---
[ 67.807478] amdgpu: qcm fence wait loop timeout expired
[ 242.302930] INFO: task khugepaged:514 blocked for more than 120
seconds.
[ 242.303237] Tainted: G W 5.11.0-kfd-yangp
#1420
[ 242.303248] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 242.303256] task:khugepaged state:D stack: 0 pid: 514
ppid: 2 flags:0x00004000
[ 242.303270] Call Trace:
[ 242.303281] __schedule+0x31a/0x9f0
[ 242.303300] ? wait_for_completion+0x87/0x120
[ 242.303310] schedule+0x51/0xc0
[ 242.303318] schedule_timeout+0x193/0x360
[ 242.303331] ? mark_held_locks+0x49/0x70
[ 242.303339] ? mark_held_locks+0x49/0x70
[ 242.303347] ? wait_for_completion+0x87/0x120
[ 242.303354] ? lockdep_hardirqs_on_prepare+0xd4/0x170
[ 242.303364] ? wait_for_completion+0x87/0x120
[ 242.303372] wait_for_completion+0xba/0x120
[ 242.303385] __flush_work+0x273/0x480
[ 242.303398] ? flush_workqueue_prep_pwqs+0x140/0x140
[ 242.303423] ? lru_add_drain+0x110/0x110
[ 242.303434] lru_add_drain_all+0x172/0x1e0
[ 242.303447] khugepaged+0x68/0x2d10
[ 242.303481] ? wait_woken+0xa0/0xa0
[ 242.303496] ? collapse_pte_mapped_thp+0x3f0/0x3f0
[ 242.303503] kthread+0x131/0x150
[ 242.303512] ? kthread_park+0x90/0x90
[ 242.303523] ret_from_fork+0x1f/0x30
[ 242.303665]
Showing all locks held in the system:
[ 242.303679] 1 lock held by khungtaskd/508:
[ 242.303684] #0: ffffffffad94f220 (rcu_read_lock){....}-{1:2},
at: debug_show_all_locks+0xe/0x1b0
[ 242.303713] 1 lock held by khugepaged/514:
[ 242.303718] #0: ffffffffad977c08 (lock#5){+.+.}-{3:3}, at:
lru_add_drain_all+0x37/0x1e0
[ 242.303756] 6 locks held by kworker/3:2/2544:
[ 242.303764] 1 lock held by in:imklog/2733:
[ 242.303769] #0: ffff89a2928e58f0
(&f->f_pos_lock){+.+.}-{3:3}, at: __fdget_pos+0x45/0x50
[ 242.303838] 1 lock held by dmesg/4262:
[ 242.303843] #0: ffff89a3079980d0
(&user->lock){+.+.}-{3:3}, at: devkmsg_read+0x4a/0x2d0
[ 242.303875] =============================================
[ 311.585542] loop0: detected capacity change from 8 to 0
[ 363.135280] INFO: task khugepaged:514 blocked for more than 241
seconds.
[ 363.135304] Tainted: G W 5.11.0-kfd-yangp
#1420
[ 363.135313] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 363.135321] task:khugepaged state:D stack: 0 pid: 514
ppid: 2 flags:0x00004000
[ 363.135336] Call Trace:
[ 363.135347] __schedule+0x31a/0x9f0
[ 363.135365] ? wait_for_completion+0x87/0x120
[ 363.135375] schedule+0x51/0xc0
[ 363.135383] schedule_timeout+0x193/0x360
[ 363.135395] ? mark_held_locks+0x49/0x70
[ 363.135403] ? mark_held_locks+0x49/0x70
[ 363.135412] ? wait_for_completion+0x87/0x120
[ 363.135419] ? lockdep_hardirqs_on_prepare+0xd4/0x170
[ 363.135428] ? wait_for_completion+0x87/0x120
[ 363.135436] wait_for_completion+0xba/0x120
[ 363.135448] __flush_work+0x273/0x480
[ 363.135462] ? flush_workqueue_prep_pwqs+0x140/0x140
[ 363.135486] ? lru_add_drain+0x110/0x110
[ 363.135498] lru_add_drain_all+0x172/0x1e0
[ 363.135511] khugepaged+0x68/0x2d10
[ 363.135544] ? wait_woken+0xa0/0xa0
[ 363.135558] ? collapse_pte_mapped_thp+0x3f0/0x3f0
[ 363.135566] kthread+0x131/0x150
[ 363.135575] ? kthread_park+0x90/0x90
[ 363.135586] ret_from_fork+0x1f/0x30
[ 363.135718]
Showing all locks held in the system:
[ 363.135731] 1 lock held by khungtaskd/508:
[ 363.135737] #0: ffffffffad94f220 (rcu_read_lock){....}-{1:2},
at: debug_show_all_locks+0xe/0x1b0
[ 363.135765] 1 lock held by khugepaged/514:
[ 363.135771] #0: ffffffffad977c08 (lock#5){+.+.}-{3:3}, at:
lru_add_drain_all+0x37/0x1e0
[ 363.135810] 5 locks held by kworker/3:2/2544:
[ 363.135818] 1 lock held by in:imklog/2733:
[ 363.135823] #0: ffff89a2928e58f0
(&f->f_pos_lock){+.+.}-{3:3}, at: __fdget_pos+0x45/0x50
[ 363.135887] 1 lock held by dmesg/4262:
[ 363.135892] #0: ffff89a3079980d0
(&user->lock){+.+.}-{3:3}, at: devkmsg_read+0x4a/0x2d0
> Thanks,
>   Felix
>
>> +		}
>>  		migrate_vma_finalize(&migrate);
>>  	}
>>
>> @@ -663,10 +665,12 @@ svm_migrate_vma_to_ram(struct amdgpu_device *adev, struct svm_range *prange,
>>  	pr_debug("cpages %ld\n", migrate.cpages);
>>  	if (migrate.cpages) {
>> -		svm_migrate_copy_to_ram(adev, prange, &migrate, &mfence,
>> -					scratch);
>> -		migrate_vma_pages(&migrate);
>> -		svm_migrate_copy_done(adev, mfence);
>> +		r = svm_migrate_copy_to_ram(adev, prange, &migrate, &mfence,
>> +					    scratch);
>> +		if (!r) {
>> +			migrate_vma_pages(&migrate);
>> +			svm_migrate_copy_done(adev, mfence);
>> +		}
>>  		migrate_vma_finalize(&migrate);
>>  	} else {
>>  		pr_debug("failed collect migrate device pages [0x%lx 0x%lx]\n",
_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/amd-gfx