Hi Felix,
Thanks for your review and suggestion, but unfortunately the
AMDGPU_GEM_DOMAIN_PREEMPTIBLE is not defined in amdgpu_drm.h. I
understand we need the memory eviction on either kfd_mem_dmamap_dmabuf()
or kfd_mem_dmaunmap_dmabuf() to update DMA address, so I am thinking to
do it as simply as userptr memory does.
The purpose for this change is for non-MES HW scheduler we are using
userptr/paged memory, but since GFX11 we will be using MES scheduler and
it needs the memory to be allocated as GTT/non-paged memory, so we want
all GPUs using GTT/non-paged memory, but there is performance drop,
because of eviction in kfd_mem_dmaunmap_dmabuf.
Currently userptr memory is evicted in kfd_mem_dmamap_userptr as
changing domain to GTT before calling ttm_bo_validate, and not evicted
in kfd_mem_dmamap_userptr, so I think we can do the similar way for
GTT/non-paged memory that setting domain to CPU in
kfd_mem_dmamap_dmabuf, which will evict memory to update DMA address,
and setting domain to GTT in kfd_mem_dmaunmap_dmabuf, which will not
evict memory. The performance should be the same as userptr/paged memory.
Regards,
Eric
On 2023-04-04 16:40, Felix Kuehling wrote:
[+Christian]
OK, this comes from the ttm_bo_wait_ctx call in this section of
amdgpu_bo_move:
if ((old_mem->mem_type == TTM_PL_TT ||
old_mem->mem_type == AMDGPU_PL_PREEMPT) &&
new_mem->mem_type == TTM_PL_SYSTEM) {
r = ttm_bo_wait_ctx(bo, ctx);
if (r)
return r;
amdgpu_ttm_backend_unbind(bo->bdev, bo->ttm);
ttm_resource_free(bo, &bo->resource);
ttm_bo_assign_mem(bo, new_mem);
goto out;
}
We can't just remove this wait. It's not even specific to KFD or
DMABuf imports. We also can't just change it to avoid waiting for
eviction fences because it's also used for GTT BOs (e.g. before a BO
gets swapped under extreme memory pressure). So we also need to
trigger the eviction fence in general case.
In the specific case of DMABuf imports, they share the reservation
object with the original BO. So waiting on the reservation triggers
the eviction fence on the original BO. I think we want to avoid the
waiting on eviction fences for all BOs where the underlying memory is
managed by some other BO, and at the same time also avoid ever
evicting the DMABuf import BO. That's what AMDGPU_PL_PREEMPT is for.
So I think a combination of two changes should to the trick:
1. Change kfd_mem_dmamap_dmabuf to use AMDGPU_GEM_DOMAIN_PREEMPTIBLE
2. Add a special case in the above if-block for old_mem->mem_type ==
AMDGPU_PL_PREEMPT: use amdgpu_bo_sync_wait with
owner=AMDGPU_FENCE_OWNER_KFD so that it doesn't wait for eviction
fences
Regards,
Felix
Am 2023-04-04 um 10:36 schrieb Eric Huang:
Here is the backtrace from Jira:
Thu Nov 10 13:10:23 2022] Scheduling eviction of pid 97784 in 0 jiffies
[Thu Nov 10 13:10:23 2022] WARNING: CPU: 173 PID: 97784 at
/var/lib/dkms/amdgpu/5.16.9.22.20-1438746~20.04/build/amd/amdgpu/../amdkfd/kfd_device.c:878
kgd2kfd_schedule_evict_and_restore_process+0x104/0x120 [amdgpu]
[Thu Nov 10 13:10:23 2022] Modules linked in: veth amdgpu(OE)
amddrm_ttm_helper(OE) amdttm(OE) iommu_v2 amd_sched(OE) amdkcl(OE)
xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user
xfrm_algo xt_addrtype iptable_filter iptable_nat nf_nat nf_conntrack
nf_defrag_ipv6 nf_defrag_ipv4 bpfilter br_netfilter bridge stp llc
aufs overlay binfmt_misc nls_iso8859_1 dm_multipath scsi_dh_rdac
scsi_dh_emc scsi_dh_alua intel_rapl_msr intel_rapl_common amd64_edac
edac_mce_amd kvm_amd kvm efi_pstore rapl ipmi_ssif ccp acpi_ipmi
k10temp ipmi_si ipmi_devintf ipmi_msghandler mac_hid sch_fq_codel msr
ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10
raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx
xor raid6_pq libcrc32c raid1 raid0 multipath linear mlx5_ib ib_uverbs
ib_core crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel
crypto_simd cryptd ast drm_vram_helper drm_ttm_helper ttm mlx5_core
drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops
[Thu Nov 10 13:10:23 2022] pci_hyperv_intf cec psample igb mlxfw
rc_core dca ahci xhci_pci tls drm i2c_algo_bit libahci
xhci_pci_renesas i2c_piix4
[Thu Nov 10 13:10:23 2022] CPU: 173 PID: 97784 Comm: onnxruntime_tes
Tainted: G W OE 5.13.0-30-generic #33~20.04.1-Ubuntu
[Thu Nov 10 13:10:23 2022] Hardware name: GIGABYTE
G482-Z53-YF/MZ52-G40-00, BIOS R12 05/13/2020
[Thu Nov 10 13:10:23 2022] RIP:
0010:kgd2kfd_schedule_evict_and_restore_process+0x104/0x120 [amdgpu]
[Thu Nov 10 13:10:23 2022] Code: 5e 5d c3 4c 89 e7 e8 cb c6 44 df eb
e7 49 8b 45 60 48 89 ca 48 c7 c7 38 8b d7 c1 48 89 4d e0 8b b0 20 09
00 00 e8 87 ee 7e df <0f> 0b 48 8b 4d e0 eb 9f 41 be ea ff ff ff eb
ba 41 be ed ff ff ff
[Thu Nov 10 13:10:23 2022] RSP: 0018:ffffb25f2a173978 EFLAGS: 00010086
[Thu Nov 10 13:10:23 2022] RAX: 0000000000000000 RBX:
0000000000000001 RCX: 0000000000000027
[Thu Nov 10 13:10:23 2022] RDX: 0000000000000027 RSI:
00000000fffeffff RDI: ffff95d06e4a09c8
[Thu Nov 10 13:10:23 2022] RBP: ffffb25f2a173998 R08:
ffff95d06e4a09c0 R09: ffffb25f2a173750
[Thu Nov 10 13:10:23 2022] R10: 0000000000000001 R11:
0000000000000001 R12: ffff95c371d74580
[Thu Nov 10 13:10:23 2022] R13: ffff95b1cd3f2000 R14:
0000000000000000 R15: ffff95c371d74580
[Thu Nov 10 13:10:23 2022] FS: 00007fcaff268b00(0000)
GS:ffff95d06e480000(0000) knlGS:0000000000000000
[Thu Nov 10 13:10:23 2022] CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
[Thu Nov 10 13:10:23 2022] CR2: 00007fc643980000 CR3:
00000003e9492000 CR4: 0000000000350ee0
[Thu Nov 10 13:10:23 2022] Call Trace:
[Thu Nov 10 13:10:23 2022] <TASK>
[Thu Nov 10 13:10:23 2022] amdkfd_fence_enable_signaling+0x46/0x50
[amdgpu]
[Thu Nov 10 13:10:23 2022] __dma_fence_enable_signaling+0x52/0xb0
[Thu Nov 10 13:10:23 2022] dma_fence_default_wait+0xa9/0x200
[Thu Nov 10 13:10:23 2022] dma_fence_wait_timeout+0xbd/0xe0
[Thu Nov 10 13:10:23 2022] amddma_resv_wait_timeout+0x6f/0xd0 [amdkcl]
[Thu Nov 10 13:10:23 2022] amdttm_bo_wait+0x39/0x50 [amdttm]
[Thu Nov 10 13:10:23 2022] amdgpu_bo_move+0x41e/0x7b0 [amdgpu]
[Thu Nov 10 13:10:23 2022] ? down_write+0x13/0x50
[Thu Nov 10 13:10:23 2022] ? unmap_mapping_pages+0x68/0x130
[Thu Nov 10 13:10:23 2022] ttm_bo_handle_move_mem+0x7f/0x120 [amdttm]
[Thu Nov 10 13:10:23 2022] amdttm_bo_validate+0xbf/0x100 [amdttm]
[Thu Nov 10 13:10:23 2022] kfd_mem_dmaunmap_attachment+0x131/0x140
[amdgpu]
[Thu Nov 10 13:10:23 2022] unmap_bo_from_gpuvm+0x67/0x80 [amdgpu]
[Thu Nov 10 13:10:23 2022]
amdgpu_amdkfd_gpuvm_unmap_memory_from_gpu+0x114/0x220 [amdgpu]
[Thu Nov 10 13:10:23 2022] ? __mod_memcg_lruvec_state+0x22/0xe0
[Thu Nov 10 13:10:23 2022]
kfd_ioctl_unmap_memory_from_gpu+0xe8/0x270 [amdgpu]
[Thu Nov 10 13:10:23 2022] kfd_ioctl+0x23c/0x590 [amdgpu]
[Thu Nov 10 13:10:23 2022] ?
kfd_ioctl_get_process_apertures_new+0x330/0x330 [amdgpu]
[Thu Nov 10 13:10:23 2022] ? exit_to_user_mode_prepare+0x3d/0x1c0
[Thu Nov 10 13:10:23 2022] ? __fget_files+0xa7/0xd0
[Thu Nov 10 13:10:23 2022] __x64_sys_ioctl+0x91/0xc0
[Thu Nov 10 13:10:23 2022] do_syscall_64+0x61/0xb0
[Thu Nov 10 13:10:23 2022] ? do_syscall_64+0x6e/0xb0
[Thu Nov 10 13:10:23 2022] ? do_syscall_64+0x6e/0xb0
[Thu Nov 10 13:10:23 2022] ? do_syscall_64+0x6e/0xb0
[Thu Nov 10 13:10:23 2022] ? do_syscall_64+0x6e/0xb0
[Thu Nov 10 13:10:23 2022] ? asm_sysvec_apic_timer_interrupt+0xa/0x20
[Thu Nov 10 13:10:23 2022] entry_SYSCALL_64_after_hwframe+0x44/0xae
[Thu Nov 10 13:10:23 2022] RIP: 0033:0x7fcaff57b3ab
[Thu Nov 10 13:10:23 2022] Code: 0f 1e fa 48 8b 05 e5 7a 0d 00 64 c7
00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa
b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d b5 7a 0d
00 f7 d8 64 89 01 48
[Thu Nov 10 13:10:23 2022] RSP: 002b:00007fffe41e0098 EFLAGS:
00000206 ORIG_RAX: 0000000000000010
[Thu Nov 10 13:10:23 2022] RAX: ffffffffffffffda RBX:
00007fcacc7f7f80 RCX: 00007fcaff57b3ab
[Thu Nov 10 13:10:23 2022] RDX: 00007fffe41e0120 RSI:
00000000c0184b19 RDI: 0000000000000003
[Thu Nov 10 13:10:23 2022] RBP: 00007fffe41e00d0 R08:
0000562e2d5730d0 R09: 0000000000000000
[Thu Nov 10 13:10:23 2022] R10: 0000562e2c928ec0 R11:
0000000000000206 R12: 0000000000000001
[Thu Nov 10 13:10:23 2022] R13: 00007fffe41e04b0 R14:
0000000000000000 R15: 0000562e2d3f5b20
[Thu Nov 10 13:10:23 2022] </TASK>
[Thu Nov 10 13:10:23 2022] ---[ end trace 1464f08f6be60b30 ]---
Regards,
Eric
On 2023-04-04 10:11, Felix Kuehling wrote:
If we keep the BO in the GTT domain, it means it will not be updated
if we validate it again later in kfd_mem_dmamap_dmabuf. This means
we'll use stale DMA addresses when we update the page tables after
evictions.
I think we'll need to find a different way to avoid triggering the
eviction fence on the original BO when changing the placement of the
DMABuf import here. If you need help brainstorming here, please
share a backtrace from the eviction generated with the
debug_evictions module param.
Regards,
Felix
Am 2023-04-03 um 13:59 schrieb Eric Huang:
dmabuf is allocated/mapped as GTT domain, when dma-unmapping dmabuf
changing placement to CPU will trigger memory eviction after calling
ttm_bo_validate, and the eviction will cause performance drop.
Keeping the correct domain will solve the issue.
Signed-off-by: Eric Huang <jinhuieric.huang@xxxxxxx>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index a3b09edfd1bf..17b708acb447 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -642,7 +642,7 @@ kfd_mem_dmaunmap_dmabuf(struct
kfd_mem_attachment *attachment)
struct ttm_operation_ctx ctx = {.interruptible = true};
struct amdgpu_bo *bo = attachment->bo_va->base.bo;
- amdgpu_bo_placement_from_domain(bo, AMDGPU_GEM_DOMAIN_CPU);
+ amdgpu_bo_placement_from_domain(bo, AMDGPU_GEM_DOMAIN_GTT);
ttm_bo_validate(&bo->tbo, &bo->placement, &ctx);
}