On 2021-05-26 5:25 p.m., Felix Kuehling wrote:
> On 2021-05-26 3:21 p.m., Eric Huang wrote:
>> On 2021-05-25 3:16 p.m., Felix Kuehling wrote:
>>> Similar to a recent fix by Philip Yang, 76e08b37d0aa ("drm/amdgpu:
>>> flush TLB if valid PDE turns into PTE"), there needs to be a
>>> conditional TLB flush after map, if any PDEs were unmapped and
>>> turned into PTEs in the process. This is currently returned by
>>> amdgpu_vm_bo_update_mapping in the "table_freed" parameter. It also
>>> needs to be returned by amdgpu_vm_bo_update and reported back to
>>> KFD, so KFD can do the TLB flush after map, if needed.
>> I followed up on your suggestion, created another patch (attached)
>> and tested it. It doesn't seem to improve the latency when the
>> memory size is bigger than a huge page (2MB), because the
>> table_freed parameter is always true when mapping at huge-page
>> size. I think Philip's patch fixes the case of remapping memory
>> from small pages to a huge page in HMM, but it doesn't check
>> whether the memory was previously mapped, and flushes TLBs
>> unconditionally when mapping a huge page.
> That's unexpected. Turning an invalid PDE into a valid (huge) PTE
> should not trigger a TLB flush.
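For reference, the plumbing Felix describes might look roughly like
this, assuming the existing table_freed output of
amdgpu_vm_bo_update_mapping is threaded up through amdgpu_vm_bo_update
to the KFD map path (the signature change and call site below are an
illustrative sketch, not the actual patch):

/* Sketch only: thread table_freed up to the caller. */
int amdgpu_vm_bo_update(struct amdgpu_device *adev,
			struct amdgpu_bo_va *bo_va,
			bool clear, bool *table_freed);

/* KFD map path: flush only if a PDE was actually turned into a PTE. */
bool table_freed = false;
int r;

r = amdgpu_vm_bo_update(adev, bo_va, false, &table_freed);
if (r)
	return r;

if (table_freed)
	kfd_flush_tlb(peer_pdd);	/* a "legacy" flush is enough after map */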
table_freed will be true if the PDE was used by a previous mapping:
unmapping the previous mapping clears the PTEs but leaves the PDE
unchanged as P=0, V=1 (in memory and in the TLB). Mapping a huge page
then turns the PDE into a PTE (P=1, V=1) in memory and frees the PTE
page, so table_freed is set.
For example: map 0x7ffe37401000, unmap it, then map 0x7ffe37400000 as
a 2MB huge page. table_freed will be true, meaning a TLB flush is
needed after mapping the huge page.
You can change the test so it doesn't unmap the previous mapping; then
the 2MB huge page will get a new GPU virtual address. Alternatively,
close KFD and open KFD again to create a new GPU VM.
Regards,
Philip
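To make the sequence Philip describes concrete, here is a pseudo-C
sketch of the page-table walker behaviour; the helpers (entry_is_pde,
write_huge_pte, free_page_table) and the exact structure are
hypothetical, for illustration only:

/* Pseudo-code of the walker behaviour described above. */
if (entry_is_pde(entry) && fragment_is_huge_page(start, end)) {
	/*
	 * The PDE is still P=0, V=1 in memory and possibly in the TLB,
	 * left over from the previous (now unmapped) 4KB mappings.
	 */
	free_page_table(entry);			/* drop the now-unused PTE page */
	write_huge_pte(entry, addr, flags);	/* PDE becomes PTE: P=1, V=1 */
	*table_freed = true;			/* a stale PDE may be cached,
						 * so the caller must flush */
}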
> Regards,
>   Felix
>
>>> kfd_flush_tlb probably needs a new parameter to determine the flush
>>> type. The flush after map can be a "legacy" flush (type 0). The
>>> flush after unmap must be a "heavy-weight" flush (type 2) to make
>>> sure we don't evict cache lines into pages that we no longer own.
>>>
>>> Finally, in the ticket I thought about possible optimizations using
>>> a worker to minimize the impact of TLB flushes on unmap latency.
>>> That could be a follow-up commit.
>> Using a worker is a good idea, but how do we guarantee it has
>> finished before the memory is remapped? If remapping depends on it,
>> then more latency will be introduced in map.
>>
>> Regards,
>> Eric
>>> Regards,
>>>   Felix
>>>
>>> On 2021-05-25 1:53 p.m., Eric Huang wrote:
>>>> It is to optimize memory allocation latency.
>>>>
>>>> Signed-off-by: Eric Huang <jinhuieric.huang@xxxxxxx>
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
>>>> index 960913a35ee4..ab73741edb97 100644
>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
>>>> @@ -1657,20 +1657,6 @@ static int kfd_ioctl_map_memory_to_gpu(struct file *filep,
>>>>  		goto sync_memory_failed;
>>>>  	}
>>>>  
>>>> -	/* Flush TLBs after waiting for the page table updates to complete */
>>>> -	for (i = 0; i < args->n_devices; i++) {
>>>> -		peer = kfd_device_by_id(devices_arr[i]);
>>>> -		if (WARN_ON_ONCE(!peer))
>>>> -			continue;
>>>> -		peer_pdd = kfd_get_process_device_data(peer, p);
>>>> -		if (WARN_ON_ONCE(!peer_pdd))
>>>> -			continue;
>>>> -		if (!amdgpu_read_lock(peer->ddev, true)) {
>>>> -			kfd_flush_tlb(peer_pdd);
>>>> -			amdgpu_read_unlock(peer->ddev);
>>>> -		}
>>>> -	}
>>>> -
>>>>  	kfree(devices_arr);
>>>>  
>>>>  	trace_kfd_map_memory_to_gpu_end(p,
>>>> @@ -1766,6 +1752,7 @@ static int kfd_ioctl_unmap_memory_from_gpu(struct file *filep,
>>>>  			amdgpu_read_unlock(peer->ddev);
>>>>  			goto unmap_memory_from_gpu_failed;
>>>>  		}
>>>> +		kfd_flush_tlb(peer_pdd);
>>>>  		amdgpu_read_unlock(peer->ddev);
>>>>  		args->n_success = i+1;
>>>>  	}
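For reference, the flush-type parameter Felix suggests could look
something like the sketch below. The enum values follow his
type 0 / type 2 description; the names and the exact interface are
assumptions, not a final design:

enum kfd_tlb_flush_type {
	KFD_TLB_FLUSH_LEGACY = 0,	/* after map */
	KFD_TLB_FLUSH_HEAVYWEIGHT = 2,	/* after unmap; also makes sure
					 * no cache lines can be evicted
					 * into pages we no longer own */
};

void kfd_flush_tlb(struct kfd_process_device *pdd,
		   enum kfd_tlb_flush_type type);

/* unmap path */
kfd_flush_tlb(peer_pdd, KFD_TLB_FLUSH_HEAVYWEIGHT);

/* map path, only when table_freed was reported back */
if (table_freed)
	kfd_flush_tlb(peer_pdd, KFD_TLB_FLUSH_LEGACY);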
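On Eric's ordering question: a deferred flush can still be correct if
the map path synchronizes with any pending flush work before the range
is reused, so the extra wait is only paid when a map actually races
with an earlier unmap. A hypothetical sketch using a standard kernel
work item (deferred_flush as a per-device field is an assumption):

#include <linux/workqueue.h>

struct kfd_deferred_flush {
	struct work_struct work;
	struct kfd_process_device *pdd;
};

static void kfd_deferred_flush_fn(struct work_struct *work)
{
	struct kfd_deferred_flush *df =
		container_of(work, struct kfd_deferred_flush, work);

	kfd_flush_tlb(df->pdd, KFD_TLB_FLUSH_HEAVYWEIGHT);
}

/* unmap path: queue the heavy-weight flush instead of blocking */
schedule_work(&pdd->deferred_flush.work);

/* map path: make sure any pending flush has finished before the
 * range can be handed out again */
flush_work(&pdd->deferred_flush.work);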
_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/amd-gfx