On Sat, Apr 16, 2022 at 8:43 PM Bas Nieuwenhuizen <bas@xxxxxxxxxxxxxxxxxxx> wrote:
>
> On Wed, Apr 6, 2022 at 4:25 AM Bas Nieuwenhuizen
> <bas@xxxxxxxxxxxxxxxxxxx> wrote:
> >
> > On Mon, Apr 4, 2022 at 8:43 AM Christian König <christian.koenig@xxxxxxx> wrote:
> > >
> > > Hi Bas,
> > >
> > > Am 04.04.22 um 03:32 schrieb Bas Nieuwenhuizen:
> > > > Hi Christian,
> > > >
> > > > Couple of concerns here:
> > > >
> > > > 1) This is missing the option to wait on a syncobj before executing the VA op.
> > >
> > > Uff, that was not really planned in any way.
> > >
> > > Currently all VM operations are scheduled to run behind all CS
> > > operations and that is not easily changeable.
> > >
> > > In other words we planned to disable the VM->CS implicit sync, but not
> > > CS->VM.
> > >
> > > > 2) There are no mechanisms to disable implicit sync yet.
> > >
> > > Specifying a sync_obj will do that.
> > >
> > > > 3) in radv we typically signal multiple syncobjs, so it would be nice
> > > > if we could have that capability here too. Technically we can use the
> > > > transfer ioctls, but it is kinda annoying.
> > >
> > > That shouldn't be much of a problem.
>
> Update: I don't really think this would help, and it is not a big issue
> anyway. The overhead is only 1-5% on the remap operation and the
> alternative doesn't provide the savings I expected.
>
> > >
> > > > I had https://github.com/BNieuwenhuizen/linux/commit/d8a1cce0c4c5c87522ae8866faf4f67be7189ef0
> > > > + radv implementation before we ended up in the situation of waits not
> > > > being a doable thing.
> > >
> > > Well I changed the TLB handling so that waits are doable now :)
> > >
> > > >
> > > > For (1) we can emulate in userspace by waiting for all syncobjs to
> > > > signal first, but I'm concerned that will result in GPU bubbles due to
> > > > CPU work. To test this out I'm starting to hook up more implicit sync
> > > > disabling stuff (starting with the submissions themselves, WIP at
> > > > https://github.com/BNieuwenhuizen/linux/tree/no-implicit-sync), which
> > > > is why you're seeing some random comments on your dma resv usage
> > > > series coming your way.
> > >
> > > Which is rather welcomed. That patch set certainly needs more eyes on it.
> > >
> > > Thanks,
> > > Christian.
> > >
> > > >
> > > > - Bas
> > > >
> > > >
> > > > On Thu, Mar 31, 2022 at 11:47 AM Christian König
> > > > <ckoenig.leichtzumerken@xxxxxxxxx> wrote:
> > > >> To improve synchronization of command submissions with page table updates RADV
> > > >> wants to manually wait for the updates to be completed without affecting
> > > >> parallel submissions.
> > > >>
> > > >> Implement this by allowing to specify a drm_sync_obj handle and a timeline
> > > >> point for the GEM_VA IOCTL.
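As a rough illustration of the userspace emulation mentioned for (1) above -- CPU-waiting on the dependency syncobjs before issuing the VA op -- a minimal sketch could look like the code below. It only assumes libdrm's drmSyncobjWait() and the existing GEM_VA uapi; map_after_syncobjs() is a made-up helper name, not something that exists in radv or libdrm_amdgpu, and error handling is abbreviated. The explicit CPU wait is exactly where the feared GPU bubbles would come from.

#include <stdint.h>
#include <string.h>
#include <xf86drm.h>
#include <amdgpu_drm.h>

/* Hypothetical helper: block on all dependencies on the CPU, then map. */
static int map_after_syncobjs(int fd, uint32_t *wait_handles,
                              unsigned num_waits, uint32_t bo_handle,
                              uint64_t va, uint64_t size)
{
        struct drm_amdgpu_gem_va req;
        int r;

        /* CPU wait until every dependency has signaled. */
        r = drmSyncobjWait(fd, wait_handles, num_waits, INT64_MAX,
                           DRM_SYNCOBJ_WAIT_FLAGS_WAIT_ALL, NULL);
        if (r)
                return r;

        memset(&req, 0, sizeof(req));
        req.handle = bo_handle;
        req.operation = AMDGPU_VA_OP_MAP;
        req.flags = AMDGPU_VM_PAGE_READABLE | AMDGPU_VM_PAGE_WRITEABLE |
                    AMDGPU_VM_PAGE_EXECUTABLE;
        req.va_address = va;
        req.offset_in_bo = 0;
        req.map_size = size;

        /* Only now schedule the actual page table update. */
        return drmCommandWriteRead(fd, DRM_AMDGPU_GEM_VA, &req, sizeof(req));
}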
> > > >>
> > > >> Signed-off-by: Christian König <christian.koenig@xxxxxxx>
> > > >> ---
> > > >>  drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 79 ++++++++++++++++++++-----
> > > >>  include/uapi/drm/amdgpu_drm.h           |  5 +-
> > > >>  2 files changed, 67 insertions(+), 17 deletions(-)
> > > >>
> > > >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
> > > >> index 9cdfee67efeb..bf0092f629f9 100644
> > > >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
> > > >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
> > > >> @@ -33,6 +33,7 @@
> > > >>
> > > >>  #include <drm/amdgpu_drm.h>
> > > >>  #include <drm/drm_drv.h>
> > > >> +#include <drm/drm_syncobj.h>
> > > >>  #include <drm/drm_gem_ttm_helper.h>
> > > >>
> > > >>  #include "amdgpu.h"
> > > >> @@ -598,6 +599,7 @@ int amdgpu_gem_metadata_ioctl(struct drm_device *dev, void *data,
> > > >>   * @vm: vm to update
> > > >>   * @bo_va: bo_va to update
> > > >>   * @operation: map, unmap or clear
> > > >> + * @last_update: optional pointer to a dma_fence for the last VM update
> > > >>   *
> > > >>   * Update the bo_va directly after setting its address. Errors are not
> > > >>   * vital here, so they are not reported back to userspace.
> > > >> @@ -605,20 +607,21 @@ int amdgpu_gem_metadata_ioctl(struct drm_device *dev, void *data,
> > > >>  static void amdgpu_gem_va_update_vm(struct amdgpu_device *adev,
> > > >>                                      struct amdgpu_vm *vm,
> > > >>                                      struct amdgpu_bo_va *bo_va,
> > > >> -                                    uint32_t operation)
> > > >> +                                    uint32_t operation,
> > > >> +                                    struct dma_fence **last_update)
> > > >>  {
> > > >>          int r;
> > > >>
> > > >>          if (!amdgpu_vm_ready(vm))
> > > >>                  return;
> > > >>
> > > >> -        r = amdgpu_vm_clear_freed(adev, vm, NULL);
> > > >> +        r = amdgpu_vm_clear_freed(adev, vm, last_update);
> > > >>          if (r)
> > > >>                  goto error;
> > > >>
> > > >>          if (operation == AMDGPU_VA_OP_MAP ||
> > > >>              operation == AMDGPU_VA_OP_REPLACE) {
> > > >> -                r = amdgpu_vm_bo_update(adev, bo_va, false, NULL);
> > > >> +                r = amdgpu_vm_bo_update(adev, bo_va, false, last_update);
> > > >>                  if (r)
> > > >>                          goto error;
>
> Two further issues here:
>
> 1) We're missing the fence in the pde update:
> https://github.com/BNieuwenhuizen/linux/commit/0762af558151136a74681944ccb2d5e815f5cc57
>
> 2) in a replace, we're not emitting the shrunk mappings of the old
> bo_va (e.g. the prt_va) in the ioctl, which ends up doing a mapping
> update at CS time with implicit sync. I needed
> https://github.com/BNieuwenhuizen/linux/commit/fe2559529f5ce3cafa4efb40b025d3deb8beab68
> to avoid this, but I'm not sure about the lack of reservation story
> here (especially given you called something that looked like this
> problem very deadlock prone in one of our earlier threads).

also forgot: How do we trigger TLB flushes now? I don't think we
update sync->last_vm_update anymore for these explicit fences, no?
>
> > > >>                  }
> > > >> @@ -671,6 +674,9 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
> > > >>          struct drm_gem_object *gobj;
> > > >>          struct amdgpu_device *adev = drm_to_adev(dev);
> > > >>          struct amdgpu_fpriv *fpriv = filp->driver_priv;
> > > >> +        struct dma_fence *fence = dma_fence_get_stub();
> > > >> +        struct dma_fence_chain *chain = NULL;
> > > >> +        struct drm_syncobj *syncobj = NULL;
> > > >>          struct amdgpu_bo *abo;
> > > >>          struct amdgpu_bo_va *bo_va;
> > > >>          struct amdgpu_bo_list_entry vm_pd;
> > > >> @@ -714,17 +720,9 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
> > > >>                  return -EINVAL;
> > > >>          }
> > > >>
> > > >> -        switch (args->operation) {
> > > >> -        case AMDGPU_VA_OP_MAP:
> > > >> -        case AMDGPU_VA_OP_UNMAP:
> > > >> -        case AMDGPU_VA_OP_CLEAR:
> > > >> -        case AMDGPU_VA_OP_REPLACE:
> > > >> -                break;
> > > >> -        default:
> > > >> -                dev_dbg(dev->dev, "unsupported operation %d\n",
> > > >> -                        args->operation);
> > > >> -                return -EINVAL;
> > > >> -        }
> > > >> +        /* For debugging delay all VM updates till CS time */
> > > >> +        if (!amdgpu_vm_debug)
> > > >> +                args->flags |= AMDGPU_VM_DELAY_UPDATE;
> > FWIW in my testing syncobj + AMDGPU_VM_DELAY_UPDATE seems to make the
> > explicit sync not work (because we don't execute the map operations,
> > and then the delayed operations don't know about the syncobj). So
> > besides deleting this test code, should we reject
> > AMDGPU_VM_DELAY_UPDATE + syncobj?
> >
> > With that I've had success getting some overlap (e.g.
> > https://drive.google.com/file/d/1ER1fmx6jmZi1yHDyn0gefitmqHUGM0iL/view?usp=sharing),
> > though I still got a bit more than expected implicit synchronization
> > going on due to creation of new buffers and their initial map
> > operations. Though that seems to be squarely a radv problem :)
> >
> > As a side note, are MAP/UNMAP operations without AMDGPU_VM_PAGE_PRT
> > valid or do we need to guard for them?
>
> > > >>
> > > >>          INIT_LIST_HEAD(&list);
> > > >>          INIT_LIST_HEAD(&duplicates);
> > > >> @@ -763,6 +761,30 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
> > > >>                  bo_va = NULL;
> > > >>          }
> > > >>
> > > >> +        if (args->syncobj) {
> > > >> +                syncobj = drm_syncobj_find(filp, args->syncobj);
> > > >> +                if (!syncobj) {
> > > >> +                        r = -EINVAL;
> > > >> +                        goto error_backoff;
> > > >> +                }
> > > >> +
> > > >> +                if (args->timeline_point) {
> > > >> +                        chain = dma_fence_chain_alloc();
> > > >> +                        if (!chain) {
> > > >> +                                r = -ENOMEM;
> > > >> +                                goto error_put_syncobj;
> > > >> +                        }
> > > >> +                }
> > > >> +
> > > >> +                /*
> > > >> +                 * Update the VM once before to make sure there are no other
> > > >> +                 * pending updates.
> > > >> +                 */
> > > >> +                if (!(args->flags & AMDGPU_VM_DELAY_UPDATE))
> > > >> +                        amdgpu_gem_va_update_vm(adev, &fpriv->vm, bo_va,
> > > >> +                                                args->operation, NULL);
> > > >> +        }
> > > >> +
> > > >>          switch (args->operation) {
> > > >>          case AMDGPU_VA_OP_MAP:
> > > >>                  va_flags = amdgpu_gem_va_map_flags(adev, args->flags);
> > > >> @@ -786,17 +808,42 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
> > > >>                                            va_flags);
> > > >>                  break;
> > > >>          default:
> > > >> +                dev_dbg(dev->dev, "unsupported operation %d\n",
> > > >> +                        args->operation);
> > > >> +                r = -EINVAL;
> > > >>                  break;
> > > >>          }
> > > >> -        if (!r && !(args->flags & AMDGPU_VM_DELAY_UPDATE) && !amdgpu_vm_debug)
> > > >> +        if (r)
> > > >> +                goto error_free_chain;
> > > >> +
> > > >> +        if (!(args->flags & AMDGPU_VM_DELAY_UPDATE))
> > > >>                  amdgpu_gem_va_update_vm(adev, &fpriv->vm, bo_va,
> > > >> -                                        args->operation);
> > > >> +                                        args->operation, syncobj ?
> > > >> +                                        &fence : NULL);
> > > >> +
> > > >> +        if (syncobj) {
> > > >> +                if (chain) {
> > > >> +                        drm_syncobj_add_point(syncobj, chain, fence,
> > > >> +                                              args->timeline_point);
> > > >> +                        chain = NULL;
> > > >> +                } else {
> > > >> +                        drm_syncobj_replace_fence(syncobj, fence);
> > > >> +                }
> > > >> +        }
> > > >> +
> > > >> +error_free_chain:
> > > >> +        dma_fence_chain_free(chain);
> > > >> +
> > > >> +error_put_syncobj:
> > > >> +        if (syncobj)
> > > >> +                drm_syncobj_put(syncobj);
> > > >>
> > > >>  error_backoff:
> > > >>          ttm_eu_backoff_reservation(&ticket, &list);
> > > >>
> > > >>  error_unref:
> > > >>          drm_gem_object_put(gobj);
> > > >> +        dma_fence_put(fence);
> > > >>          return r;
> > > >>  }
> > > >>
> > > >> diff --git a/include/uapi/drm/amdgpu_drm.h b/include/uapi/drm/amdgpu_drm.h
> > > >> index 1d65c1fbc4ec..f84b5f2c817c 100644
> > > >> --- a/include/uapi/drm/amdgpu_drm.h
> > > >> +++ b/include/uapi/drm/amdgpu_drm.h
> > > >> @@ -533,7 +533,8 @@ struct drm_amdgpu_gem_op {
> > > >>  struct drm_amdgpu_gem_va {
> > > >>          /** GEM object handle */
> > > >>          __u32 handle;
> > > >> -        __u32 _pad;
> > > >> +        /** Optional DRM Syncobj to signal when operation completes */
> > > >> +        __u32 syncobj;
> > > >>          /** AMDGPU_VA_OP_* */
> > > >>          __u32 operation;
> > > >>          /** AMDGPU_VM_PAGE_* */
> > > >> @@ -544,6 +545,8 @@ struct drm_amdgpu_gem_va {
> > > >>          __u64 offset_in_bo;
> > > >>          /** Specify mapping size. Must be correctly aligned. */
> > > >>          __u64 map_size;
> > > >> +        /** Optional Syncobj timeline point to signal */
> > > >> +        __u64 timeline_point;
> > > >>  };
> > > >>
> > > >>  #define AMDGPU_HW_IP_GFX 0
> > > >> --
> > > >> 2.25.1
> > > >>
> > >
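For completeness, a rough sketch of how userspace might consume the uapi proposed above: issue the VA op so that it signals a point on a timeline syncobj, then make the dependent submission wait on that point (or, as a fallback, block on the CPU). This assumes a kernel and uapi header with this patch applied -- the syncobj and timeline_point fields do not exist in the upstream struct drm_amdgpu_gem_va -- and the helper names are made up; error handling is omitted.

#include <stdint.h>
#include <string.h>
#include <xf86drm.h>
#include <amdgpu_drm.h>

/* Hypothetical helper: ask the kernel to signal (timeline_syncobj, point)
 * once the page table update for this mapping has completed. */
static int map_with_timeline(int fd, uint32_t bo_handle, uint64_t va,
                             uint64_t size, uint32_t timeline_syncobj,
                             uint64_t point)
{
        struct drm_amdgpu_gem_va req;

        memset(&req, 0, sizeof(req));
        req.handle = bo_handle;
        req.syncobj = timeline_syncobj;   /* new field from this patch */
        req.timeline_point = point;       /* new field from this patch */
        req.operation = AMDGPU_VA_OP_MAP;
        req.flags = AMDGPU_VM_PAGE_READABLE | AMDGPU_VM_PAGE_WRITEABLE;
        req.va_address = va;
        req.offset_in_bo = 0;
        req.map_size = size;

        return drmCommandWriteRead(fd, DRM_AMDGPU_GEM_VA, &req, sizeof(req));
}

/* Before a submission that relies on the new mapping, either pass
 * (timeline_syncobj, point) as a wait dependency of the CS, or fall
 * back to a CPU wait: */
static int wait_for_map(int fd, uint32_t timeline_syncobj, uint64_t point)
{
        return drmSyncobjTimelineWait(fd, &timeline_syncobj, &point, 1,
                                      INT64_MAX,
                                      DRM_SYNCOBJ_WAIT_FLAGS_WAIT_ALL, NULL);
}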