Re: [RFC] amdgpu: Add a context flag to disable implicit sync

Joshua Ashton <joshua@xxxxxxxxx> · Wed, 07 Aug 2024 20:23:13 +0100

I was thinking about this more recently. I was initially considering "maybe this should be a per-BO import," but I couldn't think of anything in the GL model that would actually benefit given its not "true" bindless and there's no update-after-bind there.

Worth others more familiar with GL asking that question to themselves also. I am definitely not totally up on what's possible there.

Overall, I think I am OK with this approach, even though I think mixing implicit and explicit sync is gross, and I want the pain that is implicit sync to just go away forever. :-)

- Joshie 🐸✨

On August 7, 2024 4:39:32 PM GMT+01:00, Faith Ekstrand <faith@xxxxxxxxxxxxx> wrote:
>Previously, AMDGPU_GEM_CREATE_EXPLICIT_SYNC was used to disable implicit
>synchronization on BOs when explicit synchronization can be used.  The
>problem is that this flag is per-BO and affects all amdgpu users in the
>system, not just the usermode drver which sets it.  This can lead to
>some unintended consequences for userspace if not used carefully.
>
>Since the introduction of DMA_BUF_IOCTL_EXPORT_SYNC_FILE and
>DMA_BUF_IOCTL_IMPORT_SYNC_FILE, many userspace window system components
>have grown the ability to convert between the Vulkan explicit sync model
>and the legacy implicit sync model used by X11 and Wayland in the past.
>This allows both old and new components to exist simultaneously and talk
>to each other.  In particular, XWayland is able to convert between the
>two to let Vulkan apps work seamlessly with older X11 compositors that
>aren't aware of explicit synchronizaiton.  This is rapidly becoming the
>backbone of synchronization in the Linux window system space.
>
>Unfortunately, AMDGPU_GEM_CREATE_EXPLICIT_SYNC breaks this because it
>disables all implicit synchronization on the given BO, regardless of
>which userspace driver is rendering with it and regardless of the
>assumptions made by the client application.  In particular, this is
>causing issues with RADV and radeonsi.  RADV sets the flag to disable
>implicit sync on window system buffers so that it can safely have them
>resident at all times without causing internal over-synchronization.
>The BO is then handed off to a compositor which composites using
>radeonsi.  If the compositor uses the EGL_ANDROID_native_fence_sync
>extension to pass explicit sync files through to radeonsi, then
>everything is fine.  However, if that buffer ever gets handed to an
>instance of radeonsi with any assumption of implicit synchronization,
>radeonsi won't be able sync on the BO because RADV disabled implict sync
>on that BO system-wide.  It doesn't matter whether some window system
>component used DMA_BUF_IOCTL_IMPORT_SYNC_FILE to set the appropriate
>fence on the BO, amdgpu will ignore it.
>
>This new flag disables implicit sync on the context rather than the BO.
>This way RADV can disable implicit sync (which is what RADV has always
>wanted) without affecting other components in the system.  If RADV (or
>some other driver) wants implicit sync on some BO, it can use
>DMA_BUF_IOCTL_EXPORT_SYNC_FILE and DMA_BUF_IOCTL_IMPORT_SYNC_FILE to
>manually synchronize with other implicit-sync components.  This is the
>approach we've taken with NVK/nouveau, ANV/xe, and similar to the
>approach taken by ANV/i915 and it works well for those drivers.
>
>Ideally, I would like to see something like this back-ported to at least
>the kernel where DMA_BUF_IOCTL_IMPORT/EXPORT_SYNC_FILE were introduced
>so that we don't have to wait another year for the fix to reach users.
>However, I understand that back-porting UAPI is problematic and I'll
>leave that decision up to the amdgpu maintainers.  Michel suggested that
>a new CTX_OP would make more sense if we want to back-port it but the
>create flag made more sense to me from an API design PoV.
>
>Signed-off-by: Faith Ekstrand <faith.ekstrand@xxxxxxxxxxxxx>
>Cc: Alex Deucher <alexander.deucher@xxxxxxx>
>Cc: Christian König <christian.koenig@xxxxxxx>
>Cc: David Airlie <airlied@xxxxxxxxx>
>Cc: Michel Dänzer <mdaenzer@xxxxxxxxxx>
>Cc: Bas Nieuwenhuizen <bas@xxxxxxxxxxxxxxxxxxx>
>---
> drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c  |  3 ++-
> drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 12 ++++++++----
> drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h |  7 +++++++
> include/uapi/drm/amdgpu_drm.h           | 12 +++++++++++-
> 4 files changed, 28 insertions(+), 6 deletions(-)
>
>diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>index ec888fc6ead8..8410b4426541 100644
>--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>@@ -1196,7 +1196,8 @@ static int amdgpu_cs_sync_rings(struct amdgpu_cs_parser *p)
> 		struct dma_resv *resv = bo->tbo.base.resv;
> 		enum amdgpu_sync_mode sync_mode;
> 
>-		sync_mode = amdgpu_bo_explicit_sync(bo) ?
>+		sync_mode = (amdgpu_ctx_explicit_sync(p->ctx) ||
>+			     amdgpu_bo_explicit_sync(bo)) ?
> 			AMDGPU_SYNC_EXPLICIT : AMDGPU_SYNC_NE_OWNER;
> 		r = amdgpu_sync_resv(p->adev, &p->sync, resv, sync_mode,
> 				     &fpriv->vm);
>diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>index 5cb33ac99f70..a304740ccedf 100644
>--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>@@ -318,7 +318,8 @@ static int amdgpu_ctx_get_stable_pstate(struct amdgpu_ctx *ctx,
> }
> 
> static int amdgpu_ctx_init(struct amdgpu_ctx_mgr *mgr, int32_t priority,
>-			   struct drm_file *filp, struct amdgpu_ctx *ctx)
>+			   uint32_t flags, struct drm_file *filp,
>+			   struct amdgpu_ctx *ctx)
> {
> 	struct amdgpu_fpriv *fpriv = filp->driver_priv;
> 	u32 current_stable_pstate;
>@@ -334,6 +335,7 @@ static int amdgpu_ctx_init(struct amdgpu_ctx_mgr *mgr, int32_t priority,
> 	ctx->mgr = mgr;
> 	spin_lock_init(&ctx->ring_lock);
> 
>+	ctx->flags = flags;
> 	ctx->reset_counter = atomic_read(&mgr->adev->gpu_reset_counter);
> 	ctx->reset_counter_query = ctx->reset_counter;
> 	ctx->generation = amdgpu_vm_generation(mgr->adev, &fpriv->vm);
>@@ -474,6 +476,7 @@ static int amdgpu_ctx_alloc(struct amdgpu_device *adev,
> 			    struct amdgpu_fpriv *fpriv,
> 			    struct drm_file *filp,
> 			    int32_t priority,
>+			    uint32_t flags,
> 			    uint32_t *id)
> {
> 	struct amdgpu_ctx_mgr *mgr = &fpriv->ctx_mgr;
>@@ -493,7 +496,7 @@ static int amdgpu_ctx_alloc(struct amdgpu_device *adev,
> 	}
> 
> 	*id = (uint32_t)r;
>-	r = amdgpu_ctx_init(mgr, priority, filp, ctx);
>+	r = amdgpu_ctx_init(mgr, priority, flags, filp, ctx);
> 	if (r) {
> 		idr_remove(&mgr->ctx_handles, *id);
> 		*id = 0;
>@@ -666,7 +669,7 @@ int amdgpu_ctx_ioctl(struct drm_device *dev, void *data,
> 		     struct drm_file *filp)
> {
> 	int r;
>-	uint32_t id, stable_pstate;
>+	uint32_t id, stable_pstate, flags;
> 	int32_t priority;
> 
> 	union drm_amdgpu_ctx *args = data;
>@@ -675,6 +678,7 @@ int amdgpu_ctx_ioctl(struct drm_device *dev, void *data,
> 
> 	id = args->in.ctx_id;
> 	priority = args->in.priority;
>+	flags = args->in.flags;
> 
> 	/* For backwards compatibility, we need to accept ioctls with garbage
> 	 * in the priority field. Garbage values in the priority field, result
>@@ -685,7 +689,7 @@ int amdgpu_ctx_ioctl(struct drm_device *dev, void *data,
> 
> 	switch (args->in.op) {
> 	case AMDGPU_CTX_OP_ALLOC_CTX:
>-		r = amdgpu_ctx_alloc(adev, fpriv, filp, priority, &id);
>+		r = amdgpu_ctx_alloc(adev, fpriv, filp, priority, flags, &id);
> 		args->out.alloc.ctx_id = id;
> 		break;
> 	case AMDGPU_CTX_OP_FREE_CTX:
>diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h
>index 85376baaa92f..9431c8d2ea59 100644
>--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h
>+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h
>@@ -45,6 +45,7 @@ struct amdgpu_ctx_entity {
> struct amdgpu_ctx {
> 	struct kref			refcount;
> 	struct amdgpu_ctx_mgr		*mgr;
>+	uint32_t			flags;
> 	unsigned			reset_counter;
> 	unsigned			reset_counter_query;
> 	uint64_t			generation;
>@@ -84,6 +85,12 @@ struct dma_fence *amdgpu_ctx_get_fence(struct amdgpu_ctx *ctx,
> bool amdgpu_ctx_priority_is_valid(int32_t ctx_prio);
> void amdgpu_ctx_priority_override(struct amdgpu_ctx *ctx, int32_t ctx_prio);
> 
>+static inline bool
>+amdgpu_ctx_explicit_sync(struct amdgpu_ctx *ctx)
>+{
>+	return ctx->flags & AMDGPU_CTX_ALLOC_FLAGS_EXPLICIT_SYNC;
>+}
>+
> int amdgpu_ctx_ioctl(struct drm_device *dev, void *data,
> 		     struct drm_file *filp);
> 
>diff --git a/include/uapi/drm/amdgpu_drm.h b/include/uapi/drm/amdgpu_drm.h
>index 96e32dafd4f0..e9d87a6e3d86 100644
>--- a/include/uapi/drm/amdgpu_drm.h
>+++ b/include/uapi/drm/amdgpu_drm.h
>@@ -125,7 +125,14 @@ extern "C" {
> #define AMDGPU_GEM_CREATE_VRAM_CONTIGUOUS	(1 << 5)
> /* Flag that BO is always valid in this VM */
> #define AMDGPU_GEM_CREATE_VM_ALWAYS_VALID	(1 << 6)
>-/* Flag that BO sharing will be explicitly synchronized */
>+/* Flag that BO sharing will be explicitly synchronized
>+ *
>+ * This flag should not be used unless the client can guarantee that no
>+ * other driver which ever touches this BO will ever want to use implicit
>+ * synchronization as it disables implicit sync on this BO system-wide.
>+ * Instead, drivers which use an explicit synchronization model should
>+ * prefer AMDGPU_CTX_ALLOC_FLAGS_EXPLICIT_SYNC.
>+ */
> #define AMDGPU_GEM_CREATE_EXPLICIT_SYNC		(1 << 7)
> /* Flag that indicates allocating MQD gart on GFX9, where the mtype
>  * for the second page onward should be set to NC. It should never
>@@ -240,6 +247,9 @@ union drm_amdgpu_bo_list {
> #define AMDGPU_CTX_OP_GET_STABLE_PSTATE	5
> #define AMDGPU_CTX_OP_SET_STABLE_PSTATE	6
> 
>+/* indicate that all synchronization will be explicit */
>+#define AMDGPU_CTX_ALLOC_FLAGS_EXPLICIT_SYNC (1<<0)
>+
> /* GPU reset status */
> #define AMDGPU_CTX_NO_RESET		0
> /* this the context caused it */