Re: [PATCH 15/15] RFC: drm/amdgpu: Implement a proper implicit fencing uapi

Daniel Vetter <daniel.vetter@xxxxxxxx> · Wed, 23 Jun 2021 14:18:35 +0200

On Wed, Jun 23, 2021 at 11:45 AM Bas Nieuwenhuizen
<bas@xxxxxxxxxxxxxxxxxxx> wrote:
>
> On Tue, Jun 22, 2021 at 6:55 PM Daniel Vetter <daniel.vetter@xxxxxxxx> wrote:
> >
> > WARNING: Absolutely untested beyond "gcc isn't dying in agony".
> >
> > Implicit fencing done properly needs to treat the implicit fencing
> > slots like a funny kind of IPC mailbox. In other words it needs to be
> > explicitly. This is the only way it will mesh well with explicit
> > fencing userspace like vk, and it's also the bare minimum required to
> > be able to manage anything else that wants to use the same buffer on
> > multiple engines in parallel, and still be able to share it through
> > implicit sync.
> >
> > amdgpu completely lacks such an uapi. Fix this.
> >
> > Luckily the concept of ignoring implicit fences exists already, and
> > takes care of all the complexities of making sure that non-optional
> > fences (like bo moves) are not ignored. This support was added in
> >
> > commit 177ae09b5d699a5ebd1cafcee78889db968abf54
> > Author: Andres Rodriguez <andresx7@xxxxxxxxx>
> > Date:   Fri Sep 15 20:44:06 2017 -0400
> >
> >     drm/amdgpu: introduce AMDGPU_GEM_CREATE_EXPLICIT_SYNC v2
> >
> > Unfortuantely it's the wrong semantics, because it's a bo flag and
> > disables implicit sync on an allocated buffer completely.
> >
> > We _do_ want implicit sync, but control it explicitly. For this we
> > need a flag on the drm_file, so that a given userspace (like vulkan)
> > can manage the implicit sync slots explicitly. The other side of the
> > pipeline (compositor, other process or just different stage in a media
> > pipeline in the same process) can then either do the same, or fully
> > participate in the implicit sync as implemented by the kernel by
> > default.
> >
> > By building on the existing flag for buffers we avoid any issues with
> > opening up additional security concerns - anything this new flag here
> > allows is already.
> >
> > All drivers which supports this concept of a userspace-specific
> > opt-out of implicit sync have a flag in their CS ioctl, but in reality
> > that turned out to be a bit too inflexible. See the discussion below,
> > let's try to do a bit better for amdgpu.
> >
> > This alone only allows us to completely avoid any stalls due to
> > implicit sync, it does not yet allow us to use implicit sync as a
> > strange form of IPC for sync_file.
> >
> > For that we need two more pieces:
> >
> > - a way to get the current implicit sync fences out of a buffer. Could
> >   be done in a driver ioctl, but everyone needs this, and generally a
> >   dma-buf is involved anyway to establish the sharing. So an ioctl on
> >   the dma-buf makes a ton more sense:
> >
> >   https://lore.kernel.org/dri-devel/20210520190007.534046-4-jason@xxxxxxxxxxxxxx/
> >
> >   Current drivers in upstream solves this by having the opt-out flag
> >   on their CS ioctl. This has the downside that very often the CS
> >   which must actually stall for the implicit fence is run a while
> >   after the implicit fence point was logically sampled per the api
> >   spec (vk passes an explicit syncobj around for that afaiui), and so
> >   results in oversync. Converting the implicit sync fences into a
> >   snap-shot sync_file is actually accurate.
> >
> > - Simillar we need to be able to set the exclusive implicit fence.
> >   Current drivers again do this with a CS ioctl flag, with again the
> >   same problems that the time the CS happens additional dependencies
> >   have been added. An explicit ioctl to only insert a sync_file (while
> >   respecting the rules for how exclusive and shared fence slots must
> >   be update in struct dma_resv) is much better. This is proposed here:
> >
> >   https://lore.kernel.org/dri-devel/20210520190007.534046-5-jason@xxxxxxxxxxxxxx/
> >
> > These three pieces together allow userspace to fully control implicit
> > fencing and remove all unecessary stall points due to them.
> >
> > Well, as much as the implicit fencing model fundamentally allows:
> > There is only one set of fences, you can only choose to sync against
> > only writers (exclusive slot), or everyone. Hence suballocating
> > multiple buffers or anything else like this is fundamentally not
> > possible, and can only be fixed by a proper explicit fencing model.
> >
> > Aside from that caveat this model gets implicit fencing as closely to
> > explicit fencing semantics as possible:
> >
> > On the actual implementation I opted for a simple setparam ioctl, no
> > locking (just atomic reads/writes) for simplicity. There is a nice
> > flag parameter in the VM ioctl which we could use, except:
> > - it's not checked, so userspace likely passes garbage
> > - there's already a comment that userspace _does_ pass garbage in the
> >   priority field
> > So yeah unfortunately this flag parameter for setting vm flags is
> > useless, and we need to hack up a new one.
> >
> > v2: Explain why a new SETPARAM (Jason)
> >
> > v3: Bas noticed I forgot to hook up the dependency-side shortcut. We
> > need both, or this doesn't do much.
> >
> > v4: Rebase over the amdgpu patch to always set the implicit sync
> > fences.
>
> So I think there is still a case missing in this implementation.
> Consider these 3 cases
>
> (format: a->b: b waits on a. Yes, I know arrows are hard)
>
> explicit->explicit: This doesn't wait now, which is good
> Implicit->explicit: This doesn't wait now, which is good
> explicit->implicit : This still waits as the explicit submission still
> adds shared fences and most things that set an exclusive fence for
> implicit sync will hence wait on it.
>
> This is probably good enough for what radv needs now but also sounds
> like a risk wrt baking in new uapi behavior that we don't want to be
> the end result.
>
> Within AMDGPU this is probably solvable in two ways:
>
> 1) Downgrade AMDGPU_SYNC_NE_OWNER to AMDGPU_SYNC_EXPLICIT for shared fences.

I'm not sure that works. I think the right fix is that radeonsi also
switches to this model, with maybe a per-bo CS flag to set indicate
write access, to cut down on the number of ioctls that are needed
otherwise on shared buffers. This per-bo flag would essentially select
between SYNC_NE_OWNER and SYNC_EXPLICIT on a per-buffer basis.

The current amdgpu uapi just doesn't allow any other model without an
explicit opt-in. So current implicit sync userspace just has to
oversync, there's not much choice.

> 2) Have an EXPLICIT fence owner that is used for explicit submissions
> that is ignored by AMDGPU_SYNC_NE_OWNER.
>
> But this doesn't solve cross-driver interactions here.

Yeah cross-driver is still entirely unsolved, because
amdgpu_bo_explicit_sync() on the bo didn't solve that either.
-Daniel

>
> >
> > Cc: mesa-dev@xxxxxxxxxxxxxxxxxxxxx
> > Cc: Bas Nieuwenhuizen <bas@xxxxxxxxxxxxxxxxxxx>
> > Cc: Dave Airlie <airlied@xxxxxxxxx>
> > Cc: Rob Clark <robdclark@xxxxxxxxxxxx>
> > Cc: Kristian H. Kristensen <hoegsberg@xxxxxxxxxx>
> > Cc: Michel Dänzer <michel@xxxxxxxxxxx>
> > Cc: Daniel Stone <daniels@xxxxxxxxxxxxx>
> > Cc: Sumit Semwal <sumit.semwal@xxxxxxxxxx>
> > Cc: "Christian König" <christian.koenig@xxxxxxx>
> > Cc: Alex Deucher <alexander.deucher@xxxxxxx>
> > Cc: Daniel Vetter <daniel.vetter@xxxxxxxx>
> > Cc: Deepak R Varma <mh12gx2825@xxxxxxxxx>
> > Cc: Chen Li <chenli@xxxxxxxxxxxxx>
> > Cc: Kevin Wang <kevin1.wang@xxxxxxx>
> > Cc: Dennis Li <Dennis.Li@xxxxxxx>
> > Cc: Luben Tuikov <luben.tuikov@xxxxxxx>
> > Cc: linaro-mm-sig@xxxxxxxxxxxxxxxx
> > Signed-off-by: Daniel Vetter <daniel.vetter@xxxxxxxxx>
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c  |  7 +++++--
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 21 +++++++++++++++++++++
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h  |  6 ++++++
> >  include/uapi/drm/amdgpu_drm.h           | 10 ++++++++++
> >  4 files changed, 42 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> > index 65df34c17264..c5386d13eb4a 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> > @@ -498,6 +498,7 @@ static int amdgpu_cs_parser_bos(struct amdgpu_cs_parser *p,
> >         struct amdgpu_bo *gds;
> >         struct amdgpu_bo *gws;
> >         struct amdgpu_bo *oa;
> > +       bool no_implicit_sync = READ_ONCE(fpriv->vm.no_implicit_sync);
> >         int r;
> >
> >         INIT_LIST_HEAD(&p->validated);
> > @@ -577,7 +578,8 @@ static int amdgpu_cs_parser_bos(struct amdgpu_cs_parser *p,
> >
> >                 e->bo_va = amdgpu_vm_bo_find(vm, bo);
> >
> > -               if (bo->tbo.base.dma_buf && !amdgpu_bo_explicit_sync(bo)) {
> > +               if (bo->tbo.base.dma_buf &&
> > +                   !(no_implicit_sync || amdgpu_bo_explicit_sync(bo))) {
> >                         e->chain = dma_fence_chain_alloc();
> >                         if (!e->chain) {
> >                                 r = -ENOMEM;
> > @@ -649,6 +651,7 @@ static int amdgpu_cs_sync_rings(struct amdgpu_cs_parser *p)
> >  {
> >         struct amdgpu_fpriv *fpriv = p->filp->driver_priv;
> >         struct amdgpu_bo_list_entry *e;
> > +       bool no_implicit_sync = READ_ONCE(fpriv->vm.no_implicit_sync);
> >         int r;
> >
> >         list_for_each_entry(e, &p->validated, tv.head) {
> > @@ -656,7 +659,7 @@ static int amdgpu_cs_sync_rings(struct amdgpu_cs_parser *p)
> >                 struct dma_resv *resv = bo->tbo.base.resv;
> >                 enum amdgpu_sync_mode sync_mode;
> >
> > -               sync_mode = amdgpu_bo_explicit_sync(bo) ?
> > +               sync_mode = no_implicit_sync || amdgpu_bo_explicit_sync(bo) ?
> >                         AMDGPU_SYNC_EXPLICIT : AMDGPU_SYNC_NE_OWNER;
> >                 r = amdgpu_sync_resv(p->adev, &p->job->sync, resv, sync_mode,
> >                                      &fpriv->vm);
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > index c080ba15ae77..f982626b5328 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > @@ -1724,6 +1724,26 @@ int amdgpu_file_to_fpriv(struct file *filp, struct amdgpu_fpriv **fpriv)
> >         return 0;
> >  }
> >
> > +int amdgpu_setparam_ioctl(struct drm_device *dev, void *data,
> > +                         struct drm_file *filp)
> > +{
> > +       struct drm_amdgpu_setparam *setparam = data;
> > +       struct amdgpu_fpriv *fpriv = filp->driver_priv;
> > +
> > +       switch (setparam->param) {
> > +       case AMDGPU_SETPARAM_NO_IMPLICIT_SYNC:
> > +               if (setparam->value)
> > +                       WRITE_ONCE(fpriv->vm.no_implicit_sync, true);
> > +               else
> > +                       WRITE_ONCE(fpriv->vm.no_implicit_sync, false);
> > +               break;
> > +       default:
> > +               return -EINVAL;
> > +       }
> > +
> > +       return 0;
> > +}
> > +
> >  const struct drm_ioctl_desc amdgpu_ioctls_kms[] = {
> >         DRM_IOCTL_DEF_DRV(AMDGPU_GEM_CREATE, amdgpu_gem_create_ioctl, DRM_AUTH|DRM_RENDER_ALLOW),
> >         DRM_IOCTL_DEF_DRV(AMDGPU_CTX, amdgpu_ctx_ioctl, DRM_AUTH|DRM_RENDER_ALLOW),
> > @@ -1742,6 +1762,7 @@ const struct drm_ioctl_desc amdgpu_ioctls_kms[] = {
> >         DRM_IOCTL_DEF_DRV(AMDGPU_GEM_VA, amdgpu_gem_va_ioctl, DRM_AUTH|DRM_RENDER_ALLOW),
> >         DRM_IOCTL_DEF_DRV(AMDGPU_GEM_OP, amdgpu_gem_op_ioctl, DRM_AUTH|DRM_RENDER_ALLOW),
> >         DRM_IOCTL_DEF_DRV(AMDGPU_GEM_USERPTR, amdgpu_gem_userptr_ioctl, DRM_AUTH|DRM_RENDER_ALLOW),
> > +       DRM_IOCTL_DEF_DRV(AMDGPU_SETPARAM, amdgpu_setparam_ioctl, DRM_AUTH|DRM_RENDER_ALLOW),
> >  };
> >
> >  static const struct drm_driver amdgpu_kms_driver = {
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
> > index ddb85a85cbba..0e8c440c6303 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
> > @@ -321,6 +321,12 @@ struct amdgpu_vm {
> >         bool                    bulk_moveable;
> >         /* Flag to indicate if VM is used for compute */
> >         bool                    is_compute_context;
> > +       /*
> > +        * Flag to indicate whether implicit sync should always be skipped on
> > +        * this context. We do not care about races at all, userspace is allowed
> > +        * to shoot itself with implicit sync to its fullest liking.
> > +        */
> > +       bool no_implicit_sync;
> >  };
> >
> >  struct amdgpu_vm_manager {
> > diff --git a/include/uapi/drm/amdgpu_drm.h b/include/uapi/drm/amdgpu_drm.h
> > index 0cbd1540aeac..9eae245c14d6 100644
> > --- a/include/uapi/drm/amdgpu_drm.h
> > +++ b/include/uapi/drm/amdgpu_drm.h
> > @@ -54,6 +54,7 @@ extern "C" {
> >  #define DRM_AMDGPU_VM                  0x13
> >  #define DRM_AMDGPU_FENCE_TO_HANDLE     0x14
> >  #define DRM_AMDGPU_SCHED               0x15
> > +#define DRM_AMDGPU_SETPARAM            0x16
> >
> >  #define DRM_IOCTL_AMDGPU_GEM_CREATE    DRM_IOWR(DRM_COMMAND_BASE + DRM_AMDGPU_GEM_CREATE, union drm_amdgpu_gem_create)
> >  #define DRM_IOCTL_AMDGPU_GEM_MMAP      DRM_IOWR(DRM_COMMAND_BASE + DRM_AMDGPU_GEM_MMAP, union drm_amdgpu_gem_mmap)
> > @@ -71,6 +72,7 @@ extern "C" {
> >  #define DRM_IOCTL_AMDGPU_VM            DRM_IOWR(DRM_COMMAND_BASE + DRM_AMDGPU_VM, union drm_amdgpu_vm)
> >  #define DRM_IOCTL_AMDGPU_FENCE_TO_HANDLE DRM_IOWR(DRM_COMMAND_BASE + DRM_AMDGPU_FENCE_TO_HANDLE, union drm_amdgpu_fence_to_handle)
> >  #define DRM_IOCTL_AMDGPU_SCHED         DRM_IOW(DRM_COMMAND_BASE + DRM_AMDGPU_SCHED, union drm_amdgpu_sched)
> > +#define DRM_IOCTL_AMDGPU_SETPARAM      DRM_IOW(DRM_COMMAND_BASE + DRM_AMDGPU_SETPARAM, struct drm_amdgpu_setparam)
> >
> >  /**
> >   * DOC: memory domains
> > @@ -306,6 +308,14 @@ union drm_amdgpu_sched {
> >         struct drm_amdgpu_sched_in in;
> >  };
> >
> > +#define AMDGPU_SETPARAM_NO_IMPLICIT_SYNC       1
> > +
> > +struct drm_amdgpu_setparam {
> > +       /* AMDGPU_SETPARAM_* */
> > +       __u32   param;
> > +       __u32   value;
> > +};
> > +
> >  /*
> >   * This is not a reliable API and you should expect it to fail for any
> >   * number of reasons and have fallback path that do not use userptr to
> > --
> > 2.32.0.rc2
> >

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch