[PATCH 1/5] drm/amdgpu:keep ctx alive till all job finished

> Please give me some details on the dma_fence approach, thanks!
>
I think the idea of using the fence_context is even better than using 
the fence_status. It has already been present for a while, and filtering the 
jobs/entities by it needs something like 10 lines of code.
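
A minimal, untested sketch of what I mean; "bad_job" and "guilty_context" 
are just placeholders, and the list/field names are what I remember from 
the current scheduler, so treat them as assumptions:

    /* fence_context of the job that triggered the timeout */
    u64 guilty_context = bad_job->s_fence->scheduled.context;
    struct amd_sched_job *s_job, *tmp;

    /* drop every not-yet-resubmitted job from the same context */
    list_for_each_entry_safe(s_job, tmp, &sched->ring_mirror_list, node) {
            if (s_job->s_fence->scheduled.context == guilty_context) {
                    list_del_init(&s_job->node);
                    /* signal/free the kicked-out job as appropriate */
            }
    }
    /* similarly, walk the run queue and drop the entity whose
     * fence_context matches */
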

The fence status is new and only available on newer kernels (we would 
need to rebase to amd-staging-4.11).
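
For reference, on a tree with the renamed dma_fence interface (4.11-ish) 
the status idea would look roughly like this; the error code and call 
sites are only assumptions for illustration:

    /* TDR handler: poison the bad job's fence before it gets signaled */
    dma_fence_set_error(&bad_job->s_fence->finished, -ECANCELED);

    /* anyone holding the fence can later tell the job was canceled */
    if (dma_fence_get_status(fence) < 0)
            /* came from the guilty context, skip or drop it */;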

I will try to dig up later today how the Intel guys solved that problem.

Christian.

On 03.05.2017 at 11:14, Liu, Monk wrote:
>
> I thought of that in the first place as well: keep an ID and check it 
> against every job that has not signaled yet to see whether the job 
> belongs to something guilty. But keeping the ctx alive is simpler, so I 
> chose this way.
>
> But I admit it is doable as well, and I want to compare this method 
> with the dma_fence status field you mentioned.
>
> Please give me some details on the dma_fence approach, thanks!
>
> BR Monk
>
> *From:* Christian König [mailto:deathsimple at vodafone.de]
> *Sent:* Wednesday, May 03, 2017 5:12 PM
> *To:* Zhou, David(ChunMing) <David1.Zhou at amd.com>; Liu, Monk 
> <Monk.Liu at amd.com>; amd-gfx at lists.freedesktop.org
> *Subject:* Re: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job finished
>
>     > 1) fence_context has a chance to incorrectly represent the context
>     > behind it, because the number can be used up and will wrap
>     > around from the beginning
>
>     No, fence_context is globally unique in the kernel.
>
> Yeah, I would go with that as well.
>
> You note the fence_context of the job which caused the trouble.
>
> Then take a look at all jobs on the recovery list and remove the ones 
> with the same fence_context number.
>
> Then take a look at all entities on the runlist and remove the one 
> with the same fence_context number. If it is already freed you won't 
> find any, but that shouldn't be a problem.
>
> This way you can effectively prevent other jobs from the same context 
> from running, and the context query can simply check whether the entity 
> is still on the run list to figure out if it was guilty or not.
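>
> A minimal sketch of that query side, assuming the run queue removal uses 
> list_del_init() so a kicked-out entity ends up with an empty list entry 
> (the rq/list field names are from memory, so treat them as assumptions):
>
>     struct amd_sched_entity *entity = &ctx->rings[ring->idx].entity;
>     bool guilty;
>
>     spin_lock(&entity->rq->lock);
>     guilty = list_empty(&entity->list); /* removed from the run queue */
>     spin_unlock(&entity->rq->lock);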
>
> Regards,
> Christian.
>
> On 03.05.2017 at 09:23, zhoucm1 wrote:
>
>     On 2017-05-03 14:02, Liu, Monk wrote:
>
>         You can add ctx as a field of job, but not take a reference on it;
>         when you try to use ctx, just check if ctx == NULL.
>         > that doesn't work at all... job->ctx will always be non-NULL
>         > after it is initialized, you would just be dereferencing a wild
>         > pointer after the CTX is released
>
>     job->ctx is a **ctx, which could resolve that concern.
>
>
>
>         Another stupid method:
>         Use idr_for_each_entry(..job->vm->ctx_mgr...) and compare the
>         job->fence->fence_context with
>         ctx->ring[].entity->fence_context. If it is not found, then the
>         ctx has been freed; otherwise you can do your things with this ctx.
>         > 1) fence_context has a chance to incorrectly represent the
>         > context behind it, because the number can be used up and will
>         > wrap around from the beginning
>
>     No, fence_context is globally unique in the kernel.
>
>     Regards,
>     David Zhou
>
>            2) why not just keep the CTX alive, which is much simpler than
>         this method
>
>         BR
>
>         -----Original Message-----
>         From: Zhou, David(ChunMing)
>         Sent: Wednesday, May 03, 2017 12:54 PM
>         To: Liu, Monk <Monk.Liu at amd.com> <mailto:Monk.Liu at amd.com>;
>         Liu, Monk <Monk.Liu at amd.com> <mailto:Monk.Liu at amd.com>;
>         Christian König <deathsimple at vodafone.de>
>         <mailto:deathsimple at vodafone.de>;
>         amd-gfx at lists.freedesktop.org
>         <mailto:amd-gfx at lists.freedesktop.org>
>         Subject: RE: [PATCH 1/5] drm/amdgpu:keep ctx alive till all
>         job finished
>
>         You can add ctx as a field of job, but not take a reference on it;
>         when you try to use ctx, just check if ctx == NULL.
>
>         Another stupid method:
>         Use idr_for_each_entry(..job->vm->ctx_mgr...) and compare the
>         job->fence->fence_context with
>         ctx->ring[].entity->fence_context. If it is not found, then the
>         ctx has been freed; otherwise you can do your things with this ctx.
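>
>         A rough sketch of that lookup as a hypothetical helper (the ctx
>         manager and entity field names are from memory, so treat them as
>         assumptions):
>
>             static struct amdgpu_ctx *
>             ctx_lookup_by_fence_context(struct amdgpu_ctx_mgr *mgr, u64 fctx)
>             {
>                 struct amdgpu_ctx *ctx, *found = NULL;
>                 uint32_t id;
>                 unsigned i;
>
>                 mutex_lock(&mgr->lock);
>                 idr_for_each_entry(&mgr->ctx_handles, ctx, id)
>                     for (i = 0; !found && i < AMDGPU_MAX_RINGS; i++)
>                         if (ctx->rings[i].entity.fence_context == fctx)
>                             found = ctx;   /* ctx is still alive */
>                 mutex_unlock(&mgr->lock);
>
>                 return found;              /* NULL -> ctx already freed */
>             }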
>
>         Regards,
>         David Zhou
>
>         -----Original Message-----
>         From: amd-gfx [mailto:amd-gfx-bounces at lists.freedesktop.org]
>         On Behalf Of Liu, Monk
>         Sent: Wednesday, May 03, 2017 11:57 AM
>         To: Liu, Monk <Monk.Liu at amd.com> <mailto:Monk.Liu at amd.com>;
>         Christian König <deathsimple at vodafone.de>
>         <mailto:deathsimple at vodafone.de>;
>         amd-gfx at lists.freedesktop.org
>         <mailto:amd-gfx at lists.freedesktop.org>
>         Subject: RE: [PATCH 1/5] drm/amdgpu:keep ctx alive till all
>         job finished
>
>         @Christian,
>
>         The thing is I need mark all entities behind this timeout job
>         as guilty, so I use a member in entity named "ptr_guilty", to
>         point to The member in amdgpu_ctx named "guilty", that way
>         read *entity->ptr_guilty can get you if this entity is
>         "invalid", and set *entity->ptr_guilty Can let you set all
>         entities belongs to the context as "invalid".
>
>         The above logic guarantees that we can kick out all guilty
>         entities in the entity kfifo, and also block every IOCTL whose ctx
>         handle points to this guilty context, and that we only recover
>         other jobs/entities/contexts after the scheduler is unparked.
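>
>         A rough sketch of that wiring (ptr_guilty and ctx->guilty are from
>         the patch below; "guilty_entity" is just a placeholder for however
>         the entity behind the timed-out job gets looked up):
>
>             /* at ctx init (see the patch): user entities point at ctx->guilty */
>             ctx->rings[i].entity.ptr_guilty = &ctx->guilty;
>
>             /* TDR handler: mark the whole context through one entity */
>             if (guilty_entity->ptr_guilty)
>                     *guilty_entity->ptr_guilty = true;
>
>             /* CS IOCTL / scheduler: reject further work from a guilty ctx */
>             if (entity->ptr_guilty && *entity->ptr_guilty)
>                     return -ECANCELED;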
>
>         If you reject the patch that keeps the ctx alive until all jobs
>         have signaled, please give me a solution that satisfies the above
>         logic.
>
>         BR Monk
>
>         -----Original Message-----
>         From: amd-gfx [mailto:amd-gfx-bounces at lists.freedesktop.org]
>         On Behalf Of Liu, Monk
>         Sent: Wednesday, May 03, 2017 11:31 AM
>         To: Christian König <deathsimple at vodafone.de>
>         <mailto:deathsimple at vodafone.de>;
>         amd-gfx at lists.freedesktop.org
>         <mailto:amd-gfx at lists.freedesktop.org>
>         Subject: RE: [PATCH 1/5] drm/amdgpu:keep ctx alive till all
>         job finished
>
>         1. This is necessary, otherwise how can I access the entity
>         pointer after a job has timed out, and why is it dangerous?
>         2. What is the status field in the fences you were referring to?
>         I need to judge whether it could satisfy my requirement.
>
>
>
>         -----Original Message-----
>         From: Christian König [mailto:deathsimple at vodafone.de]
>         Sent: Monday, May 01, 2017 10:48 PM
>         To: Liu, Monk <Monk.Liu at amd.com> <mailto:Monk.Liu at amd.com>;
>         amd-gfx at lists.freedesktop.org
>         <mailto:amd-gfx at lists.freedesktop.org>
>         Subject: Re: [PATCH 1/5] drm/amdgpu:keep ctx alive till all
>         job finished
>
>         On 01.05.2017 at 09:22, Monk Liu wrote:
>         > for the TDR guilty context feature, we need to access ctx/s_entity
>         > field members through the sched job pointer, so the ctx must be
>         > kept alive until all jobs from it have signaled.
>
>         NAK, that is unnecessary and quite dangerous.
>
>         Instead we have the status field in the fences, which was designed
>         for this and should be checked.
>
>         Regards,
>         Christian.
>
>         >
>         > Change-Id: Ib87e9502f7a5c8c054c7e56956d7f7ad75998e43
>         > Signed-off-by: Monk Liu <Monk.Liu at amd.com>
>         <mailto:Monk.Liu at amd.com>
>         > ---
>         >   drivers/gpu/drm/amd/amdgpu/amdgpu.h           | 6 +++++-
>         >   drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c        | 2 +-
>         >   drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c       | 9 +++++++++
>         >   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c       | 9 +++++++--
>         >   drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 6 ------
>         >   drivers/gpu/drm/amd/scheduler/gpu_scheduler.h | 1 +
>         >   6 files changed, 23 insertions(+), 10 deletions(-)
>         >
>         > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>         > b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>         > index e330009..8e031d6 100644
>         > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>         > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>         > @@ -760,10 +760,12 @@ struct amdgpu_ib {
>         >        uint32_t                        flags;
>         >   };
>         >
>         > +struct amdgpu_ctx;
>         > +
>         >   extern const struct amd_sched_backend_ops amdgpu_sched_ops;
>         >
>         >   int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned
>         num_ibs,
>         > -                  struct amdgpu_job **job, struct amdgpu_vm
>         *vm);
>         > +                  struct amdgpu_job **job, struct amdgpu_vm
>         *vm, struct
>         > +amdgpu_ctx *ctx);
>         >   int amdgpu_job_alloc_with_ib(struct amdgpu_device *adev,
>         unsigned size,
>         >                             struct amdgpu_job **job);
>         >
>         > @@ -802,6 +804,7 @@ struct amdgpu_ctx_mgr {
>         >
>         >   struct amdgpu_ctx *amdgpu_ctx_get(struct amdgpu_fpriv
>         *fpriv, uint32_t id);
>         >   int amdgpu_ctx_put(struct amdgpu_ctx *ctx);
>         > +struct amdgpu_ctx *amdgpu_ctx_kref_get(struct amdgpu_ctx *ctx);
>         >
>         >   uint64_t amdgpu_ctx_add_fence(struct amdgpu_ctx *ctx,
>         struct amdgpu_ring *ring,
>         >                              struct fence *fence);
>         > @@ -1129,6 +1132,7 @@ struct amdgpu_job {
>         >        struct amdgpu_sync      sync;
>         >        struct amdgpu_ib        *ibs;
>         >        struct fence            *fence; /* the hw fence */
>         > +     struct amdgpu_ctx *ctx;
>         >        uint32_t                preamble_status;
>         >        uint32_t                num_ibs;
>         >        void                    *owner;
>         > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>         > b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>         > index 699f5fe..267fb65 100644
>         > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>         > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>         > @@ -234,7 +234,7 @@ int amdgpu_cs_parser_init(struct
>         amdgpu_cs_parser *p, void *data)
>         >                }
>         >        }
>         >
>         > -     ret = amdgpu_job_alloc(p->adev, num_ibs, &p->job, vm);
>         > +     ret = amdgpu_job_alloc(p->adev, num_ibs, &p->job, vm,
>         p->ctx);
>         >        if (ret)
>         >                goto free_all_kdata;
>         >
>         > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>         > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>         > index b4bbbb3..81438af 100644
>         > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>         > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>         > @@ -25,6 +25,13 @@
>         >   #include <drm/drmP.h>
>         >   #include "amdgpu.h"
>         >
>         > +struct amdgpu_ctx *amdgpu_ctx_kref_get(struct amdgpu_ctx
>         *ctx) {
>         > +     if (ctx)
>         > +             kref_get(&ctx->refcount);
>         > +     return ctx;
>         > +}
>         > +
>         >   static int amdgpu_ctx_init(struct amdgpu_device *adev,
>         struct amdgpu_ctx *ctx)
>         >   {
>         >        unsigned i, j;
>         > @@ -56,6 +63,8 @@ static int amdgpu_ctx_init(struct
>         amdgpu_device *adev, struct amdgpu_ctx *ctx)
>         >                                          rq, amdgpu_sched_jobs);
>         >                if (r)
>         >                        goto failed;
>         > +
>         > +             ctx->rings[i].entity.ptr_guilty =
>         &ctx->guilty; /* kernel entity
>         > +doesn't have ptr_guilty */
>         >        }
>         >
>         >        return 0;
>         > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>         > b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>         > index 690ef3d..208da11 100644
>         > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>         > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>         > @@ -40,7 +40,7 @@ static void amdgpu_job_timedout(struct
>         amd_sched_job *s_job)
>         >   }
>         >
>         >   int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned
>         num_ibs,
>         > -                  struct amdgpu_job **job, struct amdgpu_vm
>         *vm)
>         > +                  struct amdgpu_job **job, struct amdgpu_vm
>         *vm, struct
>         > +amdgpu_ctx *ctx)
>         >   {
>         >        size_t size = sizeof(struct amdgpu_job);
>         >
>         > @@ -57,6 +57,7 @@ int amdgpu_job_alloc(struct amdgpu_device
>         *adev, unsigned num_ibs,
>         >        (*job)->vm = vm;
>         >        (*job)->ibs = (void *)&(*job)[1];
>         >        (*job)->num_ibs = num_ibs;
>         > +     (*job)->ctx = amdgpu_ctx_kref_get(ctx);
>         >
>         >        amdgpu_sync_create(&(*job)->sync);
>         >
>         > @@ -68,7 +69,7 @@ int amdgpu_job_alloc_with_ib(struct
>         amdgpu_device *adev, unsigned size,
>         >   {
>         >        int r;
>         >
>         > -     r = amdgpu_job_alloc(adev, 1, job, NULL);
>         > +     r = amdgpu_job_alloc(adev, 1, job, NULL, NULL);
>         >        if (r)
>         >                return r;
>         >
>         > @@ -94,6 +95,10 @@ void amdgpu_job_free_resources(struct
>         amdgpu_job *job)
>         >   static void amdgpu_job_free_cb(struct amd_sched_job *s_job)
>         >   {
>         >        struct amdgpu_job *job = container_of(s_job, struct
>         amdgpu_job,
>         > base);
>         > +     struct amdgpu_ctx *ctx = job->ctx;
>         > +
>         > +     if (ctx)
>         > +             amdgpu_ctx_put(ctx);
>         >
>         >        fence_put(job->fence);
>         >        amdgpu_sync_free(&job->sync);
>         > diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>         > b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>         > index 6f4e31f..9100ca8 100644
>         > --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>         > +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>         > @@ -208,12 +208,6 @@ void amd_sched_entity_fini(struct
>         amd_gpu_scheduler *sched,
>         >        if (!amd_sched_entity_is_initialized(sched, entity))
>         >                return;
>         >
>         > -     /**
>         > -      * The client will not queue more IBs during this
>         fini, consume existing
>         > -      * queued IBs
>         > -     */
>         > -     wait_event(sched->job_scheduled,
>         amd_sched_entity_is_idle(entity));
>         > -
>         >        amd_sched_rq_remove_entity(rq, entity);
>         >        kfifo_free(&entity->job_queue);
>         >   }
>         > diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>         > b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>         > index 8cb41d3..ccbbcb0 100644
>         > --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>         > +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>         > @@ -49,6 +49,7 @@ struct amd_sched_entity {
>         >
>         >        struct fence *dependency;
>         >        struct fence_cb                 cb;
>         > +     bool *ptr_guilty;
>         >   };
>         >
>         >   /**
>
>
>         _______________________________________________
>         amd-gfx mailing list
>         amd-gfx at lists.freedesktop.org
>         <mailto:amd-gfx at lists.freedesktop.org>
>         https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>
