> Please give me some detailed approach on dma_fence, thanks!

I think the idea of using the fence_context is even better than using
the fence status. It has been present for a while, and filtering the
jobs/entities by it needs something like 10 lines of code.

The fence status is new and only available on newer kernels (we would
need to rebase to amd-staging-4.11).

I will try to dig up later today how the Intel guys solved that problem.

Christian.
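For illustration, the roughly ten lines of filtering meant here could look
like the sketch below. It leans on structures this scheduler really has
(ring_mirror_list, job->node, entity->fence_context,
amd_sched_rq_remove_entity()); the function itself, its locking, and the
failing of the dropped jobs' fences are assumptions, not existing code.
Note that an entity allocates a small block of fence contexts
(scheduled/finished), so the comparison may need to allow for that offset.

static void amd_sched_drop_guilty_context(struct amd_gpu_scheduler *sched,
					  u64 guilty_context)
{
	struct amd_sched_job *job, *tmp;
	struct amd_sched_entity *entity, *ent_tmp;

	/* 1) drop not-yet-recovered jobs that belong to the guilty context */
	list_for_each_entry_safe(job, tmp, &sched->ring_mirror_list, node) {
		if (job->s_fence->finished.context == guilty_context)
			list_del_init(&job->node); /* and fail its fence */
	}

	/*
	 * 2) drop the matching entity from the run queue, if it is still
	 * there; an entity owns two fence contexts (scheduled/finished)
	 */
	list_for_each_entry_safe(entity, ent_tmp,
				 &sched->sched_rq.entities, list) {
		if (entity->fence_context == guilty_context ||
		    entity->fence_context + 1 == guilty_context)
			amd_sched_rq_remove_entity(&sched->sched_rq, entity);
	}
}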
On 03.05.2017 at 11:14, Liu, Monk wrote:
>
> I thought of that in the first place as well: keep an ID, and parse the
> ID of every unsignaled job to see if the job belongs to something
> guilty. But keeping the ctx alive is simpler, so I chose this way.
>
> But I admit it is doable as well, and I want to compare this method
> with the dma_fence status field you mentioned.
>
> Please give me some detailed approach on dma_fence, thanks!
>
> BR Monk
>
> From: Christian König [mailto:deathsimple at vodafone.de]
> Sent: Wednesday, May 03, 2017 5:12 PM
> To: Zhou, David(ChunMing) <David1.Zhou at amd.com>; Liu, Monk
> <Monk.Liu at amd.com>; amd-gfx at lists.freedesktop.org
> Subject: Re: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job finished
>
> > 1) fence_context has a chance to incorrectly represent the context
> > behind it, because the numbers can be used up and will wrap around
> > from the beginning
> >
> > No, fence_context is globally unique in the kernel.
>
> Yeah, I would go with that as well.
>
> You note the fence_context of the job which caused the trouble.
>
> Then take a look at all jobs on the recovery list and remove the ones
> with the same fence_context number.
>
> Then take a look at all entities on the run list and remove the one
> with the same fence_context number. If it is already freed you won't
> find any, but that shouldn't be a problem.
>
> This way you can effectively prevent other jobs from the same context
> from running, and the context query can simply check whether the
> entity is still on the run list to figure out if it was guilty or not.
>
> Regards,
> Christian.
>
> On 03.05.2017 at 09:23, zhoucm1 wrote:
>
> On 2017/05/03 14:02, Liu, Monk wrote:
>
> > You can add ctx as a field of job without taking a reference to it;
> > when you try to use the ctx, just check whether ctx == NULL.
>
> That doesn't work at all... job->ctx will always be non-NULL after it
> is initialized; you would just be referring to a wild pointer after
> the ctx is released.
>
> job->ctx is a **ctx, which could resolve this concern of yours.
>
> > Another stupid method:
> > Use idr_for_each_entry(..job->vm->ctx_mgr...) and compare the
> > job->fence->fence_context with ctx->ring[].entity->fence_context.
> > If no match is found, then the ctx has been freed; otherwise you can
> > do your work on this ctx.
>
> > 1) fence_context has a chance to incorrectly represent the context
> > behind it, because the numbers can be used up and will wrap around
> > from the beginning
>
> No, fence_context is globally unique in the kernel.
>
> Regards,
> David Zhou
>
> > 2) why not just keep the CTX alive, which is much simpler than this
> > method
> >
> > BR
>
> -----Original Message-----
> From: Zhou, David(ChunMing)
> Sent: Wednesday, May 03, 2017 12:54 PM
> To: Liu, Monk <Monk.Liu at amd.com>; Christian König
> <deathsimple at vodafone.de>; amd-gfx at lists.freedesktop.org
> Subject: RE: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job finished
>
> You can add ctx as a field of job without taking a reference to it;
> when you try to use the ctx, just check whether ctx == NULL.
>
> Another stupid method:
> Use idr_for_each_entry(..job->vm->ctx_mgr...) and compare the
> job->fence->fence_context with ctx->ring[].entity->fence_context.
> If no match is found, then the ctx has been freed; otherwise you can
> do your work on this ctx.
>
> Regards,
> David Zhou
>
> -----Original Message-----
> From: amd-gfx [mailto:amd-gfx-bounces at lists.freedesktop.org] On Behalf Of Liu, Monk
> Sent: Wednesday, May 03, 2017 11:57 AM
> To: Liu, Monk <Monk.Liu at amd.com>; Christian König
> <deathsimple at vodafone.de>; amd-gfx at lists.freedesktop.org
> Subject: RE: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job finished
>
> @Christian,
>
> The thing is I need to mark all entities behind this timed-out job as
> guilty, so I use a member in the entity named "ptr_guilty" that points
> to the member in amdgpu_ctx named "guilty". That way, reading
> *entity->ptr_guilty tells you whether this entity is "invalid", and
> setting *entity->ptr_guilty lets you mark all entities belonging to
> the context as "invalid".
>
> The above logic guarantees that we can kick out all guilty entities in
> the entity kfifo and also block every IOCTL whose ctx handle points to
> this guilty context, and that we only recover other
> jobs/entities/contexts after the scheduler is unparked.
>
> If you reject the patch that keeps the ctx alive till all jobs have
> signaled, please give me a solution that satisfies the above logic.
>
> BR Monk
>
> -----Original Message-----
> From: amd-gfx [mailto:amd-gfx-bounces at lists.freedesktop.org] On Behalf Of Liu, Monk
> Sent: Wednesday, May 03, 2017 11:31 AM
> To: Christian König <deathsimple at vodafone.de>;
> amd-gfx at lists.freedesktop.org
> Subject: RE: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job finished
>
> 1. This is necessary; otherwise how can I access the entity pointer
> after a job has timed out, and why is it dangerous?
> 2. What is the status field in the fences you were referring to? I
> need to judge whether it can satisfy my requirement.
>
> -----Original Message-----
> From: Christian König [mailto:deathsimple at vodafone.de]
> Sent: Monday, May 01, 2017 10:48 PM
> To: Liu, Monk <Monk.Liu at amd.com>; amd-gfx at lists.freedesktop.org
> Subject: Re: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job finished
>
> On 01.05.2017 at 09:22, Monk Liu wrote:
> > for the TDR guilty context feature, we need to access ctx/s_entity
> > field members through the sched job pointer, so the ctx must be kept
> > alive till all jobs from it have signaled.
>
> NAK, that is unnecessary and quite dangerous.
>
> Instead we have the designed status field in the fences which should
> be checked for that.
>
> Regards,
> Christian.
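Spelled out, the idr walk David suggests might look roughly like the
following. A sketch only: how the job reaches the ctx manager
(job->vm->ctx_mgr in the mail) is an assumption about the surrounding
code, and the exact context number to compare (scheduled vs. finished
fence) may need a +1 offset; idr_for_each_entry, the ctx manager's
ctx_handles/lock, and the per-ring entity are existing structures.

/*
 * Decide whether the context that produced a job is still alive by
 * matching its fence_context against every live context's entities.
 */
static struct amdgpu_ctx *
amdgpu_ctx_find_by_fence_context(struct amdgpu_ctx_mgr *mgr,
				 u64 fence_context)
{
	struct amdgpu_ctx *ctx;
	unsigned i;
	int id;

	mutex_lock(&mgr->lock);
	idr_for_each_entry(&mgr->ctx_handles, ctx, id) {
		for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
			if (ctx->rings[i].entity.fence_context == fence_context) {
				mutex_unlock(&mgr->lock);
				return ctx; /* context still alive */
			}
		}
	}
	mutex_unlock(&mgr->lock);
	return NULL; /* no match: the context was already freed */
}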
> > Change-Id: Ib87e9502f7a5c8c054c7e56956d7f7ad75998e43
> > Signed-off-by: Monk Liu <Monk.Liu at amd.com>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu.h           | 6 +++++-
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c        | 2 +-
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c       | 9 +++++++++
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c       | 9 +++++++--
> >   drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 6 ------
> >   drivers/gpu/drm/amd/scheduler/gpu_scheduler.h | 1 +
> >   6 files changed, 23 insertions(+), 10 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > index e330009..8e031d6 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > @@ -760,10 +760,12 @@ struct amdgpu_ib {
> >  	uint32_t			flags;
> >  };
> >
> > +struct amdgpu_ctx;
> > +
> >  extern const struct amd_sched_backend_ops amdgpu_sched_ops;
> >
> >  int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
> > -		     struct amdgpu_job **job, struct amdgpu_vm *vm);
> > +		     struct amdgpu_job **job, struct amdgpu_vm *vm, struct amdgpu_ctx *ctx);
> >  int amdgpu_job_alloc_with_ib(struct amdgpu_device *adev, unsigned size,
> >  			     struct amdgpu_job **job);
> >
> > @@ -802,6 +804,7 @@ struct amdgpu_ctx_mgr {
> >
> >  struct amdgpu_ctx *amdgpu_ctx_get(struct amdgpu_fpriv *fpriv, uint32_t id);
> >  int amdgpu_ctx_put(struct amdgpu_ctx *ctx);
> > +struct amdgpu_ctx *amdgpu_ctx_kref_get(struct amdgpu_ctx *ctx);
> >
> >  uint64_t amdgpu_ctx_add_fence(struct amdgpu_ctx *ctx, struct amdgpu_ring *ring,
> >  			      struct fence *fence);
> > @@ -1129,6 +1132,7 @@ struct amdgpu_job {
> >  	struct amdgpu_sync	sync;
> >  	struct amdgpu_ib	*ibs;
> >  	struct fence		*fence; /* the hw fence */
> > +	struct amdgpu_ctx	*ctx;
> >  	uint32_t		preamble_status;
> >  	uint32_t		num_ibs;
> >  	void			*owner;
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> > index 699f5fe..267fb65 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> > @@ -234,7 +234,7 @@ int amdgpu_cs_parser_init(struct amdgpu_cs_parser *p, void *data)
> >  		}
> >  	}
> >
> > -	ret = amdgpu_job_alloc(p->adev, num_ibs, &p->job, vm);
> > +	ret = amdgpu_job_alloc(p->adev, num_ibs, &p->job, vm, p->ctx);
> >  	if (ret)
> >  		goto free_all_kdata;
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> > index b4bbbb3..81438af 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> > @@ -25,6 +25,13 @@
> >  #include <drm/drmP.h>
> >  #include "amdgpu.h"
> >
> > +struct amdgpu_ctx *amdgpu_ctx_kref_get(struct amdgpu_ctx *ctx)
> > +{
> > +	if (ctx)
> > +		kref_get(&ctx->refcount);
> > +	return ctx;
> > +}
> > +
> >  static int amdgpu_ctx_init(struct amdgpu_device *adev, struct amdgpu_ctx *ctx)
> >  {
> >  	unsigned i, j;
> > @@ -56,6 +63,8 @@ static int amdgpu_ctx_init(struct amdgpu_device *adev, struct amdgpu_ctx *ctx)
> >  					  rq, amdgpu_sched_jobs);
> >  		if (r)
> >  			goto failed;
> > +
> > +		ctx->rings[i].entity.ptr_guilty = &ctx->guilty; /* kernel entity doesn't have ptr_guilty */
> >  	}
> >
> >  	return 0;
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > index 690ef3d..208da11 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > @@ -40,7 +40,7 @@ static void amdgpu_job_timedout(struct amd_sched_job *s_job)
> >  }
> >
> >  int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
> > -		     struct amdgpu_job **job, struct amdgpu_vm *vm)
> > +		     struct amdgpu_job **job, struct amdgpu_vm *vm, struct amdgpu_ctx *ctx)
> >  {
> >  	size_t size = sizeof(struct amdgpu_job);
> >
> > @@ -57,6 +57,7 @@ int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
> >  	(*job)->vm = vm;
> >  	(*job)->ibs = (void *)&(*job)[1];
> >  	(*job)->num_ibs = num_ibs;
> > +	(*job)->ctx = amdgpu_ctx_kref_get(ctx);
> >
> >  	amdgpu_sync_create(&(*job)->sync);
> >
> > @@ -68,7 +69,7 @@ int amdgpu_job_alloc_with_ib(struct amdgpu_device *adev, unsigned size,
> >  {
> >  	int r;
> >
> > -	r = amdgpu_job_alloc(adev, 1, job, NULL);
> > +	r = amdgpu_job_alloc(adev, 1, job, NULL, NULL);
> >  	if (r)
> >  		return r;
> >
> > @@ -94,6 +95,10 @@ void amdgpu_job_free_resources(struct amdgpu_job *job)
> >  static void amdgpu_job_free_cb(struct amd_sched_job *s_job)
> >  {
> >  	struct amdgpu_job *job = container_of(s_job, struct amdgpu_job, base);
> > +	struct amdgpu_ctx *ctx = job->ctx;
> > +
> > +	if (ctx)
> > +		amdgpu_ctx_put(ctx);
> >
> >  	fence_put(job->fence);
> >  	amdgpu_sync_free(&job->sync);
> > diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
> > index 6f4e31f..9100ca8 100644
> > --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
> > +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
> > @@ -208,12 +208,6 @@ void amd_sched_entity_fini(struct amd_gpu_scheduler *sched,
> >  	if (!amd_sched_entity_is_initialized(sched, entity))
> >  		return;
> >
> > -	/**
> > -	 * The client will not queue more IBs during this fini, consume existing
> > -	 * queued IBs
> > -	 */
> > -	wait_event(sched->job_scheduled, amd_sched_entity_is_idle(entity));
> > -
> >  	amd_sched_rq_remove_entity(rq, entity);
> >  	kfifo_free(&entity->job_queue);
> >  }
> > diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
> > index 8cb41d3..ccbbcb0 100644
> > --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
> > +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
> > @@ -49,6 +49,7 @@ struct amd_sched_entity {
> >
> >  	struct fence			*dependency;
> >  	struct fence_cb			cb;
> > +	bool *ptr_guilty;
> >  };
> >
> >  /**
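The patch above wires up entity->ptr_guilty and points it at
amdgpu_ctx.guilty, but the consuming side is not part of this patch. From
Monk's description earlier in the thread, the intended usage would be
roughly the following sketch (both helper names are illustrative, not
from the patch):

/* Sketch only: illustrative helpers, not part of the patch above. */

/* On a TDR, condemn the whole context behind the timed-out job. */
static void amd_sched_entity_set_guilty(struct amd_sched_entity *entity)
{
	if (entity->ptr_guilty)		/* kernel entities have none */
		*entity->ptr_guilty = true;	/* writes amdgpu_ctx.guilty */
}

/* During recovery or submission, skip entities of a guilty context. */
static bool amd_sched_entity_is_guilty(struct amd_sched_entity *entity)
{
	return entity->ptr_guilty && *entity->ptr_guilty;
}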
_______________________________________________
amd-gfx mailing list
amd-gfx at lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx