Looking at this title, it makes me think job->vm isn't safe either: the VM could be freed while it is still being used in amdgpu_ib_schedule(). Do you have any thoughts on how to solve that?

Regards,
David Zhou

On 2017年05月03日 17:18, Christian König wrote:
>>> I'm afraid not: CSA is gone with the VM, and the VM is gone after the app closes our FD. I don't see that amdgpu_vm_fini() depends on the context living or not ...
>
> See the teardown order in amdgpu_driver_postclose_kms():
>> amdgpu_ctx_mgr_fini(&fpriv->ctx_mgr);
>>
>> amdgpu_uvd_free_handles(adev, file_priv);
>> amdgpu_vce_free_handles(adev, file_priv);
>>
>> amdgpu_vm_bo_rmv(adev, fpriv->prt_va);
>>
>> if (amdgpu_sriov_vf(adev)) {
>> 	/* TODO: how to handle reserve failure */
>> 	BUG_ON(amdgpu_bo_reserve(adev->virt.csa_obj, false));
>> 	amdgpu_vm_bo_rmv(adev, fpriv->vm.csa_bo_va);
>> 	fpriv->vm.csa_bo_va = NULL;
>> 	amdgpu_bo_unreserve(adev->virt.csa_obj);
>> }
>>
>> amdgpu_vm_fini(adev, &fpriv->vm);
>
> amdgpu_ctx_mgr_fini() waits for scheduling to finish and releases all contexts of the current fd.
>
> If we don't release the context here because some jobs are still executing, we need to keep the UVD and VCE handles, the PRT VAs, the CSA and even the whole VM structure alive.
>
>> I'll see if dma_fence could solve my issue, but I wish you could give me your detailed idea
> Please take a look at David's idea of using the fence_context to find which jobs and entity to skip; that is even better than my idea about the fence status and should be trivial to implement, because all the data is already present, we just need to use it.
>
> Regards,
> Christian.
>
> Am 03.05.2017 um 11:08 schrieb Liu, Monk:
>> 1, My idea is that userspace should rather gather the feedback during the next command submission. This has the advantage that you don't need to keep userspace alive till all jobs are done.
>>
>>> No, we need to clean the hw rings (cherry-pick out the guilty entities' jobs in all rings) after a GPU reset, we need to fake-signal all sched_fences in the guilty entity as well, and we need to mark the context as guilty so the next IOCTL on it will return -ENODEV.
>>> I don't understand how your idea can solve my request ...
>> 2, You need to keep quite a bunch of stuff alive (VM, CSA) when you don't tear down the ctx immediately.
>>
>>> I'm afraid not: CSA is gone with the VM, and the VM is gone after the app closes our FD. I don't see that amdgpu_vm_fini() depends on the context living or not ...
>> 3, struct fence was renamed to struct dma_fence on newer kernels and a status field was added for exactly this purpose.
>>
>> The Intel guys did this because they ran into exactly the same problem.
>>
>>> I'll see if dma_fence could solve my issue, but I wish you could give me your detailed idea
>>
>> BR Monk
>>
>>
>>
>> -----Original Message-----
>> From: Christian König [mailto:deathsimple at vodafone.de]
>> Sent: Wednesday, May 03, 2017 4:59 PM
>> To: Liu, Monk <Monk.Liu at amd.com>; amd-gfx at lists.freedesktop.org
>> Subject: Re: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job finished
>>
>>> 1, This is necessary otherwise how can I access the entity pointer after a job timed out
>> No, that isn't necessary.
>>
>> The problem with your idea is that you want to actively push the feedback/status from the job execution back to userspace when an error (timeout) happens.
>>
>> My idea is that userspace should rather gather the feedback during the next command submission. This has the advantage that you don't need to keep userspace alive till all jobs are done.
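As a side note on the status field mentioned above: on newer kernels, where struct fence has become struct dma_fence, the error can be stored in the fence itself and read back later, which fits the "gather the feedback during the next command submission" flow. A minimal sketch using only the generic dma-fence API; the two helper names below are placeholders of mine, not existing amdgpu functions:

#include <linux/dma-fence.h>
#include <linux/errno.h>

/* Reset/timeout path: record why the job died before its hw fence is
 * signaled.  dma_fence_set_error() has to be called before
 * dma_fence_signal(). */
static void sketch_mark_job_cancelled(struct dma_fence *hw_fence)
{
	dma_fence_set_error(hw_fence, -ECANCELED);
	dma_fence_signal(hw_fence);
}

/* Next CS from the same client: 0 = still pending, 1 = signaled
 * without error, negative = the error recorded above. */
static int sketch_query_job_status(struct dma_fence *hw_fence)
{
	return dma_fence_get_status(hw_fence);
}

So userspace only needs to be told on its next submission; nothing has to keep the ctx or the entity alive for that.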
>>
>>> , and why it is dangerous ?
>> You need to keep quite a bunch of stuff alive (VM, CSA) when you don't tear down the ctx immediately.
>>
>> We could split ctx teardown into freeing the resources and freeing the structure, but I think just gathering the information needed on CS is easier to do.
>>
>>> 2, what's the status field in the fences you were referring to ? I need to judge if it could satisfy my requirement
>> struct fence was renamed to struct dma_fence on newer kernels and a status field was added for exactly this purpose.
>>
>> The Intel guys did this because they ran into exactly the same problem.
>>
>> Regards,
>> Christian.
>>
>> Am 03.05.2017 um 05:30 schrieb Liu, Monk:
>>> 1, This is necessary otherwise how can I access the entity pointer after a job timed out, and why it is dangerous ?
>>> 2, what's the status field in the fences you were referring to ? I need to judge if it could satisfy my requirement
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Christian König [mailto:deathsimple at vodafone.de]
>>> Sent: Monday, May 01, 2017 10:48 PM
>>> To: Liu, Monk <Monk.Liu at amd.com>; amd-gfx at lists.freedesktop.org
>>> Subject: Re: [PATCH 1/5] drm/amdgpu:keep ctx alive till all job finished
>>>
>>> Am 01.05.2017 um 09:22 schrieb Monk Liu:
>>>> for the TDR guilty context feature, we need to access ctx/s_entity field members through the sched job pointer, so the ctx must be kept alive till all jobs from it are signaled.
>>> NAK, that is unnecessary and quite dangerous.
>>>
>>> Instead we have the designed status field in the fences which should be checked for that.
>>>
>>> Regards,
>>> Christian.
>>>
>>>> Change-Id: Ib87e9502f7a5c8c054c7e56956d7f7ad75998e43
>>>> Signed-off-by: Monk Liu <Monk.Liu at amd.com>
>>>> ---
>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu.h           | 6 +++++-
>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c        | 2 +-
>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c       | 9 +++++++++
>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c       | 9 +++++++--
>>>>  drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 6 ------
>>>>  drivers/gpu/drm/amd/scheduler/gpu_scheduler.h | 1 +
>>>>  6 files changed, 23 insertions(+), 10 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>> index e330009..8e031d6 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>> @@ -760,10 +760,12 @@ struct amdgpu_ib {
>>>>  	uint32_t			flags;
>>>>  };
>>>> +struct amdgpu_ctx;
>>>> +
>>>>  extern const struct amd_sched_backend_ops amdgpu_sched_ops;
>>>>  int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
>>>> -		     struct amdgpu_job **job, struct amdgpu_vm *vm);
>>>> +		     struct amdgpu_job **job, struct amdgpu_vm *vm, struct amdgpu_ctx *ctx);
>>>>  int amdgpu_job_alloc_with_ib(struct amdgpu_device *adev, unsigned size,
>>>>  			     struct amdgpu_job **job);
>>>> @@ -802,6 +804,7 @@ struct amdgpu_ctx_mgr {
>>>>  struct amdgpu_ctx *amdgpu_ctx_get(struct amdgpu_fpriv *fpriv, uint32_t id);
>>>>  int amdgpu_ctx_put(struct amdgpu_ctx *ctx);
>>>> +struct amdgpu_ctx *amdgpu_ctx_kref_get(struct amdgpu_ctx *ctx);
>>>>  uint64_t amdgpu_ctx_add_fence(struct amdgpu_ctx *ctx, struct amdgpu_ring *ring,
>>>>  			      struct fence *fence);
>>>> @@ -1129,6 +1132,7 @@ struct amdgpu_job {
>>>>  	struct amdgpu_sync	sync;
>>>>  	struct amdgpu_ib	*ibs;
>>>>  	struct fence		*fence; /* the hw fence */
>>>> +	struct amdgpu_ctx	*ctx;
>>>>  	uint32_t		preamble_status;
>>>>  	uint32_t		num_ibs;
>>>>  	void			*owner;
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>>> index 699f5fe..267fb65 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>>> @@ -234,7 +234,7 @@ int amdgpu_cs_parser_init(struct amdgpu_cs_parser *p, void *data)
>>>>  		}
>>>>  	}
>>>> -	ret = amdgpu_job_alloc(p->adev, num_ibs, &p->job, vm);
>>>> +	ret = amdgpu_job_alloc(p->adev, num_ibs, &p->job, vm, p->ctx);
>>>>  	if (ret)
>>>>  		goto free_all_kdata;
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>>> index b4bbbb3..81438af 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>>> @@ -25,6 +25,13 @@
>>>>  #include <drm/drmP.h>
>>>>  #include "amdgpu.h"
>>>> +struct amdgpu_ctx *amdgpu_ctx_kref_get(struct amdgpu_ctx *ctx)
>>>> +{
>>>> +	if (ctx)
>>>> +		kref_get(&ctx->refcount);
>>>> +	return ctx;
>>>> +}
>>>> +
>>>>  static int amdgpu_ctx_init(struct amdgpu_device *adev, struct amdgpu_ctx *ctx)
>>>>  {
>>>>  	unsigned i, j;
>>>> @@ -56,6 +63,8 @@ static int amdgpu_ctx_init(struct amdgpu_device *adev, struct amdgpu_ctx *ctx)
>>>>  					  rq, amdgpu_sched_jobs);
>>>>  		if (r)
>>>>  			goto failed;
>>>> +
>>>> +		ctx->rings[i].entity.ptr_guilty = &ctx->guilty; /* kernel entity doesn't have ptr_guilty */
>>>>  	}
>>>>  	return 0;
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>> index 690ef3d..208da11 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>> @@ -40,7 +40,7 @@ static void amdgpu_job_timedout(struct amd_sched_job *s_job)
>>>>  }
>>>>  int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
>>>> -		     struct amdgpu_job **job, struct amdgpu_vm *vm)
>>>> +		     struct amdgpu_job **job, struct amdgpu_vm *vm, struct amdgpu_ctx *ctx)
>>>>  {
>>>>  	size_t size = sizeof(struct amdgpu_job);
>>>> @@ -57,6 +57,7 @@ int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
>>>>  	(*job)->vm = vm;
>>>>  	(*job)->ibs = (void *)&(*job)[1];
>>>>  	(*job)->num_ibs = num_ibs;
>>>> +	(*job)->ctx = amdgpu_ctx_kref_get(ctx);
>>>>  	amdgpu_sync_create(&(*job)->sync);
>>>> @@ -68,7 +69,7 @@ int amdgpu_job_alloc_with_ib(struct amdgpu_device *adev, unsigned size,
>>>>  {
>>>>  	int r;
>>>> -	r = amdgpu_job_alloc(adev, 1, job, NULL);
>>>> +	r = amdgpu_job_alloc(adev, 1, job, NULL, NULL);
>>>>  	if (r)
>>>>  		return r;
>>>> @@ -94,6 +95,10 @@ void amdgpu_job_free_resources(struct amdgpu_job *job)
>>>>  static void amdgpu_job_free_cb(struct amd_sched_job *s_job)
>>>>  {
>>>>  	struct amdgpu_job *job = container_of(s_job, struct amdgpu_job, base);
>>>> +	struct amdgpu_ctx *ctx = job->ctx;
>>>> +
>>>> +	if (ctx)
>>>> +		amdgpu_ctx_put(ctx);
>>>>  	fence_put(job->fence);
>>>>  	amdgpu_sync_free(&job->sync);
>>>> diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>>>> index 6f4e31f..9100ca8 100644
>>>> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>>>> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>>>> @@ -208,12 +208,6 @@ void amd_sched_entity_fini(struct amd_gpu_scheduler *sched,
>>>>  	if (!amd_sched_entity_is_initialized(sched, entity))
>>>>  		return;
>>>> -	/**
>>>> -	 * The client will not queue more IBs during this fini, consume existing
>>>> -	 * queued IBs
>>>> -	 */
>>>> -	wait_event(sched->job_scheduled, amd_sched_entity_is_idle(entity));
>>>> -
>>>>  	amd_sched_rq_remove_entity(rq, entity);
>>>>  	kfifo_free(&entity->job_queue);
>>>>  }
>>>> diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>>>> index 8cb41d3..ccbbcb0 100644
>>>> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>>>> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
>>>> @@ -49,6 +49,7 @@ struct amd_sched_entity {
>>>>  	struct fence			*dependency;
>>>>  	struct fence_cb			cb;
>>>> +	bool *ptr_guilty;
>>>>  };
>>>>  /**
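And on the fence_context idea further up in the thread, here is roughly how I picture the comparison in the job-recovery path after a reset. This is only a sketch: the field names (ring_mirror_list, s_job->node, s_fence->finished.context) reflect my reading of the scheduler and may be off, and the signal-with-error helper is hypothetical, it does not exist today:

#include <linux/list.h>
#include <linux/errno.h>
#include "gpu_scheduler.h"

/* Skip the guilty entity's jobs instead of resubmitting them after a
 * GPU reset.  guilty_fence_context would be the fence_context of the
 * guilty ctx's entity for this ring. */
static void sketch_skip_guilty_jobs(struct amd_gpu_scheduler *sched,
				    uint64_t guilty_fence_context)
{
	struct amd_sched_job *s_job, *tmp;

	list_for_each_entry_safe(s_job, tmp, &sched->ring_mirror_list, node) {
		if (s_job->s_fence->finished.context != guilty_fence_context)
			continue; /* innocent job, resubmit as usual */

		/* drop the job and fake-signal its scheduler fence with an
		 * error so waiters don't hang; hypothetical helper */
		list_del_init(&s_job->node);
		amd_sched_fence_signal_error(s_job->s_fence, -ECANCELED);
	}
}

Everything needed for the comparison already lives in the scheduler fence, so the job would not have to hold a ctx reference at all.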