Hi Christian,

You rejected some of my patches that return -ENODEV, on the grounds that Mesa does no handling for -ENODEV. But if Nicolai can confirm that Mesa does handle -ECANCELED, then we need to align our error codes overall. In detail, the following ioctls can return an error code:

    amdgpu_cs_ioctl
    amdgpu_cs_wait_ioctl
    amdgpu_cs_wait_fences_ioctl
    amdgpu_info_ioctl

My patches do the following: return -ENODEV on cs_ioctl if the context is detected as guilty; also return -ENODEV on cs_wait/cs_wait_fences if the fence is signaled but with error -ETIME; and return -ENODEV on info_ioctl so the UMD can query whether a GPU reset happened after the process was created (because in strict mode we block the process instead of the context).

According to Nicolai, amdgpu_cs_ioctl *can* return -ECANCELED. But frankly speaking, the kernel side has no place that returns -ECANCELED, so this solution on the Mesa side doesn't align with the *current* amdgpu driver, which only returns 0 on success or -EINVAL on other errors, and definitely never -ECANCELED. So if we're talking about community rules, we shouldn't let Mesa handle -ECANCELED; we should have a unified error code.
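To illustrate: whichever single errno we agree on, the UMD-side handling comes down to something like the sketch below. This is illustrative only, not radeonsi code; the struct, the function name, and the choice of which errnos to catch are all assumptions:

    #include <errno.h>
    #include <stdbool.h>

    /* Hypothetical UMD context state, reported to the application
     * through a robustness extension (e.g. GL_KHR_robustness). */
    struct umd_ctx {
            bool lost;
    };

    /* Fold the CS submission result into the "context lost" state.
     * Today radeonsi checks for -ECANCELED; my patches would have the
     * kernel return -ENODEV instead, hence the disagreement. */
    static int umd_handle_cs_result(struct umd_ctx *ctx, int ret)
    {
            if (ret == -ECANCELED || ret == -ENODEV)
                    ctx->lost = true;
            return ret;
    }

Either way the UMD logic is trivial; the point is that the kernel and Mesa must agree on checking the same value.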
+ Marek

BR Monk

-----Original Message-----
From: Haehnle, Nicolai
Sent: 9 October 2017 18:14
To: Koenig, Christian <Christian.Koenig at amd.com>; Liu, Monk <Monk.Liu at amd.com>; Nicolai Hähnle <nhaehnle at gmail.com>; amd-gfx at lists.freedesktop.org
Subject: Re: [PATCH 5/5] drm/amd/sched: signal and free remaining fences in amd_sched_entity_fini

On 09.10.2017 10:02, Christian König wrote:
>> For the gpu reset patches (already submitted to the public list) I
>> would make the kernel return -ENODEV if the fence being waited on (in
>> the cs_wait or wait_fences ioctl) is found to be in error; that way
>> the UMD would take the robustness extension path and conclude that a
>> GPU hang occurred.
> Well, that is only closed-source behavior, which is completely
> irrelevant for upstream development.
>
> As far as I know we haven't pushed the change to return -ENODEV upstream.

FWIW, radeonsi currently expects -ECANCELED on CS submissions and treats those as context lost. Perhaps we could use the same error on fences? That makes more sense to me than -ENODEV.

Cheers,
Nicolai

>
> Regards,
> Christian.
>
> On 09.10.2017 at 08:42, Liu, Monk wrote:
>> Christian,
>>
>>> It would be really nice to have an error code set on
>>> s_fence->finished before it is signaled, use dma_fence_set_error()
>>> for this.
>> For the gpu reset patches (already submitted to the public list) I
>> would make the kernel return -ENODEV if the fence being waited on (in
>> the cs_wait or wait_fences ioctl) is found to be in error; that way
>> the UMD would take the robustness extension path and conclude that a
>> GPU hang occurred.
>>
>> I don't know if this is expected for the case of a normal process
>> being killed or crashing, like the one Nicolai hit ... since no GPU
>> hang occurred there.
>>
>>
>> BR Monk
>>
>>
>>
>>
>> -----Original Message-----
>> From: amd-gfx [mailto:amd-gfx-bounces at lists.freedesktop.org] On
>> Behalf Of Christian König
>> Sent: 28 September 2017 23:01
>> To: Nicolai Hähnle <nhaehnle at gmail.com>;
>> amd-gfx at lists.freedesktop.org
>> Cc: Haehnle, Nicolai <Nicolai.Haehnle at amd.com>
>> Subject: Re: [PATCH 5/5] drm/amd/sched: signal and free remaining
>> fences in amd_sched_entity_fini
>>
>> On 28.09.2017 at 16:55, Nicolai Hähnle wrote:
>>> From: Nicolai Hähnle <nicolai.haehnle at amd.com>
>>>
>>> Highly concurrent Piglit runs can trigger a race condition where a
>>> pending SDMA job on a buffer object is never executed because the
>>> corresponding process is killed (perhaps due to a crash). Since the
>>> job's fences were never signaled, the buffer object was effectively
>>> leaked. Worse, the buffer was stuck wherever it happened to be at
>>> the time, possibly in VRAM.
>>>
>>> The symptom was user space processes stuck in interruptible waits
>>> with kernel stacks like:
>>>
>>>      [<ffffffffbc5e6722>] dma_fence_default_wait+0x112/0x250
>>>      [<ffffffffbc5e6399>] dma_fence_wait_timeout+0x39/0xf0
>>>      [<ffffffffbc5e82d2>] reservation_object_wait_timeout_rcu+0x1c2/0x300
>>>      [<ffffffffc03ce56f>] ttm_bo_cleanup_refs_and_unlock+0xff/0x1a0 [ttm]
>>>      [<ffffffffc03cf1ea>] ttm_mem_evict_first+0xba/0x1a0 [ttm]
>>>      [<ffffffffc03cf611>] ttm_bo_mem_space+0x341/0x4c0 [ttm]
>>>      [<ffffffffc03cfc54>] ttm_bo_validate+0xd4/0x150 [ttm]
>>>      [<ffffffffc03cffbd>] ttm_bo_init_reserved+0x2ed/0x420 [ttm]
>>>      [<ffffffffc042f523>] amdgpu_bo_create_restricted+0x1f3/0x470 [amdgpu]
>>>      [<ffffffffc042f9fa>] amdgpu_bo_create+0xda/0x220 [amdgpu]
>>>      [<ffffffffc04349ea>] amdgpu_gem_object_create+0xaa/0x140 [amdgpu]
>>>      [<ffffffffc0434f97>] amdgpu_gem_create_ioctl+0x97/0x120 [amdgpu]
>>>      [<ffffffffc037ddba>] drm_ioctl+0x1fa/0x480 [drm]
>>>      [<ffffffffc041904f>] amdgpu_drm_ioctl+0x4f/0x90 [amdgpu]
>>>      [<ffffffffbc23db33>] do_vfs_ioctl+0xa3/0x5f0
>>>      [<ffffffffbc23e0f9>] SyS_ioctl+0x79/0x90
>>>      [<ffffffffbc864ffb>] entry_SYSCALL_64_fastpath+0x1e/0xad
>>>      [<ffffffffffffffff>] 0xffffffffffffffff
>>>
>>> Signed-off-by: Nicolai Hähnle <nicolai.haehnle at amd.com>
>>> Acked-by: Christian König <christian.koenig at amd.com>
>>> ---
>>>   drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 7 ++++++-
>>>   1 file changed, 6 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>>> index 54eb77cffd9b..32a99e980d78 100644
>>> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>>> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>>> @@ -220,22 +220,27 @@ void amd_sched_entity_fini(struct amd_gpu_scheduler *sched,
>>>                       amd_sched_entity_is_idle(entity));
>>>       amd_sched_rq_remove_entity(rq, entity);
>>>       if (r) {
>>>           struct amd_sched_job *job;
>>>           /* Park the kernel for a moment to make sure it isn't processing
>>>            * our entity.
>>>            */
>>>           kthread_park(sched->thread);
>>>           kthread_unpark(sched->thread);
>>> -       while (kfifo_out(&entity->job_queue, &job, sizeof(job)))
>>> +       while (kfifo_out(&entity->job_queue, &job, sizeof(job))) {
>>> +           struct amd_sched_fence *s_fence = job->s_fence;
>>> +           amd_sched_fence_scheduled(s_fence);
>> It would be really nice to have an error code set on
>> s_fence->finished before it is signaled; use dma_fence_set_error()
>> for this.
>>
>> In addition to that, it would be nice to note in the subject line
>> that this is a rather important bug fix.
>>
>> With that fixed, the whole series is Reviewed-by: Christian König
>> <christian.koenig at amd.com>.
>>
>> Regards,
>> Christian.
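For reference, Christian's suggestion would slot in right before the fence is signaled, roughly as follows. This is an illustrative placement only, not part of the posted patch, and the errno passed to dma_fence_set_error() is an assumption, since the thread hasn't settled on a value:

    while (kfifo_out(&entity->job_queue, &job, sizeof(job))) {
            struct amd_sched_fence *s_fence = job->s_fence;

            amd_sched_fence_scheduled(s_fence);
            /* Record why the job never ran, so waiters observe an
             * error instead of a silently "successful" fence. Must
             * be called before the fence is signaled. */
            dma_fence_set_error(&s_fence->finished, -ECANCELED);
            amd_sched_fence_finished(s_fence);
            dma_fence_put(&s_fence->finished);
            sched->ops->free_job(job);
    }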
>>
>>> +           amd_sched_fence_finished(s_fence);
>>> +           dma_fence_put(&s_fence->finished);
>>>               sched->ops->free_job(job);
>>> +       }
>>>       }
>>>       kfifo_free(&entity->job_queue);
>>>   }
>>>
>>>   static void amd_sched_entity_wakeup(struct dma_fence *f, struct dma_fence_cb *cb)
>>>   {
>>>       struct amd_sched_entity *entity =
>>>           container_of(cb, struct amd_sched_entity, cb);
>>>       entity->dependency = NULL;
>>
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx at lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx