On 09.10.2017 12:59, Christian König wrote:
> Nicolai, how hard would it be to handle ENODEV as failure for all
> currently existing contexts? Impossible?

"All currently existing contexts" is not a well-defined concept when
multiple drivers co-exist in the same process.

And what would be the purpose of this? If it's to support VRAM loss,
having a per-context VRAM loss counter would enable each context to
signal ECANCELED separately.
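Roughly like this (a minimal sketch only, not actual driver code; the
structures and names dev_state, ctx_state and vram_lost_counter are made
up for illustration):

    #include <linux/atomic.h>
    #include <linux/errno.h>

    /* Device-global counter, bumped once per VRAM loss event. */
    struct dev_state {
            atomic_t vram_lost_counter;
    };

    /* Each context snapshots the counter when it is created. */
    struct ctx_state {
            int vram_lost_snapshot;
    };

    /* Called at the start of command submission for a context. */
    static int ctx_check_vram_lost(struct dev_state *dev,
                                   struct ctx_state *ctx)
    {
            /* VRAM was lost after this context was created, so its
             * buffer contents are gone: cancel this context only. */
            if (ctx->vram_lost_snapshot !=
                atomic_read(&dev->vram_lost_counter))
                    return -ECANCELED;
            return 0;
    }

A context created after the loss snapshots the new counter value and
keeps working, so only contexts whose buffers actually lost their
content ever see the error.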
Cheers,
Nicolai


> Monk, would it be ok with you if we return ENODEV only for contexts
> that already existed when VRAM was lost and/or we had a strict-mode
> GPU reset? I.e. newly created contexts would continue to work as they
> should.
>
> Regards,
> Christian.
>
> On 09.10.2017 12:49, Nicolai Hähnle wrote:
>> Hi Monk,
>>
>> Yes, you're right, we're only using ECANCELED internally. But as a
>> consequence, Mesa would already handle a kernel error of ECANCELED on
>> context loss correctly :)
>>
>> Cheers,
>> Nicolai
>>
>> On 09.10.2017 12:35, Liu, Monk wrote:
>>> Hi Christian,
>>>
>>> You rejected some of my patches that return -ENODEV, on the grounds
>>> that Mesa doesn't handle -ENODEV.
>>>
>>> But if Nicolai can confirm that Mesa does handle -ECANCELED, then we
>>> need to align our error codes overall. In detail, the following
>>> IOCTLs can return an error code:
>>>
>>> amdgpu_cs_ioctl
>>> amdgpu_cs_wait_ioctl
>>> amdgpu_cs_wait_fences_ioctl
>>> amdgpu_info_ioctl
>>>
>>> My patches do the following:
>>> return -ENODEV from cs_ioctl if the context is detected as guilty;
>>> also return -ENODEV from cs_wait|cs_wait_fences if the fence is
>>> signaled but with error -ETIME;
>>> also return -ENODEV from info_ioctl so the UMD can query whether a
>>> GPU reset happened after the process was created (because in strict
>>> mode we block the whole process instead of just the context).
>>>
>>> According to Nicolai:
>>>
>>> amdgpu_cs_ioctl *can* return -ECANCELED, but frankly speaking, the
>>> kernel side doesn't return "-ECANCELED" anywhere, so this solution
>>> on the Mesa side doesn't align with the *current* amdgpu driver,
>>> which only returns 0 on success or an error such as -EINVAL
>>> otherwise, but definitely never "-ECANCELED".
>>>
>>> So if we are talking about community rules, we shouldn't let Mesa
>>> handle -ECANCELED; we should have a unified error code.
>>>
>>> + Marek
>>>
>>> BR Monk
>>>
>>> -----Original Message-----
>>> From: Haehnle, Nicolai
>>> Sent: 2017-10-09 18:14
>>> To: Koenig, Christian <Christian.Koenig at amd.com>; Liu, Monk
>>> <Monk.Liu at amd.com>; Nicolai Hähnle <nhaehnle at gmail.com>;
>>> amd-gfx at lists.freedesktop.org
>>> Subject: Re: [PATCH 5/5] drm/amd/sched: signal and free remaining
>>> fences in amd_sched_entity_fini
>>>
>>> On 09.10.2017 10:02, Christian König wrote:
>>>>> For the gpu reset patches (already submitted to the public list) I
>>>>> would make the kernel return -ENODEV if the awaited fence (in the
>>>>> cs_wait or wait_fences IOCTL) is found to be in error; that way
>>>>> the UMD runs into the robustness extension path and treats it as a
>>>>> GPU hang.
>>>> Well that is only closed source behavior which is completely
>>>> irrelevant for upstream development.
>>>>
>>>> As far as I know we haven't pushed the change to return -ENODEV
>>>> upstream.
>>>
>>> FWIW, radeonsi currently expects -ECANCELED on CS submissions and
>>> treats those as context lost. Perhaps we could use the same error on
>>> fences? That makes more sense to me than -ENODEV.
>>>
>>> Cheers,
>>> Nicolai
>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>> On 09.10.2017 08:42, Liu, Monk wrote:
>>>>> Christian,
>>>>>
>>>>>> It would be really nice to have an error code set on
>>>>>> s_fence->finished before it is signaled, use
>>>>>> dma_fence_set_error() for this.
>>>>> For the gpu reset patches (already submitted to the public list) I
>>>>> would make the kernel return -ENODEV if the awaited fence (in the
>>>>> cs_wait or wait_fences IOCTL) is found to be in error; that way
>>>>> the UMD runs into the robustness extension path and treats it as a
>>>>> GPU hang.
>>>>>
>>>>> I don't know if this is expected for the case of a normal process
>>>>> being killed or crashing, like the one Nicolai hit ... since no
>>>>> GPU hang occurred there.
>>>>>
>>>>> BR Monk
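To make the wait side concrete: a fence whose job was dropped or hung
signals with an error recorded via dma_fence_set_error(), and the wait
ioctl maps that to an errno before returning to user space. A sketch
(not the actual amdgpu ioctl; the helper name and the -ETIME mapping
are made up, and whether the fence error should surface as -ENODEV or
-ECANCELED is exactly the open question of this thread):

    #include <linux/dma-fence.h>
    #include <linux/errno.h>

    /* Sketch: surface a fence error through a wait-style ioctl path. */
    static long sketch_wait_fence(struct dma_fence *fence, long timeout)
    {
            long r = dma_fence_wait_timeout(fence, true, timeout);

            if (r < 0)
                    return r;       /* interrupted by a signal */
            if (r == 0)
                    return -ETIME;  /* not signaled within the timeout */

            /* The fence signaled, but carries an error set with
             * dma_fence_set_error() (e.g. -ETIME for a hung job).
             * Map it to the errno user space should see: -ENODEV in
             * Monk's proposal, -ECANCELED if we match radeonsi. */
            if (fence->error)
                    return -ENODEV;

            return 0;
    }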
>>>>>
>>>>> -----Original Message-----
>>>>> From: amd-gfx [mailto:amd-gfx-bounces at lists.freedesktop.org] On
>>>>> Behalf Of Christian König
>>>>> Sent: 2017-09-28 23:01
>>>>> To: Nicolai Hähnle <nhaehnle at gmail.com>;
>>>>> amd-gfx at lists.freedesktop.org
>>>>> Cc: Haehnle, Nicolai <Nicolai.Haehnle at amd.com>
>>>>> Subject: Re: [PATCH 5/5] drm/amd/sched: signal and free remaining
>>>>> fences in amd_sched_entity_fini
>>>>>
>>>>> On 28.09.2017 16:55, Nicolai Hähnle wrote:
>>>>>> From: Nicolai Hähnle <nicolai.haehnle at amd.com>
>>>>>>
>>>>>> Highly concurrent Piglit runs can trigger a race condition where a
>>>>>> pending SDMA job on a buffer object is never executed because the
>>>>>> corresponding process is killed (perhaps due to a crash). Since
>>>>>> the job's fences were never signaled, the buffer object was
>>>>>> effectively leaked. Worse, the buffer was stuck wherever it
>>>>>> happened to be at the time, possibly in VRAM.
>>>>>>
>>>>>> The symptom was user space processes stuck in interruptible waits
>>>>>> with kernel stacks like:
>>>>>>
>>>>>>     [<ffffffffbc5e6722>] dma_fence_default_wait+0x112/0x250
>>>>>>     [<ffffffffbc5e6399>] dma_fence_wait_timeout+0x39/0xf0
>>>>>>     [<ffffffffbc5e82d2>] reservation_object_wait_timeout_rcu+0x1c2/0x300
>>>>>>     [<ffffffffc03ce56f>] ttm_bo_cleanup_refs_and_unlock+0xff/0x1a0 [ttm]
>>>>>>     [<ffffffffc03cf1ea>] ttm_mem_evict_first+0xba/0x1a0 [ttm]
>>>>>>     [<ffffffffc03cf611>] ttm_bo_mem_space+0x341/0x4c0 [ttm]
>>>>>>     [<ffffffffc03cfc54>] ttm_bo_validate+0xd4/0x150 [ttm]
>>>>>>     [<ffffffffc03cffbd>] ttm_bo_init_reserved+0x2ed/0x420 [ttm]
>>>>>>     [<ffffffffc042f523>] amdgpu_bo_create_restricted+0x1f3/0x470 [amdgpu]
>>>>>>     [<ffffffffc042f9fa>] amdgpu_bo_create+0xda/0x220 [amdgpu]
>>>>>>     [<ffffffffc04349ea>] amdgpu_gem_object_create+0xaa/0x140 [amdgpu]
>>>>>>     [<ffffffffc0434f97>] amdgpu_gem_create_ioctl+0x97/0x120 [amdgpu]
>>>>>>     [<ffffffffc037ddba>] drm_ioctl+0x1fa/0x480 [drm]
>>>>>>     [<ffffffffc041904f>] amdgpu_drm_ioctl+0x4f/0x90 [amdgpu]
>>>>>>     [<ffffffffbc23db33>] do_vfs_ioctl+0xa3/0x5f0
>>>>>>     [<ffffffffbc23e0f9>] SyS_ioctl+0x79/0x90
>>>>>>     [<ffffffffbc864ffb>] entry_SYSCALL_64_fastpath+0x1e/0xad
>>>>>>     [<ffffffffffffffff>] 0xffffffffffffffff
>>>>>>
>>>>>> Signed-off-by: Nicolai Hähnle <nicolai.haehnle at amd.com>
>>>>>> Acked-by: Christian König <christian.koenig at amd.com>
>>>>>> ---
>>>>>>   drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 7 ++++++-
>>>>>>   1 file changed, 6 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>>>>>> b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>>>>>> index 54eb77cffd9b..32a99e980d78 100644
>>>>>> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>>>>>> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>>>>>> @@ -220,22 +220,27 @@ void amd_sched_entity_fini(struct amd_gpu_scheduler *sched,
>>>>>>                       amd_sched_entity_is_idle(entity));
>>>>>>       amd_sched_rq_remove_entity(rq, entity);
>>>>>>       if (r) {
>>>>>>           struct amd_sched_job *job;
>>>>>>
>>>>>>           /* Park the kernel for a moment to make sure it isn't
>>>>>>            * processing our enity.
>>>>>>            */
>>>>>>           kthread_park(sched->thread);
>>>>>>           kthread_unpark(sched->thread);
>>>>>> -        while (kfifo_out(&entity->job_queue, &job, sizeof(job)))
>>>>>> +        while (kfifo_out(&entity->job_queue, &job, sizeof(job))) {
>>>>>> +            struct amd_sched_fence *s_fence = job->s_fence;
>>>>>> +
>>>>>> +            amd_sched_fence_scheduled(s_fence);
>>>>>
>>>>> It would be really nice to have an error code set on
>>>>> s_fence->finished before it is signaled; use dma_fence_set_error()
>>>>> for this.
>>>>>
>>>>> In addition, it would be nice to note in the subject line that
>>>>> this is a rather important bug fix.
>>>>>
>>>>> With that fixed, the whole series is Reviewed-by: Christian König
>>>>> <christian.koenig at amd.com>.
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>> +            amd_sched_fence_finished(s_fence);
>>>>>> +            dma_fence_put(&s_fence->finished);
>>>>>>               sched->ops->free_job(job);
>>>>>> +        }
>>>>>>       }
>>>>>>       kfifo_free(&entity->job_queue);
>>>>>>   }
>>>>>>
>>>>>>   static void amd_sched_entity_wakeup(struct dma_fence *f, struct dma_fence_cb *cb)
>>>>>>   {
>>>>>>       struct amd_sched_entity *entity =
>>>>>>           container_of(cb, struct amd_sched_entity, cb);
>>>>>>
>>>>>>       entity->dependency = NULL;
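For completeness, Christian's dma_fence_set_error() suggestion applied
to the cleanup loop of the patch above (a sketch; the helper name is
made up, the types come from the scheduler's gpu_scheduler.h, and
-ECANCELED is only a placeholder until this thread settles on an error
code):

    #include <linux/dma-fence.h>
    #include <linux/kfifo.h>

    static void drain_entity_jobs(struct amd_gpu_scheduler *sched,
                                  struct amd_sched_entity *entity)
    {
            struct amd_sched_job *job;

            while (kfifo_out(&entity->job_queue, &job, sizeof(job))) {
                    struct amd_sched_fence *s_fence = job->s_fence;

                    amd_sched_fence_scheduled(s_fence);
                    /* Record the error before amd_sched_fence_finished()
                     * signals the fence; an error cannot be set on a
                     * fence that has already signaled. */
                    dma_fence_set_error(&s_fence->finished, -ECANCELED);
                    amd_sched_fence_finished(s_fence);
                    dma_fence_put(&s_fence->finished);
                    sched->ops->free_job(job);
            }
    }

That way waiters can tell these never-executed jobs apart from normally
completed ones, which is what the wait-side sketch earlier in this mail
keys off.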