On 09.10.2017 14:33, Christian König wrote:
> Am 09.10.2017 um 13:27 schrieb Nicolai Hähnle:
>> On 09.10.2017 13:12, Christian König wrote:
>>>>
>>>>> Nicolai, how hard would it be to handle ENODEV as failure for all
>>>>> currently existing contexts?
>>>>
>>>> Impossible? "All currently existing contexts" is not a well-defined
>>>> concept when multiple drivers co-exist in the same process.
>>>
>>> Ok, let me refine the question: I assume there are resources "shared"
>>> between contexts, like binary shader code for example, which need to
>>> be re-uploaded when VRAM is lost.
>>>
>>> How hard would it be to handle that correctly?
>>
>> Okay, that makes more sense :)
>>
>> With the current interface it's still pretty difficult, but if we
>> could get a new per-device query ioctl which returns a "VRAM loss
>> counter", it would be reasonably straightforward.
>
> The problem with the VRAM lost counter is that this isn't safe either.
> E.g. you could have an application which uploads shaders, a GPU reset
> happens and VRAM is lost, and then the application creates a new
> context and makes a submission with broken shader binaries.

Hmm. Here's how I imagined we'd be using a VRAM lost counter:

int si_shader_binary_upload(...)
{
   ...
   shader->bo_vram_lost_counter = sscreen->vram_lost_counter;
   shader->bo = pipe_buffer_create(...);
   ptr = sscreen->b.ws->buffer_map(shader->bo->buf, ...);
   ... copies ...
   sscreen->b.ws->buffer_unmap(shader->bo->buf);
}

int si_shader_select(...)
{
   ...
   r = si_shader_select_with_key(ctx->sscreen, state, ...);
   if (r)
      return r;

   if (state->current->bo_vram_lost_counter !=
       ctx->sscreen->vram_lost_counter) {
      ... re-upload sequence ...
   }
}

(Not shown: logic that compares ctx->vram_lost_counter with
sscreen->vram_lost_counter and forces a re-validation of all state,
including shaders.)

That should cover this scenario, shouldn't it?

Oh... I see one problem. But it should be easy to fix: when creating a
new amdgpu context, Mesa needs to query the VRAM lost counter. So then
the sequence of events would be either:

- VRAM lost counter starts at 0
- Mesa uploads a shader binary
- Unrelated GPU reset happens, kernel increments VRAM lost counter to 1
- Mesa creates a new amdgpu context, queries the VRAM lost counter --> 1
- si_screen::vram_lost_counter is updated to 1
- Draw happens on the new context --> si_shader_select will catch the
  VRAM loss

Or:

- VRAM lost counter starts at 0
- Mesa uploads a shader binary
- Mesa creates a new amdgpu context, VRAM lost counter still 0
- Unrelated GPU reset happens, kernel increments VRAM lost counter to 1
- Draw happens on the new context and proceeds normally ...
- Mesa flushes the CS, and the kernel returns an error code because the
  device VRAM lost counter differs from the amdgpu context's VRAM lost
  counter

> So I would still vote for a separate IOCTL to reset the VRAM lost
> state which is called *before* user space starts to re-upload
> shaders/descriptors etc...

The question is: is that separate IOCTL per-context or per-fd? If it's
per-fd, then it's not compatible with multiple drivers. If it's
per-context, then I don't see how it helps. Perhaps you could explain?

> This way you also catch the case when another reset happens while you
> re-upload things.

My assumption would be that the re-upload happens *after* the new amdgpu
context is created.
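For reference, the query at context-creation time could be as trivial as
the sketch below. To be clear, this is only a sketch: the query id
AMDGPU_INFO_VRAM_LOST_COUNTER_PROPOSED stands in for the new per-device
query we're discussing (it doesn't exist in amdgpu_drm.h today, and the
numeric value is made up), while amdgpu_query_info() is the existing
libdrm_amdgpu entry point for AMDGPU_INFO_* queries:

#include <stdint.h>
#include <amdgpu.h>  /* libdrm_amdgpu: amdgpu_device_handle, amdgpu_query_info() */

/* Hypothetical id for the proposed per-device VRAM lost counter query;
 * value chosen arbitrarily for illustration. */
#define AMDGPU_INFO_VRAM_LOST_COUNTER_PROPOSED 0x1f

static uint32_t ws_query_vram_lost_counter(amdgpu_device_handle dev)
{
   uint32_t counter = 0;

   /* On an old kernel the query fails and we keep 0, i.e. we behave as
    * if VRAM had never been lost. */
   if (amdgpu_query_info(dev, AMDGPU_INFO_VRAM_LOST_COUNTER_PROPOSED,
                         sizeof(counter), &counter) != 0)
      return 0;

   return counter;
}

The winsys would call this once when the screen is created and again
whenever an amdgpu context is created, storing the result in
si_screen::vram_lost_counter and in the new context respectively, so
both sequences above fall out naturally.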
Then the repeat reset should be caught by the kernel when we try to
submit a CS on the new context (this is assuming that the kernel
context's VRAM lost counter is initialized properly when the context is
created):

- Mesa prepares upload, sets shader->bo_vram_lost_counter to 0
- Mesa uploads a shader binary
- While doing this, a GPU reset happens[0], kernel increments device
  VRAM lost counter to 1
- Draw happens with the new shader, Mesa proceeds normally ...
- On flush / CS submit, the kernel detects the VRAM lost state and
  returns an error to Mesa

[0] Out of curiosity: what happens on the CPU side if the PCI / full
ASIC reset method is used? Is there a time window where we could get a
SEGV?

[snip]
>> BTW, I still don't like ENODEV. It seems more like the kind of error
>> code you'd return with hot-pluggable GPUs where the device can
>> physically disappear...
>
> Yeah, ECANCELED sounds like a better alternative. But I think we
> should still somehow note the fatality of losing VRAM to userspace.
>
> How about ENODATA or EBADFD?

According to the manpage, EBADFD is "File descriptor in bad state."
Sounds fitting :)

Cheers,
Nicolai

>
> Regards,
> Christian.
>
>>
>> Cheers,
>> Nicolai
>>
>>
>>>
>>> Regards,
>>> Christian.
>>>
>>> Am 09.10.2017 um 13:04 schrieb Nicolai Hähnle:
>>>> On 09.10.2017 12:59, Christian König wrote:
>>>>> Nicolai, how hard would it be to handle ENODEV as failure for all
>>>>> currently existing contexts?
>>>>
>>>> Impossible? "All currently existing contexts" is not a well-defined
>>>> concept when multiple drivers co-exist in the same process.
>>>>
>>>> And what would be the purpose of this? If it's to support VRAM
>>>> loss, having a per-context VRAM loss counter would enable each
>>>> context to signal ECANCELED separately.
>>>>
>>>> Cheers,
>>>> Nicolai
>>>>
>>>>
>>>>>
>>>>> Monk, would it be ok with you if we return ENODEV only for
>>>>> existing contexts when VRAM is lost and/or we have a strict mode
>>>>> GPU reset? E.g. newly created contexts would continue to work as
>>>>> they should.
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>> Am 09.10.2017 um 12:49 schrieb Nicolai Hähnle:
>>>>>> Hi Monk,
>>>>>>
>>>>>> Yes, you're right, we're only using ECANCELED internally.
>>>>>> But as a consequence, Mesa would already handle a kernel error of
>>>>>> ECANCELED on context loss correctly :)
>>>>>>
>>>>>> Cheers,
>>>>>> Nicolai
>>>>>>
>>>>>> On 09.10.2017 12:35, Liu, Monk wrote:
>>>>>>> Hi Christian
>>>>>>>
>>>>>>> You rejected some of my patches that return -ENODEV, with the
>>>>>>> reasoning that MESA doesn't do any handling of -ENODEV.
>>>>>>>
>>>>>>> But if Nicolai can confirm that MESA does have handling for
>>>>>>> -ECANCELED, then we need to align our error codes overall. In
>>>>>>> detail, the following IOCTLs can return an error code:
>>>>>>>
>>>>>>> Amdgpu_cs_ioctl
>>>>>>> Amdgpu_cs_wait_ioctl
>>>>>>> Amdgpu_cs_wait_fences_ioctl
>>>>>>> Amdgpu_info_ioctl
>>>>>>>
>>>>>>>
>>>>>>> My patches are:
>>>>>>> return -ENODEV on cs_ioctl if the context is detected guilty,
>>>>>>> also return -ENODEV on cs_wait|cs_wait_fences if the fence is
>>>>>>> signaled but with error -ETIME,
>>>>>>> also return -ENODEV on info_ioctl so UMD can query if a gpu
>>>>>>> reset happened after the process was created (because for strict
>>>>>>> mode we block the process instead of the context)
>>>>>>>
>>>>>>>
>>>>>>> According to Nicolai:
>>>>>>>
>>>>>>> amdgpu_cs_ioctl *can* return -ECANCELED, but frankly speaking,
>>>>>>> the kernel part doesn't have any place with "-ECANCELED", so
>>>>>>> this solution on the MESA side doesn't align with the *current*
>>>>>>> amdgpu driver, which only returns 0 on success or -EINVAL on
>>>>>>> other errors but definitely no "-ECANCELED" error code.
>>>>>>>
>>>>>>> So if we're talking about community rules, we shouldn't let MESA
>>>>>>> handle -ECANCELED; we should have a unified error code.
>>>>>>>
>>>>>>> + Marek
>>>>>>>
>>>>>>> BR Monk
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Haehnle, Nicolai
>>>>>>> Sent: 2017-10-09 18:14
>>>>>>> To: Koenig, Christian <Christian.Koenig at amd.com>; Liu, Monk
>>>>>>> <Monk.Liu at amd.com>; Nicolai Hähnle <nhaehnle at gmail.com>;
>>>>>>> amd-gfx at lists.freedesktop.org
>>>>>>> Subject: Re: [PATCH 5/5] drm/amd/sched: signal and free remaining
>>>>>>> fences in amd_sched_entity_fini
>>>>>>>
>>>>>>> On 09.10.2017 10:02, Christian König wrote:
>>>>>>>>> For gpu reset patches (already submitted to pub) I would make
>>>>>>>>> the kernel return -ENODEV if the waiting fence (in cs_wait or
>>>>>>>>> wait_fences IOCTL) is found to be in error, that way UMD would
>>>>>>>>> run into the robustness extension path and consider that a GPU
>>>>>>>>> hang occurred,
>>>>>>>> Well that is only closed source behavior which is completely
>>>>>>>> irrelevant for upstream development.
>>>>>>>>
>>>>>>>> As far as I know we haven't pushed the change to return -ENODEV
>>>>>>>> upstream.
>>>>>>>
>>>>>>> FWIW, radeonsi currently expects -ECANCELED on CS submissions and
>>>>>>> treats those as context lost. Perhaps we could use the same error
>>>>>>> on fences? That makes more sense to me than -ENODEV.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Nicolai
>>>>>>>
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Christian.
>>>>>>>>
>>>>>>>> Am 09.10.2017 um 08:42 schrieb Liu, Monk:
>>>>>>>>> Christian
>>>>>>>>>
>>>>>>>>>> It would be really nice to have an error code set on
>>>>>>>>>> s_fence->finished before it is signaled, use
>>>>>>>>>> dma_fence_set_error() for this.
>>>>>>>>> For gpu reset patches (already submitted to pub) I would make
>>>>>>>>> the kernel return -ENODEV if the waiting fence (in cs_wait or
>>>>>>>>> wait_fences IOCTL) is found to be in error, that way UMD would
>>>>>>>>> run into the robustness extension path and consider that a GPU
>>>>>>>>> hang occurred,
>>>>>>>>>
>>>>>>>>> Don't know if this is expected for the case of a normal process
>>>>>>>>> being killed or crashing like Nicolai hit ... since there is no
>>>>>>>>> gpu hang hit
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> BR Monk
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: amd-gfx [mailto:amd-gfx-bounces at lists.freedesktop.org]
>>>>>>>>> On Behalf Of Christian König
>>>>>>>>> Sent: 2017-09-28 23:01
>>>>>>>>> To: Nicolai Hähnle <nhaehnle at gmail.com>;
>>>>>>>>> amd-gfx at lists.freedesktop.org
>>>>>>>>> Cc: Haehnle, Nicolai <Nicolai.Haehnle at amd.com>
>>>>>>>>> Subject: Re: [PATCH 5/5] drm/amd/sched: signal and free
>>>>>>>>> remaining fences in amd_sched_entity_fini
>>>>>>>>>
>>>>>>>>> Am 28.09.2017 um 16:55 schrieb Nicolai Hähnle:
>>>>>>>>>> From: Nicolai Hähnle <nicolai.haehnle at amd.com>
>>>>>>>>>>
>>>>>>>>>> Highly concurrent Piglit runs can trigger a race condition
>>>>>>>>>> where a pending SDMA job on a buffer object is never executed
>>>>>>>>>> because the corresponding process is killed (perhaps due to a
>>>>>>>>>> crash). Since the job's fences were never signaled, the buffer
>>>>>>>>>> object was effectively leaked. Worse, the buffer was stuck
>>>>>>>>>> wherever it happened to be at the time, possibly in VRAM.
>>>>>>>>>>
>>>>>>>>>> The symptom was user space processes stuck in interruptible
>>>>>>>>>> waits with kernel stacks like:
>>>>>>>>>>
>>>>>>>>>>    [<ffffffffbc5e6722>] dma_fence_default_wait+0x112/0x250
>>>>>>>>>>    [<ffffffffbc5e6399>] dma_fence_wait_timeout+0x39/0xf0
>>>>>>>>>>    [<ffffffffbc5e82d2>] reservation_object_wait_timeout_rcu+0x1c2/0x300
>>>>>>>>>>    [<ffffffffc03ce56f>] ttm_bo_cleanup_refs_and_unlock+0xff/0x1a0 [ttm]
>>>>>>>>>>    [<ffffffffc03cf1ea>] ttm_mem_evict_first+0xba/0x1a0 [ttm]
>>>>>>>>>>    [<ffffffffc03cf611>] ttm_bo_mem_space+0x341/0x4c0 [ttm]
>>>>>>>>>>    [<ffffffffc03cfc54>] ttm_bo_validate+0xd4/0x150 [ttm]
>>>>>>>>>>    [<ffffffffc03cffbd>] ttm_bo_init_reserved+0x2ed/0x420 [ttm]
>>>>>>>>>>    [<ffffffffc042f523>] amdgpu_bo_create_restricted+0x1f3/0x470 [amdgpu]
>>>>>>>>>>    [<ffffffffc042f9fa>] amdgpu_bo_create+0xda/0x220 [amdgpu]
>>>>>>>>>>    [<ffffffffc04349ea>] amdgpu_gem_object_create+0xaa/0x140 [amdgpu]
>>>>>>>>>>    [<ffffffffc0434f97>] amdgpu_gem_create_ioctl+0x97/0x120 [amdgpu]
>>>>>>>>>>    [<ffffffffc037ddba>] drm_ioctl+0x1fa/0x480 [drm]
>>>>>>>>>>    [<ffffffffc041904f>] amdgpu_drm_ioctl+0x4f/0x90 [amdgpu]
>>>>>>>>>>    [<ffffffffbc23db33>] do_vfs_ioctl+0xa3/0x5f0
>>>>>>>>>>    [<ffffffffbc23e0f9>] SyS_ioctl+0x79/0x90
>>>>>>>>>>    [<ffffffffbc864ffb>] entry_SYSCALL_64_fastpath+0x1e/0xad
>>>>>>>>>>    [<ffffffffffffffff>] 0xffffffffffffffff
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Nicolai Hähnle <nicolai.haehnle at amd.com>
>>>>>>>>>> Acked-by: Christian König <christian.koenig at amd.com>
>>>>>>>>>> ---
>>>>>>>>>>   drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 7 ++++++-
>>>>>>>>>>   1 file changed, 6 insertions(+), 1 deletion(-)
>>>>>>>>>>
>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>>>>>>>>>> index 54eb77cffd9b..32a99e980d78 100644
>>>>>>>>>> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>>>>>>>>>> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
>>>>>>>>>> @@ -220,22 +220,27 @@ void amd_sched_entity_fini(struct amd_gpu_scheduler *sched,
>>>>>>>>>>            amd_sched_entity_is_idle(entity));
>>>>>>>>>>        amd_sched_rq_remove_entity(rq, entity);
>>>>>>>>>>        if (r) {
>>>>>>>>>>            struct amd_sched_job *job;
>>>>>>>>>>            /* Park the kernel for a moment to make sure it isn't processing
>>>>>>>>>>             * our enity.
>>>>>>>>>>             */
>>>>>>>>>>            kthread_park(sched->thread);
>>>>>>>>>>            kthread_unpark(sched->thread);
>>>>>>>>>> -       while (kfifo_out(&entity->job_queue, &job, sizeof(job)))
>>>>>>>>>> +       while (kfifo_out(&entity->job_queue, &job, sizeof(job))) {
>>>>>>>>>> +           struct amd_sched_fence *s_fence = job->s_fence;
>>>>>>>>>> +           amd_sched_fence_scheduled(s_fence);
>>>>>>>>> It would be really nice to have an error code set on
>>>>>>>>> s_fence->finished before it is signaled, use
>>>>>>>>> dma_fence_set_error() for this.
>>>>>>>>>
>>>>>>>>> Additional to that it would be nice to note in the subject line
>>>>>>>>> that this is a rather important bug fix.
>>>>>>>>>
>>>>>>>>> With that fixed the whole series is Reviewed-by: Christian König
>>>>>>>>> <christian.koenig at amd.com>.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Christian.
>>>>>>>>>
>>>>>>>>>> +           amd_sched_fence_finished(s_fence);
>>>>>>>>>> +           dma_fence_put(&s_fence->finished);
>>>>>>>>>>                sched->ops->free_job(job);
>>>>>>>>>> +       }
>>>>>>>>>>        }
>>>>>>>>>>        kfifo_free(&entity->job_queue);
>>>>>>>>>>    }
>>>>>>>>>>    static void amd_sched_entity_wakeup(struct dma_fence *f, struct dma_fence_cb *cb)
>>>>>>>>>>    {
>>>>>>>>>>        struct amd_sched_entity *entity =
>>>>>>>>>>            container_of(cb, struct amd_sched_entity, cb);
>>>>>>>>>>        entity->dependency = NULL;
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> amd-gfx mailing list
>>>>>>>>> amd-gfx at lists.freedesktop.org
>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>>>>>> _______________________________________________
>>>>>>>>> amd-gfx mailing list
>>>>>>>>> amd-gfx at lists.freedesktop.org
>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>