I think we should stick to the plan where kernel contexts stay "stuck"
after a GPU reset. This is the most robust behavior for the kernel. Even
if the OpenGL spec says that an OpenGL context can be re-used without
destroying and re-creating it, the UMD can take care of re-creating the
kernel context. This means amdgpu_ctx_query should *not* reset
ctx->reset_counter.

Cheers,
Nicolai

On 12.10.2017 10:41, Nicolai Hähnle wrote:
> Hi Monk,
>
> Thanks for the summary. Most of it looks good to me, though I can't
> speak to all the kernel internals.
>
> Just some comments:
>
> On 12.10.2017 10:03, Liu, Monk wrote:
>> * For the cs_submit() IOCTL:
>>
>> 1. Check whether the current ctx has been marked "*guilty*" and
>> return "*ECANCELED*" if so.
>>
>> 2. Set job->*vram_lost_counter* to adev->*vram_lost_counter*, and
>> return "*ECANCELED*" if ctx->*vram_lost_counter* !=
>> job->*vram_lost_counter* (Christian already submitted this patch).
>>
>> a) Discussion: can we return "ENODEV" if vram_lost_counter
>> mismatches? That way the UMD knows this context is in a "device
>> lost" state.
>
> My plan for the UMD is to always query the VRAM lost counter when any
> kind of context lost situation is detected. So cs_submit() should
> return an error in this situation, but it could just be ECANCELED. We
> don't need to distinguish between different types of errors here.
>
>
>> * Introduce a new IOCTL to let the UMD query the latest
>> adev->*vram_lost_counter*:
>
> Christian already sent a patch for this.
>
>
>> * For amdgpu_ctx_query():
>>
>> - *Don't update ctx->reset_counter when querying this function,
>> otherwise the query result is not consistent.*
>
> Hmm. I misremembered part of the spec, see below.
>
>
>> - Set out->state.reset_status to "AMDGPU_CTX_GUILTY_RESET" if the
>> ctx is "*guilty*"; no need to check "ctx->reset_counter".
>
> Agreed.
>
>
>> - Set out->state.reset_status to "AMDGPU_CTX_INNOCENT_RESET" *if the
>> ctx isn't "guilty" && ctx->reset_counter != adev->reset_counter.*
>
> I disagree. The meaning of AMDGPU_CTX_*_RESET should reflect the
> corresponding enums in user space APIs. I don't know how it works in
> Vulkan, but in OpenGL, returning GL_INNOCENT_CONTEXT_RESET_ARB means
> that the context was lost.
>
> However, !guilty && ctx->reset_counter != adev->reset_counter does not
> imply that the context was lost.
>
> The way I understand it, we should return AMDGPU_CTX_INNOCENT_RESET if
> !guilty && ctx->vram_lost_counter != adev->vram_lost_counter.
>
> As far as I understand it, the case of !guilty && ctx->reset_counter !=
> adev->reset_counter && ctx->vram_lost_counter == adev->vram_lost_counter
> should return AMDGPU_CTX_NO_RESET, because a GPU reset occurred, but it
> didn't affect our context.
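>
> To make that concrete, the decision logic I have in mind is roughly
> the following sketch. The "guilty" flag stands for whatever
> per-context marker we end up storing; it is not an existing field
> today:
>
>     if (ctx->guilty)
>             out->state.reset_status = AMDGPU_CTX_GUILTY_RESET;
>     else if (ctx->vram_lost_counter != adev->vram_lost_counter)
>             out->state.reset_status = AMDGPU_CTX_INNOCENT_RESET;
>     else
>             /* a reset may have happened, but it didn't affect us */
>             out->state.reset_status = AMDGPU_CTX_NO_RESET;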
>
> I unfortunately noticed another subtlety while re-reading the OpenGL
> spec. OpenGL says that the OpenGL context itself does *not* have to be
> re-created in order to recover from the reset. Re-creating all objects
> in the context is sufficient.
>
> I believe this is the original motivation for why amdgpu_ctx_query()
> will reset ctx->reset_counter.
>
> For Mesa, it's still okay if the kernel keeps blocking submissions, as
> we can just recreate the kernel context. But OrcaGL is also affected.
>
> Does anybody know off-hand where the relevant parts of the Vulkan spec
> are? I didn't actually find anything in a quick search.
>
>
> [snip]
>> For UMD behavior, we still have something to consider:
>>
>> If Mesa creates a new context from an old context (share list? I'm
>> not familiar with the UMD; David Mao shall have some discussion on it
>> with Nicolai), the newly created context's vram_lost_counter and
>> reset_counter shall both be ported over from that old context,
>> otherwise CS_SUBMIT will not block it, which isn't correct.
>
> The kernel doesn't have to do anything for this, it is entirely the
> UMD's responsibility. All the UMD needs from the KMD is the function
> for querying the vram_lost_counter.
>
> Cheers,
> Nicolai
>
>
>>
>> Need your feedback, thx
>>
>> *From:* amd-gfx [mailto:amd-gfx-bounces at lists.freedesktop.org] *On
>> Behalf Of* Liu, Monk
>> *Sent:* October 11, 2017 13:34
>> *To:* Koenig, Christian <Christian.Koenig at amd.com>; Haehnle, Nicolai
>> <Nicolai.Haehnle at amd.com>; Olsak, Marek <Marek.Olsak at amd.com>;
>> Deucher, Alexander <Alexander.Deucher at amd.com>
>> *Cc:* Ramirez, Alejandro <Alejandro.Ramirez at amd.com>;
>> amd-gfx at lists.freedesktop.org; Filipas, Mario <Mario.Filipas at amd.com>;
>> Ding, Pixel <Pixel.Ding at amd.com>; Li, Bingley <Bingley.Li at amd.com>;
>> Jiang, Jerry (SW) <Jerry.Jiang at amd.com>
>> *Subject:* TDR and VRAM lost handling in KMD:
>>
>> Hi Christian & Nicolai,
>>
>> We need to reach agreement on what MESA/UMD should do and what the
>> KMD should do. *Please give your comments with "okay" or "No" and
>> your ideas on the items below.*
>>
>> * When a job times out (set from the lockup_timeout kernel
>> parameter), what the KMD should do in the TDR routine:
>>
>> 1. Update adev->*gpu_reset_counter* and stop the scheduler first
>> (*gpu_reset_counter* is used to force a VM flush after GPU reset;
>> that is out of this thread's scope, so no more discussion on it).
>>
>> 2. Set the job's fence error status to "*ETIME*".
>>
>> 3. Find the entity/ctx behind this job, and mark this ctx as
>> "*guilty*".
>>
>> 4. Kick this job out of the scheduler's mirror list, so it won't get
>> re-scheduled to the ring anymore.
>>
>> 5. Kick out all jobs in this "guilty" ctx's KFIFO queue, and set all
>> their fence statuses to "*ECANCELED*".
>>
>> *6.* Force-signal all fences that got kicked out by the above two
>> steps, *otherwise the UMD will block forever if waiting on those
>> fences.*
>>
>> 7. Do the GPU reset, which can be some callbacks so that bare-metal
>> and SR-IOV can each implement it in their preferred style.
>>
>> 8. After the reset, the KMD needs to know whether VRAM loss happened
>> or not. Bare-metal can implement some function to judge, while for
>> SR-IOV I prefer to read it from the GIM side (for the initial version
>> we consider it always VRAM lost, till the GIM side change is aligned).
>>
>> 9. If VRAM loss was not hit, continue; otherwise (see the sketch
>> after this list):
>>
>> a) Update adev->*vram_lost_counter*.
>>
>> b) Iterate over all living ctx, and mark all of them as "*guilty*",
>> since VRAM loss actually ruins all VRAM contents.
>>
>> c) Kick out all jobs in all ctx's KFIFO queues, and set all their
>> fence statuses to "*ECANCELED*".
>>
>> 10. Do GTT recovery and VRAM page tables/entries recovery (optional;
>> do we need it?).
>>
>> 11. Re-schedule all jobs remaining in the mirror list to the ring
>> again and restart the scheduler (for the VRAM lost case, no job will
>> be re-scheduled).
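>>
>> In pseudo code, the VRAM lost branch (steps 8/9) would look roughly
>> like below. Note that amdgpu_vram_lost(), adev->ctx_list and
>> amdgpu_ctx_cancel_jobs() are illustrative names for this sketch, not
>> existing amdgpu functions, and I assume vram_lost_counter is an
>> atomic counter:
>>
>>     struct amdgpu_ctx *ctx;
>>
>>     /* after the actual GPU reset (step 7) */
>>     if (amdgpu_vram_lost(adev)) {  /* bare-metal check or GIM query */
>>             atomic_inc(&adev->vram_lost_counter);            /* 9a */
>>             list_for_each_entry(ctx, &adev->ctx_list, list) {
>>                     ctx->guilty = true;                      /* 9b */
>>                     amdgpu_ctx_cancel_jobs(ctx, -ECANCELED); /* 9c */
>>             }
>>     }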
>>
>> * For the cs_wait() IOCTL:
>>
>> After it finds the fence signaled, it should check with
>> *"dma_fence_get_status"* to see if there is an error there, and
>> return the error status of the fence.
>>
>> * For the cs_wait_fences() IOCTL:
>>
>> Similar to the above approach.
>>
>> * For the cs_submit() IOCTL:
>>
>> It needs to check whether the current ctx has been marked "*guilty*"
>> and return "*ECANCELED*" if so.
>>
>> * Introduce a new IOCTL to let the UMD query *vram_lost_counter*:
>>
>> This way, the UMD can also block the app from submitting. Like
>> @Nicolai mentioned, we can cache one copy of *vram_lost_counter* when
>> enumerating the physical device, and deny all gl-contexts from
>> submitting if the queried counter is bigger than the one cached in
>> the physical device. (Looks a little overkill to me, but easy to
>> implement.)
>>
>> The UMD can also return an error to the app when creating a
>> gl-context if the currently queried *vram_lost_counter* is bigger
>> than the one cached in the physical device.
>>
>> BTW: I realized that a gl-context is a little different from the
>> kernel's context, because for the kernel a BO is related not to a
>> context but only to an FD, while in the UMD a BO has a backing
>> gl-context. So blocking submission in the UMD layer is also needed,
>> although the KMD will do its job as the bottom line.
>>
>> * Basically, "vram_lost_counter" is exposed by the kernel to let the
>> UMD take control of the robustness extension feature. It will be the
>> UMD's call to make; the KMD only denies "guilty" contexts from
>> submitting.
>>
>> Need your feedback, thx
>>
>> We'd better make the TDR feature land ASAP.
>>
>> BR Monk
>
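
P.S.: For the UMD side of this, the vram_lost_counter usage we have been
discussing boils down to something like the sketch below. I'm assuming
the counter is exposed through the existing INFO ioctl, e.g. as an
AMDGPU_INFO_VRAM_LOST_COUNTER query like in Christian's patch (the exact
name may differ), and "dev" is the libdrm amdgpu device handle; error
handling is omitted:

    /* at screen/physical-device creation: cache the current counter */
    uint64_t baseline, current;
    amdgpu_query_info(dev, AMDGPU_INFO_VRAM_LOST_COUNTER,
                      sizeof(baseline), &baseline);

    /* later, whenever a context-lost condition is suspected */
    amdgpu_query_info(dev, AMDGPU_INFO_VRAM_LOST_COUNTER,
                      sizeof(current), &current);
    if (current != baseline) {
            /* VRAM contents are gone: report the context as lost to
             * the app, then recreate the kernel context and all
             * buffer objects before allowing new submissions */
    }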