TDR and VRAM lost handling in KMD (v2)

nicolai.haehnle@xxxxxxx (Nicolai Hähnle) · Thu, 12 Oct 2017 11:10:30 +0200

On 12.10.2017 10:49, Christian KÃ¶nig wrote:
>> However, !guilty && ctx->reset_counter != adev->reset_counter does not 
>> imply that the context was lost.
>>
>> The way I understand it, we should return AMDGPU_CTX_INNOCENT_RESET if 
>> !guilty && ctx->vram_lost_counter != adev->vram_lost_counter.
>>
>> As far as I understand it, the case of !guilty && ctx->reset_counter 
>> != adev->reset_counter && ctx->vram_lost_counter == 
>> adev->vram_lost_counter should return AMDGPU_CTX_NO_RESET, because a 
>> GPU reset occurred, but it didn't affect our context.
> I disagree on that.
> 
> AMDGPU_CTX_INNOCENT_RESET just means what it does currently, there was a 
> reset but we haven't been causing it.
> 
> That the OpenGL extension is specified otherwise is unfortunate, but I 
> think we shouldn't use that for the kernel interface here.
Two counterpoints:

1. Why should any application care that there was a reset while it was 
idle? The OpenGL behavior is what makes sense.

2. AMDGPU_CTX_INNOCENT_RESET doesn't actually mean anything today 
because we never return it :)

amdgpu_ctx_query only ever returns AMDGPU_CTX_UNKNOWN_RESET, which is in 
line with the OpenGL spec: we're conservatively returning that a reset 
happened because we don't know whether the context was affected, and we 
return UNKNOWN because we also don't know whether the context was guilty 
or not.

Returning AMDGPU_CTX_NO_RESET in the case of !guilty && 
ctx->vram_lost_counter == adev->vram_lost_counter is simply a refinement 
and improvement of the current, overly conservative behavior.

Cheers,
Nicolai

> 
> Regards,
> Christian.
> 
> Am 12.10.2017 um 10:44 schrieb Nicolai HÃ¤hnle:
>> I think we should stick to the plan where kernel contexts stay "stuck" 
>> after a GPU reset. This is the most robust behavior for the kernel.
>>
>> Even if the OpenGL spec says that an OpenGL context can be re-used 
>> without destroying and re-creating it, the UMD can take care of 
>> re-creating the kernel context.
>>
>> This means amdgpu_ctx_query should *not* reset ctx->reset_counter.
>>
>> Cheers,
>> Nicolai
>>
>>
>> On 12.10.2017 10:41, Nicolai HÃ¤hnle wrote:
>>> Hi Monk,
>>>
>>> Thanks for the summary. Most of it looks good to me, though I can't 
>>> speak to all the kernel internals.
>>>
>>> Just some comments:
>>>
>>> On 12.10.2017 10:03, Liu, Monk wrote:
>>>> lFor cs_submit() IOCTL:
>>>>
>>>> 1.check if current ctx been marked â??*guilty*â??and return 
>>>> â??*ECANCELED*â??Â Â  if so.
>>>>
>>>> 2.set job->*vram_lost_counter* with adev->*vram_lost_counter*, and 
>>>> return â??*ECANCELED*â?? if ctx->*vram_lost_counter* != 
>>>> job->*vram_lost_counter* (Christian already submitted this patch)
>>>>
>>>> a)discussion: can we return â??ENODEVâ?? if vram_lost_counter mismatch ? 
>>>> that way UMD know this context is under â??device lostâ??
>>>
>>> My plan for UMD is to always query the VRAM lost counter when any 
>>> kind of context lost situation is detected. So cs_submit() should 
>>> return an error in this situation, but it could just be ECANCELED. We 
>>> don't need to distinguish between different types of errors here.
>>>
>>>
>>>> lIntroduce a new IOCTL to let UMD query latest 
>>>> adev->*vram_lost_counter*:
>>>
>>> Christian already sent a patch for this.
>>>
>>>
>>>> lFor amdgpu_ctx_query():
>>>>
>>>> n*Donâ??t update ctx->reset_counter when querying this function, 
>>>> otherwise the query result is not consistent *
>>>
>>> Hmm. I misremembered part of the spec, see below.
>>>
>>>
>>>> nSet out->state.reset_status to â??AMDGPU_CTX_GUILTY_RESETâ?? if the ctx 
>>>> is â??*guilty*â??, no need to check â??ctx->reset_counterâ??
>>>
>>> Agreed.
>>>
>>>
>>>> nSet out->state.reset_status to â??AMDGPU_CTX_INNOCENT_RESETâ?? *if the 
>>>> ctx isnâ??t â??guiltyâ?? && ctx->reset_counter != adev->reset_counter *
>>>
>>> I disagree. The meaning of AMDGPU_CTX_*_RESET should reflect the 
>>> corresponding enums in user space APIs. I don't know how it works in 
>>> Vulkan, but in OpenGL, returning GL_INNOCENT_CONTEXT_RESET_ARB means 
>>> that the context was lost.
>>>
>>> However, !guilty && ctx->reset_counter != adev->reset_counter does 
>>> not imply that the context was lost.
>>>
>>> The way I understand it, we should return AMDGPU_CTX_INNOCENT_RESET 
>>> if !guilty && ctx->vram_lost_counter != adev->vram_lost_counter.
>>>
>>> As far as I understand it, the case of !guilty && ctx->reset_counter 
>>> != adev->reset_counter && ctx->vram_lost_counter == 
>>> adev->vram_lost_counter should return AMDGPU_CTX_NO_RESET, because a 
>>> GPU reset occurred, but it didn't affect our context.
>>>
>>> I unfortunately noticed another subtlety while re-reading the OpenGL 
>>> spec. OpenGL says that the OpenGL context itself does *not* have to 
>>> be re-created in order to recover from the reset. Re-creating all 
>>> objects in the context is sufficient.
>>>
>>> I believe this is the original motivation for why amdgpu_ctx_query() 
>>> will reset the ctx->reset_counter.
>>>
>>> For Mesa, it's still okay if the kernel keeps blocking submissions as 
>>> we can just recreate the kernel context. But OrcaGL is also affected.
>>>
>>> Does anybody know off-hand where the relevant parts of the Vulkan 
>>> spec are? I didn't actually find anything in a quick search.
>>>
>>>
>>> [snip]
>>>> For UMD behavior we still have something need to consider:
>>>>
>>>> If MESA creates a new context from an old context (share list?? Iâ??m 
>>>> not familiar with UMD , David Mao shall have some discuss on it with 
>>>> Nicolai), the new created contextâ??s vram_lost_counter
>>>>
>>>> And reset_counter shall all be ported from that old context , 
>>>> otherwise CS_SUBMIT will not block it which isnâ??t correct
>>>
>>> The kernel doesn't have to do anything for this, it is entirely the 
>>> UMD's responsibility. All UMD needs from KMD is the function for 
>>> querying the vram_lost_counter.
>>>
>>> Cheers,
>>> Nicolai
>>>
>>>
>>>>
>>>> Need your feedback, thx
>>>>
>>>> *From:*amd-gfx [mailto:amd-gfx-bounces at lists.freedesktop.org] *On 
>>>> Behalf Of *Liu, Monk
>>>> *Sent:* 2017å¹´10æ??11æ?¥13:34
>>>> *To:* Koenig, Christian <Christian.Koenig at amd.com>; Haehnle, Nicolai 
>>>> <Nicolai.Haehnle at amd.com>; Olsak, Marek <Marek.Olsak at amd.com>; 
>>>> Deucher, Alexander <Alexander.Deucher at amd.com>
>>>> *Cc:* Ramirez, Alejandro <Alejandro.Ramirez at amd.com>; 
>>>> amd-gfx at lists.freedesktop.org; Filipas, Mario 
>>>> <Mario.Filipas at amd.com>; Ding, Pixel <Pixel.Ding at amd.com>; Li, 
>>>> Bingley <Bingley.Li at amd.com>; Jiang, Jerry (SW) <Jerry.Jiang at amd.com>
>>>> *Subject:* TDR and VRAM lost handling in KMD:
>>>>
>>>> Hi Christian & Nicolai,
>>>>
>>>> We need to achieve some agreements on what should MESA/UMD do and 
>>>> what should KMD do, *please give your comments with **â??okayâ??or 
>>>> â??Noâ??and your idea on below items,*
>>>>
>>>> lWhen a job timed out (set from lockup_timeout kernel parameter), 
>>>> What KMD should do in TDR routine :
>>>>
>>>> 1.Update adev->*gpu_reset_counter*, and stop scheduler first, 
>>>> (*gpu_reset_counter* is used to force vm flush after GPU reset, out 
>>>> of this threadâ??s scope so no more discussion on it)
>>>>
>>>> 2.Set its fence error status to â??*ETIME*â??,
>>>>
>>>> 3.Find the entity/ctx behind this job, and set this ctx as â??*guilty*â??
>>>>
>>>> 4.Kick out this job from schedulerâ??s mirror list, so this job wonâ??t 
>>>> get re-scheduled to ring anymore.
>>>>
>>>> 5.Kick out all jobs in this â??guiltyâ??ctxâ??s KFIFO queue, and set all 
>>>> their fence status to â??*ECANCELED*â??
>>>>
>>>> *6.*Force signal all fences that get kicked out by above two 
>>>> steps,*otherwise UMD will block forever if waiting on those fences*
>>>>
>>>> 7.Do gpu reset, which is can be some callbacks to let bare-metal and 
>>>> SR-IOV implement with their favor style
>>>>
>>>> 8.After reset, KMD need to aware if the VRAM lost happens or not, 
>>>> bare-metal can implement some function to judge, while for SR-IOV I 
>>>> prefer to read it from GIM side (for initial version we consider 
>>>> itâ??s always VRAM lost, till GIM side change aligned)
>>>>
>>>> 9.If VRAM lost not hit, continue, otherwise:
>>>>
>>>> a)Update adev->*vram_lost_counter*,
>>>>
>>>> b)Iterate over all living ctx, and set all ctx as â??*guilty*â??since 
>>>> VRAM lost actually ruins all VRAM contents
>>>>
>>>> c)Kick out all jobs in all ctxâ??s KFIFO queue, and set all their 
>>>> fence status to â??*ECANCELDED*â??
>>>>
>>>> 10.Do GTT recovery and VRAM page tables/entries recovery (optional, 
>>>> do we need it ???)
>>>>
>>>> 11.Re-schedule all JOBs remains in mirror list to ring again and 
>>>> restart scheduler (for VRAM lost case, no JOB will re-scheduled)
>>>>
>>>> lFor cs_wait() IOCTL:
>>>>
>>>> After it found fence signaled, it should check with 
>>>> *â??dma_fence_get_statusâ?? *to see if there is error there,
>>>>
>>>> And return the error status of fence
>>>>
>>>> lFor cs_wait_fences() IOCTL:
>>>>
>>>> Similar with above approach
>>>>
>>>> lFor cs_submit() IOCTL:
>>>>
>>>> It need to check if current ctx been marked as â??*guilty*â??and return 
>>>> â??*ECANCELED*â??if so
>>>>
>>>> lIntroduce a new IOCTL to let UMD query *vram_lost_counter*:
>>>>
>>>> This way, UMD can also block app from submitting, like @Nicolai 
>>>> mentioned, we can cache one copy of *vram_lost_counter* when 
>>>> enumerate physical device, and deny all
>>>>
>>>> gl-context from submitting if the counter queried bigger than that 
>>>> one cached in physical device. (looks a little overkill to me, but 
>>>> easy to implement )
>>>>
>>>> UMD can also return error to APP when creating gl-context if found 
>>>> current queried*vram_lost_counter *bigger than that one cached in 
>>>> physical device.
>>>>
>>>> BTW: I realized that gl-context is a little different with kernelâ??s 
>>>> context. Because for kernel. BO is not related with context but only 
>>>> with FD, while in UMD, BO have a backend
>>>>
>>>> gl-context, so block submitting in UMD layer is also needed although 
>>>> KMD will do its job as bottom line
>>>>
>>>> lBasically â??vram_lost_counterâ??is exposure by kernel to let UMD take 
>>>> the control of robust extension feature, it will be UMDâ??s call to 
>>>> move, KMD only deny â??guiltyâ??context from submitting
>>>>
>>>> Need your feedback, thx
>>>>
>>>> Weâ??d better make TDR feature landed ASAP
>>>>
>>>> BR Monk
>>>>
>>>
>>
>