On 12.10.2017 11:10, Nicolai Hähnle wrote:
> On 12.10.2017 10:49, Christian König wrote:
>>> However, !guilty && ctx->reset_counter != adev->reset_counter does
>>> not imply that the context was lost.
>>>
>>> The way I understand it, we should return AMDGPU_CTX_INNOCENT_RESET
>>> if !guilty && ctx->vram_lost_counter != adev->vram_lost_counter.
>>>
>>> As far as I understand it, the case of !guilty && ctx->reset_counter
>>> != adev->reset_counter && ctx->vram_lost_counter ==
>>> adev->vram_lost_counter should return AMDGPU_CTX_NO_RESET, because a
>>> GPU reset occurred, but it didn't affect our context.
>> I disagree on that.
>>
>> AMDGPU_CTX_INNOCENT_RESET just means what it does currently: there
>> was a reset, but we weren't the cause of it.
>>
>> That the OpenGL extension is specified otherwise is unfortunate, but
>> I think we shouldn't use that for the kernel interface here.
> Two counterpoints:
>
> 1. Why should any application care that there was a reset while it
> was idle? The OpenGL behavior is what makes sense.

The application is certainly not interested in whether a reset happened
or not, but I thought that the driver stack might be.

> 2. AMDGPU_CTX_INNOCENT_RESET doesn't actually mean anything today
> because we never return it :)

Good point.

> amdgpu_ctx_query only ever returns AMDGPU_CTX_UNKNOWN_RESET, which is
> in line with the OpenGL spec: we're conservatively returning that a
> reset happened because we don't know whether the context was affected,
> and we return UNKNOWN because we also don't know whether the context
> was guilty or not.
>
> Returning AMDGPU_CTX_NO_RESET in the case of !guilty &&
> ctx->vram_lost_counter == adev->vram_lost_counter is simply a
> refinement and improvement of the current, overly conservative
> behavior.

OK, let's re-enumerate what I think the different return values should
mean:

* AMDGPU_CTX_GUILTY_RESET: guilty is set to true for this context.

* AMDGPU_CTX_INNOCENT_RESET: guilty is not set and vram_lost_counter
has changed.

* AMDGPU_CTX_UNKNOWN_RESET: guilty is not set and vram_lost_counter has
not changed, but gpu_reset_counter has changed.

* AMDGPU_CTX_NO_RESET: there wasn't a reset, i.e. neither guilty is
set, nor has gpu_reset_counter or vram_lost_counter changed.

If we get to the point where we are certain that a GPU reset without
losing VRAM is harmless, we can drop using gpu_reset_counter and also
return AMDGPU_CTX_NO_RESET when we are certain that the application
didn't do anything nasty. But until then I would rather say we should
keep this.
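
In code, the query would then boil down to something like the
following. Completely untested sketch, just to illustrate the idea; I'm
assuming a ctx->guilty flag and atomic counters here, the actual field
names and types may well end up looking different:

    static uint32_t amdgpu_ctx_reset_status(struct amdgpu_device *adev,
                                            struct amdgpu_ctx *ctx)
    {
            /* Guilty always wins, this context caused the reset. */
            if (ctx->guilty)
                    return AMDGPU_CTX_GUILTY_RESET;

            /* VRAM contents are gone, so the context is lost even
             * though it is innocent. */
            if (ctx->vram_lost_counter !=
                atomic_read(&adev->vram_lost_counter))
                    return AMDGPU_CTX_INNOCENT_RESET;

            /* A reset happened, but we can't tell whether it
             * affected this context. */
            if (ctx->reset_counter !=
                atomic_read(&adev->gpu_reset_counter))
                    return AMDGPU_CTX_UNKNOWN_RESET;

            return AMDGPU_CTX_NO_RESET;
    }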
Regards,
Christian.

> Cheers,
> Nicolai
>
>> Regards,
>> Christian.
>>
>> On 12.10.2017 10:44, Nicolai Hähnle wrote:
>>> I think we should stick to the plan where kernel contexts stay
>>> "stuck" after a GPU reset. This is the most robust behavior for the
>>> kernel.
>>>
>>> Even if the OpenGL spec says that an OpenGL context can be re-used
>>> without destroying and re-creating it, the UMD can take care of
>>> re-creating the kernel context.
>>>
>>> This means amdgpu_ctx_query should *not* reset ctx->reset_counter.
>>>
>>> Cheers,
>>> Nicolai
>>>
>>> On 12.10.2017 10:41, Nicolai Hähnle wrote:
>>>> Hi Monk,
>>>>
>>>> Thanks for the summary. Most of it looks good to me, though I
>>>> can't speak to all the kernel internals.
>>>>
>>>> Just some comments:
>>>>
>>>> On 12.10.2017 10:03, Liu, Monk wrote:
>>>>> - For the cs_submit() IOCTL:
>>>>>
>>>>> 1. Check if the current ctx has been marked "guilty" and return
>>>>> "ECANCELED" if so.
>>>>>
>>>>> 2. Set job->vram_lost_counter from adev->vram_lost_counter, and
>>>>> return "ECANCELED" if ctx->vram_lost_counter !=
>>>>> job->vram_lost_counter (Christian already submitted this patch).
>>>>>
>>>>> a) Discussion: can we return "ENODEV" if vram_lost_counter
>>>>> mismatches? That way UMD knows this context is under "device
>>>>> lost".
>>>>
>>>> My plan for UMD is to always query the VRAM lost counter when any
>>>> kind of context lost situation is detected. So cs_submit() should
>>>> return an error in this situation, but it could just be ECANCELED.
>>>> We don't need to distinguish between different types of errors
>>>> here.
>>>>
>>>>> - Introduce a new IOCTL to let UMD query the latest
>>>>> adev->vram_lost_counter:
>>>>
>>>> Christian already sent a patch for this.
>>>>
>>>>> - For amdgpu_ctx_query():
>>>>>
>>>>> * Don't update ctx->reset_counter when querying this function,
>>>>> otherwise the query result is not consistent.
>>>>
>>>> Hmm. I misremembered part of the spec, see below.
>>>>
>>>>> * Set out->state.reset_status to "AMDGPU_CTX_GUILTY_RESET" if the
>>>>> ctx is "guilty"; no need to check "ctx->reset_counter".
>>>>
>>>> Agreed.
>>>>
>>>>> * Set out->state.reset_status to "AMDGPU_CTX_INNOCENT_RESET" if
>>>>> the ctx isn't "guilty" && ctx->reset_counter !=
>>>>> adev->reset_counter.
>>>>
>>>> I disagree. The meaning of AMDGPU_CTX_*_RESET should reflect the
>>>> corresponding enums in user space APIs. I don't know how it works
>>>> in Vulkan, but in OpenGL, returning GL_INNOCENT_CONTEXT_RESET_ARB
>>>> means that the context was lost.
>>>>
>>>> However, !guilty && ctx->reset_counter != adev->reset_counter does
>>>> not imply that the context was lost.
>>>>
>>>> The way I understand it, we should return
>>>> AMDGPU_CTX_INNOCENT_RESET if !guilty && ctx->vram_lost_counter !=
>>>> adev->vram_lost_counter.
>>>>
>>>> As far as I understand it, the case of !guilty &&
>>>> ctx->reset_counter != adev->reset_counter &&
>>>> ctx->vram_lost_counter == adev->vram_lost_counter should return
>>>> AMDGPU_CTX_NO_RESET, because a GPU reset occurred, but it didn't
>>>> affect our context.
>>>>
>>>> I unfortunately noticed another subtlety while re-reading the
>>>> OpenGL spec. OpenGL says that the OpenGL context itself does *not*
>>>> have to be re-created in order to recover from the reset.
>>>> Re-creating all objects in the context is sufficient.
>>>>
>>>> I believe this is the original motivation for why
>>>> amdgpu_ctx_query() will reset the ctx->reset_counter.
>>>>
>>>> For Mesa, it's still okay if the kernel keeps blocking submissions,
>>>> as we can just recreate the kernel context. But OrcaGL is also
>>>> affected.
>>>>
>>>> Does anybody know off-hand where the relevant parts of the Vulkan
>>>> spec are? I didn't actually find anything in a quick search.
>>>>
>>>> [snip]
>>>>> For UMD behavior we still have something to consider:
>>>>>
>>>>> If MESA creates a new context from an old context (share list?
>>>>> I'm not familiar with UMD; David Mao shall have some discussion
>>>>> on it with Nicolai), the newly created context's vram_lost_counter
>>>>> and reset_counter shall all be ported over from that old context,
>>>>> otherwise CS_SUBMIT will not block it, which isn't correct.
>>>>
>>>> The kernel doesn't have to do anything for this, it is entirely
>>>> the UMD's responsibility. All UMD needs from KMD is the function
>>>> for querying the vram_lost_counter.
>>>>
>>>> Cheers,
>>>> Nicolai
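
As an illustration of that UMD side, the check could look roughly like
the sketch below. This is hypothetical: it assumes the new INFO query
ends up being called AMDGPU_INFO_VRAM_LOST_COUNTER and is reached
through libdrm's amdgpu_query_info(); the umd_ctx struct and the helper
are made up for illustration:

    #include <stdbool.h>
    #include <stdint.h>
    #include <amdgpu.h>
    #include <amdgpu_drm.h>

    struct umd_ctx {
            amdgpu_context_handle handle;
            /* Sampled at context creation, or inherited from the
             * share-list context this one was created from. */
            uint32_t vram_lost_counter;
    };

    static bool umd_ctx_usable(amdgpu_device_handle dev,
                               struct umd_ctx *ctx)
    {
            uint32_t now = 0;

            if (amdgpu_query_info(dev, AMDGPU_INFO_VRAM_LOST_COUNTER,
                                  sizeof(now), &now))
                    return false; /* be conservative if the query fails */

            return ctx->vram_lost_counter == now;
    }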
>>>>>
>>>>> Need your feedback, thx
>>>>>
>>>>> *From:* amd-gfx [mailto:amd-gfx-bounces at lists.freedesktop.org]
>>>>> *On Behalf Of* Liu, Monk
>>>>> *Sent:* 2017-10-11 13:34
>>>>> *To:* Koenig, Christian <Christian.Koenig at amd.com>; Haehnle,
>>>>> Nicolai <Nicolai.Haehnle at amd.com>; Olsak, Marek
>>>>> <Marek.Olsak at amd.com>; Deucher, Alexander
>>>>> <Alexander.Deucher at amd.com>
>>>>> *Cc:* Ramirez, Alejandro <Alejandro.Ramirez at amd.com>;
>>>>> amd-gfx at lists.freedesktop.org; Filipas, Mario
>>>>> <Mario.Filipas at amd.com>; Ding, Pixel <Pixel.Ding at amd.com>; Li,
>>>>> Bingley <Bingley.Li at amd.com>; Jiang, Jerry (SW)
>>>>> <Jerry.Jiang at amd.com>
>>>>> *Subject:* TDR and VRAM lost handling in KMD:
>>>>>
>>>>> Hi Christian & Nicolai,
>>>>>
>>>>> We need to reach agreement on what MESA/UMD should do and what
>>>>> KMD should do. Please give your comments with "okay" or "no", and
>>>>> your ideas, on the items below.
>>>>>
>>>>> - When a job times out (set from the lockup_timeout kernel
>>>>> parameter), what KMD should do in the TDR routine:
>>>>>
>>>>> 1. Update adev->gpu_reset_counter and stop the scheduler first
>>>>> (gpu_reset_counter is used to force a VM flush after GPU reset;
>>>>> it is out of this thread's scope, so no more discussion on it).
>>>>>
>>>>> 2. Set its fence error status to "ETIME".
>>>>>
>>>>> 3. Find the entity/ctx behind this job, and mark this ctx as
>>>>> "guilty".
>>>>>
>>>>> 4. Kick this job out of the scheduler's mirror list, so it won't
>>>>> get re-scheduled to the ring anymore.
>>>>>
>>>>> 5. Kick out all jobs in this "guilty" ctx's KFIFO queue, and set
>>>>> all their fence status to "ECANCELED".
>>>>>
>>>>> 6. Force-signal all fences that got kicked out by the above two
>>>>> steps, otherwise UMD will block forever if it waits on those
>>>>> fences.
>>>>>
>>>>> 7. Do the GPU reset, which can be some callbacks so that
>>>>> bare-metal and SR-IOV can each implement it in their favored
>>>>> style.
>>>>>
>>>>> 8. After the reset, KMD needs to be aware of whether VRAM was
>>>>> lost or not. Bare-metal can implement some function to judge
>>>>> this, while for SR-IOV I prefer to read it from the GIM side (for
>>>>> the initial version we consider VRAM always lost, until the GIM
>>>>> side change is aligned).
>>>>>
>>>>> 9. If VRAM loss was not hit, continue; otherwise:
>>>>>
>>>>> a) Update adev->vram_lost_counter.
>>>>>
>>>>> b) Iterate over all living ctx, and mark them all as "guilty",
>>>>> since VRAM loss actually ruins all VRAM contents.
>>>>>
>>>>> c) Kick out all jobs in all ctx's KFIFO queues, and set all their
>>>>> fence status to "ECANCELED".
>>>>>
>>>>> 10. Do GTT recovery and VRAM page tables/entries recovery
>>>>> (optional, do we need it?).
>>>>>
>>>>> 11. Re-schedule all jobs remaining in the mirror list to the ring
>>>>> again and restart the scheduler (for the VRAM lost case, no job
>>>>> will be re-scheduled).
>>>>>
>>>>> - For the cs_wait() IOCTL:
>>>>>
>>>>> After it finds the fence signaled, it should check with
>>>>> "dma_fence_get_status" to see if there is an error there, and
>>>>> return the error status of the fence.
>>>>>
>>>>> - For the cs_wait_fences() IOCTL:
>>>>>
>>>>> Similar to the above approach.
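
For what it's worth, the cs_wait() change described above could look
roughly like the following untested sketch. dma_fence_wait_timeout()
and dma_fence_get_status() are the existing fence APIs; the wrapper
name and the surrounding code are made up:

    #include <linux/dma-fence.h>

    static long wait_and_check_fence(struct dma_fence *fence,
                                     long timeout)
    {
            long r = dma_fence_wait_timeout(fence, true, timeout);

            if (r > 0) {
                    /* The fence signaled; check whether an error such
                     * as -ETIME or -ECANCELED was set on it by the TDR
                     * code and propagate that instead of success. */
                    int status = dma_fence_get_status(fence);

                    if (status < 0)
                            r = status;
            }

            return r;
    }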
>>>>>
>>>>> - For the cs_submit() IOCTL:
>>>>>
>>>>> It needs to check whether the current ctx has been marked
>>>>> "guilty" and return "ECANCELED" if so.
>>>>>
>>>>> - Introduce a new IOCTL to let UMD query vram_lost_counter:
>>>>>
>>>>> This way UMD can also block the app from submitting. Like
>>>>> @Nicolai mentioned, we can cache one copy of vram_lost_counter
>>>>> when enumerating the physical device, and deny all gl-contexts
>>>>> from submitting if the queried counter is bigger than the one
>>>>> cached in the physical device (looks a little overkill to me, but
>>>>> easy to implement).
>>>>>
>>>>> UMD can also return an error to the app when creating a
>>>>> gl-context if the currently queried vram_lost_counter is bigger
>>>>> than the one cached in the physical device.
>>>>>
>>>>> BTW: I realized that a gl-context is a little different from the
>>>>> kernel's context, because for the kernel a BO is not related to a
>>>>> context but only to an FD, while in UMD a BO has a backing
>>>>> gl-context. So blocking submission in the UMD layer is also
>>>>> needed, although KMD will do its job as the bottom line.
>>>>>
>>>>> - Basically, "vram_lost_counter" is exposed by the kernel to let
>>>>> UMD take control of the robustness extension feature. It will be
>>>>> UMD's call to make; KMD only denies "guilty" contexts from
>>>>> submitting.
>>>>>
>>>>> Need your feedback, thx
>>>>>
>>>>> We'd better get the TDR feature landed ASAP.
>>>>>
>>>>> BR Monk
>>>>>
>>>>
>>>
>>
>