TDR and VRAM lost handling in KMD (v2)

Monk.Liu@xxxxxxx (Liu, Monk) · Thu, 12 Oct 2017 11:37:44 +0000

Hi team

Very good, many policy and implement are agreed, looks we only have some arguments in amdgpu_ctx_query(), well I also confused with the current implement of it, â?¹

First, I want to know if you guys agree that we don't update ctx->reset_counter in amdgpu_ctx_query() ? because I want to make the query result always consistent upon a given context,

Second, I want to say that for KERNEL, it shouldn't use the term from MESA or OGL or VULKAN, e.g. kernel shouldn't use AMDGPU_INNOCENT_RESET to map to GL_INNOCENT_RESET_ARB, etc...
Because that way kernel will be very limited to certain UMD, so I suggest we totally re-name the context status, and each UMD has its own way to map the kernel context's result to gl-context/vk-context/etcâ?¦

Kernel should only provide below *FLAG bits* on a given context:

â?¢       Define AMDGPU_CTX_STATE_GUILTY  0x1             //as long as TDR detects a job hang, KMD set the context behind this context as "guilty"
â?¢       Define AMDGPU_CTX_STATE_VRAMLOST        0x2     //as long as there is a VRAM lost event hit after this context created, we mark this context "VRAM_LOST", so UMD can say that all BO under this context may lost their content,  since kernel have no relationship between context and BO so this is UMD's call to judge if a BO considered "VRAM lost" or not.
â?¢       Define AMDGPU_CTX_STATE_RESET   0x3     //as long as there is a gpu reset occurred after context creation, this flag shall be set

Sample code:

Int amdgpu_ctx_query(struct amdgpu_ctx_query_parm * out, â?¦..) {
        if (ctx- >vram_lost_counter != adev->vram_lost_counter)
                out- >status |= AMDGPU_CTX_STATE_VRAM_LOST;

        if (ctx- >reset_counter != adevâ??reset_counter) {
                out- >status |= AMDGPU_CTX_STATE_RESET;

                if (ctx- >guilty == TRUE)
                        out- >status |= AMDGPU_CTX_STATE_GUILTY;
        }

        return 0;
}

For UMD if it found "out.status == 0" means there is no gpu reset even occurred, the context is totally regular,

â?¢       A new IOCTL added for context:

Void amdgpu_ctx_reinit(){
        Ctxâ??vram_lost_counter = adevâ??vram_lost_counter;
        Ctxâ??reset_counter = adevâ??reset_counter;
}

if UMD decide *not* to release the "guilty" context but continue using it after UMD acknowledged GPU hang on certain job/context, I suggest UMD call "amdgpu_ctx_reinit()":

That way after you re-init() this context, you can get updated result from "amdgpu_ctx_query", which will probably give you "out.status == 0" as long as no gpu reset/vram lost hit after re-init().

BR Monk

-----Original Message-----
From: Koenig, Christian
Sent: 2017å¹´10æ??12æ?¥ 18:13
To: Haehnle, Nicolai <Nicolai.Haehnle at amd.com>; Michel DÃ¤nzer <michel at daenzer.net>; Liu, Monk <Monk.Liu at amd.com>; Olsak, Marek <Marek.Olsak at amd.com>; Deucher, Alexander <Alexander.Deucher at amd.com>; Zhou, David(ChunMing) <David1.Zhou at amd.com>; Mao, David <David.Mao at amd.com>
Cc: Ramirez, Alejandro <Alejandro.Ramirez at amd.com>; amd-gfx at lists.freedesktop.org; Filipas, Mario <Mario.Filipas at amd.com>; Ding, Pixel <Pixel.Ding at amd.com>; Li, Bingley <Bingley.Li at amd.com>; Jiang, Jerry (SW) <Jerry.Jiang at amd.com>
Subject: Re: TDR and VRAM lost handling in KMD (v2)

Am 12.10.2017 um 11:44 schrieb Nicolai HÃ¤hnle:
> On 12.10.2017 11:35, Michel DÃ¤nzer wrote:
>> On 12/10/17 11:23 AM, Christian KÃ¶nig wrote:
>>> Am 12.10.2017 um 11:10 schrieb Nicolai HÃ¤hnle:
>>>> On 12.10.2017 10:49, Christian KÃ¶nig wrote:
>>>>>> However, !guilty && ctx->reset_counter != adev->reset_counter
>>>>>> does not imply that the context was lost.
>>>>>>
>>>>>> The way I understand it, we should return
>>>>>> AMDGPU_CTX_INNOCENT_RESET if !guilty && ctx->vram_lost_counter != adev->vram_lost_counter.
>>>>>>
>>>>>> As far as I understand it, the case of !guilty &&
>>>>>> ctx->reset_counter != adev->reset_counter &&
>>>>>> ctx->vram_lost_counter ==
>>>>>> adev->vram_lost_counter should return AMDGPU_CTX_NO_RESET,
>>>>>> adev->because a
>>>>>> GPU reset occurred, but it didn't affect our context.
>>>>> I disagree on that.
>>>>>
>>>>> AMDGPU_CTX_INNOCENT_RESET just means what it does currently, there
>>>>> was a reset but we haven't been causing it.
>>>>>
>>>>> That the OpenGL extension is specified otherwise is unfortunate,
>>>>> but I think we shouldn't use that for the kernel interface here.
>>>> Two counterpoints:
>>>>
>>>> 1. Why should any application care that there was a reset while it
>>>> was idle? The OpenGL behavior is what makes sense.
>>>
>>> The application is certainly not interest if a reset happened or
>>> not, but I though that the driver stack might be.
>>>
>>>>
>>>> 2. AMDGPU_CTX_INNOCENT_RESET doesn't actually mean anything today
>>>> because we never return it :)
>>>>
>>>
>>> Good point.
>>>
>>>> amdgpu_ctx_query only ever returns AMDGPU_CTX_UNKNOWN_RESET, which
>>>> is in line with the OpenGL spec: we're conservatively returning
>>>> that a reset happened because we don't know whether the context was
>>>> affected, and we return UNKNOWN because we also don't know whether
>>>> the context was guilty or not.
>>>>
>>>> Returning AMDGPU_CTX_NO_RESET in the case of !guilty &&
>>>> ctx->vram_lost_counter == adev->vram_lost_counter is simply a
>>>> refinement and improvement of the current, overly conservative
>>>> behavior.
>>>
>>> Ok let's reenumerate what I think the different return values should
>>> mean:
>>>
>>> * AMDGPU_CTX_GUILTY_RESET
>>>
>>> guilty is set to true for this context.
>>>
>>> * AMDGPU_CTX_INNOCENT_RESET
>>>
>>> guilty is not set and vram_lost_counter has changed.
>>>
>>> * AMDGPU_CTX_UNKNOWN_RESET
>>>
>>> guilty is not set and vram_lost_counter has not changed, but
>>> gpu_reset_counter has changed.
>>
>> I don't understand the distinction you're proposing between
>> AMDGPU_CTX_INNOCENT_RESET and AMDGPU_CTX_UNKNOWN_RESET. I think both
>> cases you're describing should return either
>> AMDGPU_CTX_INNOCENT_RESET, if the value of guilty is reliable, or
>> AMDGPU_CTX_UNKNOWN_RESET if it's not.
>
> I think it'd make more sense if it was called
> "AMDGPU_CTX_UNAFFECTED_RESET".
>
> So:
> - AMDGPU_CTX_GUILTY_RESET --> the context was affected by a reset, and
> we know that it's the context's fault
> - AMDGPU_CTX_INNOCENT_RESET --> the context was affected by a reset,
> and we know that it *wasn't* the context's fault (no context job
> active)
> - AMDGPU_CTX_UNKNOWN_RESET --> the context was affected by a reset,
> and we don't know who's responsible (this could be returned in the
> unlikely case where context A's gfx job has not yet finished, but
> context B's gfx job has already started; it could be the fault of A,
> it could be the fault of B -- which somehow manages to hang a part of
> the hardware that then prevents A's job from finishing -- or it could
> be both; but it's a bit academic)
> - AMDGPU_CTX_UNAFFECTED_RESET --> there was a reset, but this context
> wasn't affected
>
> This last value would currently just be discarded by Mesa (because we
> should only disturb applications when we have to), but perhaps
> somebody else could find it useful?

Yes, that's what I had in mind as well.

Cause otherwise we would return AMDGPU_CTX_NO_RESET while there actually was a reset and that certainly doesn't sound correct to me.

Regards,
Christian.

>
> Cheers,
> Nicolai

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20171012/be67624a/attachment-0001.html>