TDR and VRAM lost handling in KMD (v2)

Monk.Liu@xxxxxxx (Liu, Monk) · Fri, 13 Oct 2017 03:57:41 +0000

>That sounds sane, but unfortunately might not be possible with the existing IOCTL. Keep in mind that we need to keep backward compatibility here.

unfortunately the current scheme on amdgpu_ctx_query() wonâ??t work with TDR feature, which is aim to support vulkan/mesa/close-ogl/radv â?¦
Itâ??s enumeration is too limited to MESA implement â?¦

Do you have good idea ?  both keep the compatibility here and give flexible ?
looks like we need to add a new amdgpu_ctx_query_2() INTERFACE â?¦.

Â·        A new IOCTL added for context:

Void amdgpu_ctx_reinit(){
        Ctxâ??vram_lost_counter = adevâ??vram_lost_counter;
        Ctxâ??reset_counter = adevâ??reset_counter;
}

Mhm, is there any advantage to just creating a new context?

[ML] sorry, this function is wrong, here is my original idea:

MESA can create a new ctx based on an old one,like:

Create gl-ctx1,
Create BO-A under gl-ctx1 â?¦
VRAM LOST
Create gl-ctx2 from gl-ctx1 (share list, Iâ??m not familiar with UMD, David Mao an Nicolai can input)
Create BO-b UNDER gl-ctx2 â?¦
Submit job upon gl-ctx2, but it can refer to BO-A,

With our design, kernel wonâ??t block submit from context2 (from gl-ctx2) since its vram_lost_counter equals to latest adev copy
But gl-ctx2 is a clone from gl-ctx1, the only difference is the vram_lost/gpu_reset counter is updated to latest one
So logically we should also block submit from gl-ctx2 (mapping to kernel context2), and we failed do so â?¦

Thatâ??s why I want to add a new â??amdgpu_ctx_cloneâ??, which should work like:

Int amdgpu_ctx_clone(struct context *original, struct context *new) {
    New->vram_lost_counter = original->vram_lost_counter;
    New->gpu_reset_counter = original->gpu_reset_counter;
}

BR Monk

From: Koenig, Christian
Sent: 2017å¹´10æ??12æ?¥ 19:50
To: Liu, Monk <Monk.Liu at amd.com>; Haehnle, Nicolai <Nicolai.Haehnle at amd.com>; Michel DÃ¤nzer <michel at daenzer.net>; Olsak, Marek <Marek.Olsak at amd.com>; Deucher, Alexander <Alexander.Deucher at amd.com>; Zhou, David(ChunMing) <David1.Zhou at amd.com>; Mao, David <David.Mao at amd.com>
Cc: Ramirez, Alejandro <Alejandro.Ramirez at amd.com>; amd-gfx at lists.freedesktop.org; Filipas, Mario <Mario.Filipas at amd.com>; Ding, Pixel <Pixel.Ding at amd.com>; Li, Bingley <Bingley.Li at amd.com>; Jiang, Jerry (SW) <Jerry.Jiang at amd.com>
Subject: Re: TDR and VRAM lost handling in KMD (v2)

Am 12.10.2017 um 13:37 schrieb Liu, Monk:
Hi team

Very good, many policy and implement are agreed, looks we only have some arguments in amdgpu_ctx_query(), well I also confused with the current implement of it, â?¹

First, I want to know if you guys agree that we don't update ctx->reset_counter in amdgpu_ctx_query() ? because I want to make the query result always consistent upon a given context,

That sounds like a good idea to me, but I'm not sure if it won't break userspace (I don't think so). Nicolai or Marek need to comment.

Second, I want to say that for KERNEL, it shouldn't use the term from MESA or OGL or VULKAN, e.g. kernel shouldn't use AMDGPU_INNOCENT_RESET to map to GL_INNOCENT_RESET_ARB, etc...
Because that way kernel will be very limited to certain UMD, so I suggest we totally re-name the context status, and each UMD has its own way to map the kernel context's result to gl-context/vk-context/etcâ?¦

Yes, completely agree.

Kernel should only provide below *FLAG bits* on a given context:

Â·        Define AMDGPU_CTX_STATE_GUILTY 0x1             //as long as TDR detects a job hang, KMD set the context behind this context as "guilty"
Â·        Define AMDGPU_CTX_STATE_VRAMLOST         0x2      //as long as there is a VRAM lost event hit after this context created, we mark this context "VRAM_LOST", so UMD can say that all BO under this context may lost their content,  since kernel have no relationship between context and BO so this is UMD's call to judge if a BO considered "VRAM lost" or not.
Â·        Define AMDGPU_CTX_STATE_RESET   0x3     //as long as there is a gpu reset occurred after context creation, this flag shall be set

That sounds sane, but unfortunately might not be possible with the existing IOCTL. Keep in mind that we need to keep backward compatibility here.

Sample code:

Int amdgpu_ctx_query(struct amdgpu_ctx_query_parm * out, â?¦..) {
        if (ctx- >vram_lost_counter != adev->vram_lost_counter)
                out- >status |= AMDGPU_CTX_STATE_VRAM_LOST;

        if (ctx- >reset_counter != adevâ??reset_counter) {
                out- >status |= AMDGPU_CTX_STATE_RESET;

                if (ctx- >guilty == TRUE)
                        out- >status |= AMDGPU_CTX_STATE_GUILTY;
        }

        return 0;
}

For UMD if it found "out.status == 0" means there is no gpu reset even occurred, the context is totally regular,

Â·        A new IOCTL added for context:

Void amdgpu_ctx_reinit(){
        Ctxâ??vram_lost_counter = adevâ??vram_lost_counter;
        Ctxâ??reset_counter = adevâ??reset_counter;
}

Mhm, is there any advantage to just creating a new context?

Regards,
Christian.

if UMD decide *not* to release the "guilty" context but continue using it after UMD acknowledged GPU hang on certain job/context, I suggest UMD call "amdgpu_ctx_reinit()":

That way after you re-init() this context, you can get updated result from "amdgpu_ctx_query", which will probably give you "out.status == 0" as long as no gpu reset/vram lost hit after re-init().

BR Monk

-----Original Message-----
From: Koenig, Christian
Sent: 2017å¹´10æ??12æ?¥ 18:13
To: Haehnle, Nicolai <Nicolai.Haehnle at amd.com><mailto:Nicolai.Haehnle at amd.com>; Michel DÃ¤nzer <michel at daenzer.net><mailto:michel at daenzer.net>; Liu, Monk <Monk.Liu at amd.com><mailto:Monk.Liu at amd.com>; Olsak, Marek <Marek.Olsak at amd.com><mailto:Marek.Olsak at amd.com>; Deucher, Alexander <Alexander.Deucher at amd.com><mailto:Alexander.Deucher at amd.com>; Zhou, David(ChunMing) <David1.Zhou at amd.com><mailto:David1.Zhou at amd.com>; Mao, David <David.Mao at amd.com><mailto:David.Mao at amd.com>
Cc: Ramirez, Alejandro <Alejandro.Ramirez at amd.com><mailto:Alejandro.Ramirez at amd.com>; amd-gfx at lists.freedesktop.org<mailto:amd-gfx at lists.freedesktop.org>; Filipas, Mario <Mario.Filipas at amd.com><mailto:Mario.Filipas at amd.com>; Ding, Pixel <Pixel.Ding at amd.com><mailto:Pixel.Ding at amd.com>; Li, Bingley <Bingley.Li at amd.com><mailto:Bingley.Li at amd.com>; Jiang, Jerry (SW) <Jerry.Jiang at amd.com><mailto:Jerry.Jiang at amd.com>
Subject: Re: TDR and VRAM lost handling in KMD (v2)

Am 12.10.2017 um 11:44 schrieb Nicolai HÃ¤hnle:
> On 12.10.2017 11:35, Michel DÃ¤nzer wrote:
>> On 12/10/17 11:23 AM, Christian KÃ¶nig wrote:
>>> Am 12.10.2017 um 11:10 schrieb Nicolai HÃ¤hnle:
>>>> On 12.10.2017 10:49, Christian KÃ¶nig wrote:
>>>>>> However, !guilty && ctx->reset_counter != adev->reset_counter
>>>>>> does not imply that the context was lost.
>>>>>>
>>>>>> The way I understand it, we should return
>>>>>> AMDGPU_CTX_INNOCENT_RESET if !guilty && ctx->vram_lost_counter != adev->vram_lost_counter.
>>>>>>
>>>>>> As far as I understand it, the case of !guilty &&
>>>>>> ctx->reset_counter != adev->reset_counter &&
>>>>>> ctx->vram_lost_counter ==
>>>>>> adev->vram_lost_counter should return AMDGPU_CTX_NO_RESET,
>>>>>> adev->because a
>>>>>> GPU reset occurred, but it didn't affect our context.
>>>>> I disagree on that.
>>>>>
>>>>> AMDGPU_CTX_INNOCENT_RESET just means what it does currently, there
>>>>> was a reset but we haven't been causing it.
>>>>>
>>>>> That the OpenGL extension is specified otherwise is unfortunate,
>>>>> but I think we shouldn't use that for the kernel interface here.
>>>> Two counterpoints:
>>>>
>>>> 1. Why should any application care that there was a reset while it
>>>> was idle? The OpenGL behavior is what makes sense.
>>>
>>> The application is certainly not interest if a reset happened or
>>> not, but I though that the driver stack might be.
>>>
>>>>
>>>> 2. AMDGPU_CTX_INNOCENT_RESET doesn't actually mean anything today
>>>> because we never return it :)
>>>>
>>>
>>> Good point.
>>>
>>>> amdgpu_ctx_query only ever returns AMDGPU_CTX_UNKNOWN_RESET, which
>>>> is in line with the OpenGL spec: we're conservatively returning
>>>> that a reset happened because we don't know whether the context was
>>>> affected, and we return UNKNOWN because we also don't know whether
>>>> the context was guilty or not.
>>>>
>>>> Returning AMDGPU_CTX_NO_RESET in the case of !guilty &&
>>>> ctx->vram_lost_counter == adev->vram_lost_counter is simply a
>>>> refinement and improvement of the current, overly conservative
>>>> behavior.
>>>
>>> Ok let's reenumerate what I think the different return values should
>>> mean:
>>>
>>> * AMDGPU_CTX_GUILTY_RESET
>>>
>>> guilty is set to true for this context.
>>>
>>> * AMDGPU_CTX_INNOCENT_RESET
>>>
>>> guilty is not set and vram_lost_counter has changed.
>>>
>>> * AMDGPU_CTX_UNKNOWN_RESET
>>>
>>> guilty is not set and vram_lost_counter has not changed, but
>>> gpu_reset_counter has changed.
>>
>> I don't understand the distinction you're proposing between
>> AMDGPU_CTX_INNOCENT_RESET and AMDGPU_CTX_UNKNOWN_RESET. I think both
>> cases you're describing should return either
>> AMDGPU_CTX_INNOCENT_RESET, if the value of guilty is reliable, or
>> AMDGPU_CTX_UNKNOWN_RESET if it's not.
>
> I think it'd make more sense if it was called
> "AMDGPU_CTX_UNAFFECTED_RESET".
>
> So:
> - AMDGPU_CTX_GUILTY_RESET --> the context was affected by a reset, and
> we know that it's the context's fault
> - AMDGPU_CTX_INNOCENT_RESET --> the context was affected by a reset,
> and we know that it *wasn't* the context's fault (no context job
> active)
> - AMDGPU_CTX_UNKNOWN_RESET --> the context was affected by a reset,
> and we don't know who's responsible (this could be returned in the
> unlikely case where context A's gfx job has not yet finished, but
> context B's gfx job has already started; it could be the fault of A,
> it could be the fault of B -- which somehow manages to hang a part of
> the hardware that then prevents A's job from finishing -- or it could
> be both; but it's a bit academic)
> - AMDGPU_CTX_UNAFFECTED_RESET --> there was a reset, but this context
> wasn't affected
>
> This last value would currently just be discarded by Mesa (because we
> should only disturb applications when we have to), but perhaps
> somebody else could find it useful?

Yes, that's what I had in mind as well.

Cause otherwise we would return AMDGPU_CTX_NO_RESET while there actually was a reset and that certainly doesn't sound correct to me.

Regards,
Christian.

>
> Cheers,
> Nicolai

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20171013/26b19de5/attachment-0001.html>