TDR and VRAM lost handling in KMD (v2)

nicolai.haehnle@xxxxxxx (Nicolai Hähnle) · Fri, 13 Oct 2017 09:43:04 +0200

On 13.10.2017 05:57, Liu, Monk wrote:
>>That sounds sane, but unfortunately might not be possible with the existing IOCTL. Keep in mind that we need to keep backward compatibility here.
> 
> unfortunately the current scheme on amdgpu_ctx_query() wonâ??t work with 
> TDR feature, which is aim to support vulkan/mesa/close-ogl/radv â?¦
> 
> Itâ??s enumeration is too limited to MESA implement â?¦
> 
> *Do you have good idea ?Â  both keep the compatibility here and give 
> flexible ?*
> 
> *looks like we need to add a new amdgpu_ctx_query_2() INTERFACE â?¦.*
> 
> Â·*A new IOCTL added for context:*
> 
> Void amdgpu_ctx_reinit(){
> 
>  Â Â Â Â Â Â Â  Ctxâ??vram_lost_counter = adevâ??vram_lost_counter;
> 
>  Â Â Â Â Â Â Â  Ctxâ??reset_counter = adevâ??reset_counter;
> 
> }
> 
> 
> Mhm, is there any advantage to just creating a new context?
> 
> [ML] sorry, this function is wrong, here is my original idea:
> 
> MESA can create a new ctx based on an old one,like:
> 
> Create gl-ctx1,
> 
> Create BO-A under gl-ctx1 â?¦
> 
> VRAM LOST
> 
> Create gl-ctx2 from gl-ctx1 (share list, Iâ??m not familiar with UMD, 
> David Mao an Nicolai can input)
> 
> Create BO-b UNDER gl-ctx2 â?¦
> 
> Submit job upon gl-ctx2, but it can refer to BO-A,
> 
> With our design, kernel wonâ??t block submit from context2 (from gl-ctx2) 
> since its vram_lost_counter equals to latest adev copy
> 
> But gl-ctx2 is a clone from gl-ctx1, the only difference is the 
> vram_lost/gpu_reset counter is updated to latest one
> 
> So logically we should also block submit from gl-ctx2 (mapping to kernel 
> context2), and we failed do so â?¦
> 
> Thatâ??s why I want to add a new â??amdgpu_ctx_cloneâ??, which should work like:
> 
> Int amdgpu_ctx_clone(struct context *original, struct context *new) {
> 
>  Â Â Â  New->vram_lost_counter = original->vram_lost_counter;
> 
>  Â Â Â  New->gpu_reset_counter = original->gpu_reset_counter;
> 
> }

 From the Mesa perspective, I don't think we would need to use that as 
long as we can query the device VRAM lost counter.

Personally, I think it's better for the UMD to build its policy on 
lower-level primitives such as the VRAM lost counter query.

Cheers,
Nicolai

> 
> BR Monk
> 
> *From:*Koenig, Christian
> *Sent:* 2017å¹´10æ??12æ?¥19:50
> *To:* Liu, Monk <Monk.Liu at amd.com>; Haehnle, Nicolai 
> <Nicolai.Haehnle at amd.com>; Michel DÃ¤nzer <michel at daenzer.net>; Olsak, 
> Marek <Marek.Olsak at amd.com>; Deucher, Alexander 
> <Alexander.Deucher at amd.com>; Zhou, David(ChunMing) 
> <David1.Zhou at amd.com>; Mao, David <David.Mao at amd.com>
> *Cc:* Ramirez, Alejandro <Alejandro.Ramirez at amd.com>; 
> amd-gfx at lists.freedesktop.org; Filipas, Mario <Mario.Filipas at amd.com>; 
> Ding, Pixel <Pixel.Ding at amd.com>; Li, Bingley <Bingley.Li at amd.com>; 
> Jiang, Jerry (SW) <Jerry.Jiang at amd.com>
> *Subject:* Re: TDR and VRAM lost handling in KMD (v2)
> 
> Am 12.10.2017 um 13:37 schrieb Liu, Monk:
> 
>     Hi team
> 
>     Very good, many policy and implement are agreed, looks we only have
>     some arguments in amdgpu_ctx_query(), well I also confused with the
>     current implement of it, â?¹
> 
>     First, I want to know if you guys agree that we*don't update
>     ctx->reset_counter in amdgpu_ctx_query() *? because I want to make
>     the query result always consistent upon a given context,
> 
> 
> That sounds like a good idea to me, but I'm not sure if it won't break 
> userspace (I don't think so). Nicolai or Marek need to comment.
> 
> 
>     Second, I want to say that for KERNEL, it shouldn't use the term
>     from MESA or OGL or VULKAN, e.g. kernel shouldn't use
>     AMDGPU_INNOCENT_RESET to map to GL_INNOCENT_RESET_ARB, etc...
> 
>     Because that way kernel will be very limited to certain UMD, so I
>     suggest we totally re-name the context status, and each UMD has its
>     own way to map the kernel context's result to gl-context/vk-context/etcâ?¦
> 
> 
> Yes, completely agree.
> 
> 
>     Kernel should only provide below **FLAG bits** on a given context:
> 
>     Â·Define AMDGPU_CTX_STATE_GUILTY 0x1Â Â Â Â Â Â Â Â Â Â Â Â  //as long as TDR
>     detects a job hang, KMD set the context behind this context as "guilty"
> 
>     Â·Define AMDGPU_CTX_STATE_VRAMLOSTÂ Â Â Â Â Â Â Â  0x2Â Â Â Â Â  //as long as
>     there is a VRAM lost event hit after this context created, we mark
>     this context "VRAM_LOST", so UMD can say that all BO under this
>     context may lost their content,Â  since kernel have no relationship
>     between context and BO so this is UMD's call to judge if a BO
>     considered "VRAM lost" or not.
> 
>     Â·Define AMDGPU_CTX_STATE_RESETÂ Â  0x3Â Â Â Â  //as long as there is a gpu
>     reset occurred after context creation, this flag shall be set
> 
> 
> That sounds sane, but unfortunately might not be possible with the 
> existing IOCTL. Keep in mind that we need to keep backward compatibility 
> here.
> 
> 
>     Sample code:
> 
>     Int amdgpu_ctx_query(struct amdgpu_ctx_query_parm * out, â?¦..) {
> 
>      Â Â Â Â Â Â Â  if (ctx- >vram_lost_counter != adev->vram_lost_counter)
> 
>      Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â  out- >status |= AMDGPU_CTX_STATE_VRAM_LOST;
> 
>     if (ctx- >reset_counter != adevâ??reset_counter) {
> 
>     out- >status |= AMDGPU_CTX_STATE_RESET;
> 
>     if (ctx- >guilty == TRUE)
> 
>      Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â  out- >status |= AMDGPU_CTX_STATE_GUILTY;
> 
>     }
> 
>     return 0;
> 
>     }
> 
>     For UMD if it found "out.status == 0" means there is no gpu reset
>     even occurred, the context is totally regular,
> 
>     Â·*A new IOCTL added for context:*
> 
>     Void amdgpu_ctx_reinit(){
> 
>      Â Â Â Â Â Â Â  Ctxâ??vram_lost_counter = adevâ??vram_lost_counter;
> 
>      Â Â Â Â Â Â Â  Ctxâ??reset_counter = adevâ??reset_counter;
> 
>     }
> 
> 
> Mhm, is there any advantage to just creating a new context?
> 
> Regards,
> Christian.
> 
> 
>     *if UMD decide *not* to release the "guilty" context but continue
>     using it after UMD acknowledged GPU hang on certain job/context, I
>     suggest UMD call "amdgpu_ctx_reinit()":*
> 
>     That way after you re-init() this context, you can get updated
>     result from "amdgpu_ctx_query", which will probably give you
>     "out.status == 0" as long as no gpu reset/vram lost hit after re-init().
> 
>     BR Monk
> 
>     -----Original Message-----
>     From: Koenig, Christian
>     Sent: 2017å¹´10æ??12æ?¥18:13
>     To: Haehnle, Nicolai <Nicolai.Haehnle at amd.com>
>     <mailto:Nicolai.Haehnle at amd.com>; Michel DÃ¤nzer <michel at daenzer.net>
>     <mailto:michel at daenzer.net>; Liu, Monk <Monk.Liu at amd.com>
>     <mailto:Monk.Liu at amd.com>; Olsak, Marek <Marek.Olsak at amd.com>
>     <mailto:Marek.Olsak at amd.com>; Deucher, Alexander
>     <Alexander.Deucher at amd.com> <mailto:Alexander.Deucher at amd.com>;
>     Zhou, David(ChunMing) <David1.Zhou at amd.com>
>     <mailto:David1.Zhou at amd.com>; Mao, David <David.Mao at amd.com>
>     <mailto:David.Mao at amd.com>
>     Cc: Ramirez, Alejandro <Alejandro.Ramirez at amd.com>
>     <mailto:Alejandro.Ramirez at amd.com>; amd-gfx at lists.freedesktop.org
>     <mailto:amd-gfx at lists.freedesktop.org>; Filipas, Mario
>     <Mario.Filipas at amd.com> <mailto:Mario.Filipas at amd.com>; Ding, Pixel
>     <Pixel.Ding at amd.com> <mailto:Pixel.Ding at amd.com>; Li, Bingley
>     <Bingley.Li at amd.com> <mailto:Bingley.Li at amd.com>; Jiang, Jerry (SW)
>     <Jerry.Jiang at amd.com> <mailto:Jerry.Jiang at amd.com>
>     Subject: Re: TDR and VRAM lost handling in KMD (v2)
> 
>     Am 12.10.2017 um 11:44 schrieb Nicolai HÃ¤hnle:
> 
>     > On 12.10.2017 11:35, Michel DÃ¤nzer wrote:
> 
>     >> On 12/10/17 11:23 AM, Christian KÃ¶nig wrote:
> 
>     >>> Am 12.10.2017 um 11:10 schrieb Nicolai HÃ¤hnle:
> 
>     >>>> On 12.10.2017 10:49, Christian KÃ¶nig wrote:
> 
>     >>>>>> However, !guilty && ctx->reset_counter != adev->reset_counter
> 
>     >>>>>> does not imply that the context was lost.
> 
>     >>>>>>
> 
>     >>>>>> The way I understand it, we should return
> 
>     >>>>>> AMDGPU_CTX_INNOCENT_RESET if !guilty && ctx->vram_lost_counter != adev->vram_lost_counter.
> 
>     >>>>>>
> 
>     >>>>>> As far as I understand it, the case of !guilty &&
> 
>     >>>>>> ctx->reset_counter != adev->reset_counter &&
> 
>     >>>>>> ctx->vram_lost_counter ==
> 
>     >>>>>> adev->vram_lost_counter should return AMDGPU_CTX_NO_RESET,
> 
>     >>>>>> adev->because a
> 
>     >>>>>> GPU reset occurred, but it didn't affect our context.
> 
>     >>>>> I disagree on that.
> 
>     >>>>>
> 
>     >>>>> AMDGPU_CTX_INNOCENT_RESET just means what it does currently, there
> 
>     >>>>> was a reset but we haven't been causing it.
> 
>     >>>>>
> 
>     >>>>> That the OpenGL extension is specified otherwise is unfortunate,
> 
>     >>>>> but I think we shouldn't use that for the kernel interface here.
> 
>     >>>> Two counterpoints:
> 
>     >>>>
> 
>     >>>> 1. Why should any application care that there was a reset while it
> 
>     >>>> was idle? The OpenGL behavior is what makes sense.
> 
>     >>>
> 
>     >>> The application is certainly not interest if a reset happened or
> 
>     >>> not, but I though that the driver stack might be.
> 
>     >>>
> 
>     >>>>
> 
>     >>>> 2. AMDGPU_CTX_INNOCENT_RESET doesn't actually mean anything today
> 
>     >>>> because we never return it :)
> 
>     >>>>
> 
>     >>>
> 
>     >>> Good point.
> 
>     >>>
> 
>     >>>> amdgpu_ctx_query only ever returns AMDGPU_CTX_UNKNOWN_RESET, which
> 
>     >>>> is in line with the OpenGL spec: we're conservatively returning
> 
>     >>>> that a reset happened because we don't know whether the context was
> 
>     >>>> affected, and we return UNKNOWN because we also don't know whether
> 
>     >>>> the context was guilty or not.
> 
>     >>>>
> 
>     >>>> Returning AMDGPU_CTX_NO_RESET in the case of !guilty &&
> 
>     >>>> ctx->vram_lost_counter == adev->vram_lost_counter is simply a
> 
>     >>>> refinement and improvement of the current, overly conservative
> 
>     >>>> behavior.
> 
>     >>>
> 
>     >>> Ok let's reenumerate what I think the different return values should
> 
>     >>> mean:
> 
>     >>>
> 
>     >>> * AMDGPU_CTX_GUILTY_RESET
> 
>     >>>
> 
>     >>> guilty is set to true for this context.
> 
>     >>>
> 
>     >>> * AMDGPU_CTX_INNOCENT_RESET
> 
>     >>>
> 
>     >>> guilty is not set and vram_lost_counter has changed.
> 
>     >>>
> 
>     >>> * AMDGPU_CTX_UNKNOWN_RESET
> 
>     >>>
> 
>     >>> guilty is not set and vram_lost_counter has not changed, but
> 
>     >>> gpu_reset_counter has changed.
> 
>     >>
> 
>     >> I don't understand the distinction you're proposing between
> 
>     >> AMDGPU_CTX_INNOCENT_RESET and AMDGPU_CTX_UNKNOWN_RESET. I think both
> 
>     >> cases you're describing should return either
> 
>     >> AMDGPU_CTX_INNOCENT_RESET, if the value of guilty is reliable, or
> 
>     >> AMDGPU_CTX_UNKNOWN_RESET if it's not.
> 
>     >
> 
>     > I think it'd make more sense if it was called
> 
>     > "AMDGPU_CTX_UNAFFECTED_RESET".
> 
>     >
> 
>     > So:
> 
>     > - AMDGPU_CTX_GUILTY_RESET --> the context was affected by a reset, and
> 
>     > we know that it's the context's fault
> 
>     > - AMDGPU_CTX_INNOCENT_RESET --> the context was affected by a reset,
> 
>     > and we know that it *wasn't* the context's fault (no context job
> 
>     > active)
> 
>     > - AMDGPU_CTX_UNKNOWN_RESET --> the context was affected by a reset,
> 
>     > and we don't know who's responsible (this could be returned in the
> 
>     > unlikely case where context A's gfx job has not yet finished, but
> 
>     > context B's gfx job has already started; it could be the fault of A,
> 
>     > it could be the fault of B -- which somehow manages to hang a part of
> 
>     > the hardware that then prevents A's job from finishing -- or it could
> 
>     > be both; but it's a bit academic)
> 
>     > - AMDGPU_CTX_UNAFFECTED_RESET --> there was a reset, but this context
> 
>     > wasn't affected
> 
>     >
> 
>     > This last value would currently just be discarded by Mesa (because we
> 
>     > should only disturb applications when we have to), but perhaps
> 
>     > somebody else could find it useful?
> 
>     Yes, that's what I had in mind as well.
> 
>     Cause otherwise we would return AMDGPU_CTX_NO_RESET while there
>     actually was a reset and that certainly doesn't sound correct to me.
> 
>     Regards,
> 
>     Christian.
> 
>     >
> 
>     > Cheers,
> 
>     > Nicolai
>