I think the best approach is to keep it as it is right now and not change a thing, and instead add a new IOCTL with somewhat saner return values, e.g. guilty status and VRAM lost status as flags.

Regards,
Christian.

On 13.10.2017 13:51, Liu, Monk wrote:
> Ping Christian & Nicolai
>
> This ctx_query() is a little annoying to me
>
> BR Monk
>
> -----Original Message-----
> From: amd-gfx [mailto:amd-gfx-bounces at lists.freedesktop.org] On Behalf Of Liu, Monk
> Sent: 2017-10-13 17:19
> To: Haehnle, Nicolai <Nicolai.Haehnle at amd.com>; Koenig, Christian <Christian.Koenig at amd.com>; Michel Dänzer <michel at daenzer.net>; Olsak, Marek <Marek.Olsak at amd.com>; Deucher, Alexander <Alexander.Deucher at amd.com>; Zhou, David(ChunMing) <David1.Zhou at amd.com>; Mao, David <David.Mao at amd.com>
> Cc: Ramirez, Alejandro <Alejandro.Ramirez at amd.com>; amd-gfx at lists.freedesktop.org; Filipas, Mario <Mario.Filipas at amd.com>; Ding, Pixel <Pixel.Ding at amd.com>; Li, Bingley <Bingley.Li at amd.com>; Jiang, Jerry (SW) <Jerry.Jiang at amd.com>
> Subject: RE: TDR and VRAM lost handling in KMD (v2)
>
> Alright, if MESA can handle a cloned context's VRAM_LOST_COUNTER mismatch issue, there is no need to introduce one more U/K interface in the KMD.
>
> So we have only one issue unresolved, and it needs to be decided ASAP:
>
> How do we modify amdgpu_ctx_query()?
>
> The current design won't work with what we discussed for TDR (v2), right? Unless MESA stops calling this ctx_query() and relies entirely on the newly introduced interface that fetches the latest vram-lost-counter to judge context health.
>
> BR Monk
>
> -----Original Message-----
> From: Haehnle, Nicolai
> Sent: 2017-10-13 15:43
> To: Liu, Monk <Monk.Liu at amd.com>; Koenig, Christian <Christian.Koenig at amd.com>; Michel Dänzer <michel at daenzer.net>; Olsak, Marek <Marek.Olsak at amd.com>; Deucher, Alexander <Alexander.Deucher at amd.com>; Zhou, David(ChunMing) <David1.Zhou at amd.com>; Mao, David <David.Mao at amd.com>
> Cc: Ramirez, Alejandro <Alejandro.Ramirez at amd.com>; amd-gfx at lists.freedesktop.org; Filipas, Mario <Mario.Filipas at amd.com>; Ding, Pixel <Pixel.Ding at amd.com>; Li, Bingley <Bingley.Li at amd.com>; Jiang, Jerry (SW) <Jerry.Jiang at amd.com>
> Subject: Re: TDR and VRAM lost handling in KMD (v2)
>
> On 13.10.2017 05:57, Liu, Monk wrote:
>>> That sounds sane, but unfortunately might not be possible with the existing IOCTL. Keep in mind that we need to keep backward compatibility here.
>> Unfortunately the current scheme of amdgpu_ctx_query() won't work with the TDR feature, which aims to support Vulkan / Mesa / closed-source OGL / radv ...
>>
>> Its enumeration is too limited for MESA to implement against ...
>>
>> *Do you have a good idea that both keeps compatibility here and gives flexibility?*
>>
>> *Looks like we need to add a new amdgpu_ctx_query_2() interface ...*
>>
>> - *A new IOCTL added for context:*
>>
>> void amdgpu_ctx_reinit()
>> {
>>         ctx->vram_lost_counter = adev->vram_lost_counter;
>>         ctx->reset_counter = adev->reset_counter;
>> }
>>
>>
>> Mhm, is there any advantage to just creating a new context?
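For reference on that question: a freshly created context can simply snapshot the current device counters, so it starts out "clean" without needing a separate re-init step. A rough sketch of that idea; the function and field names are illustrative, not the actual kernel code:

static int amdgpu_ctx_init(struct amdgpu_device *adev, struct amdgpu_ctx *ctx)
{
        ctx->adev = adev;
        /* Snapshot the device state at creation time, so a new context
         * never reports resets or VRAM loss that happened before it
         * existed. */
        ctx->reset_counter = atomic_read(&adev->gpu_reset_counter);
        ctx->vram_lost_counter = atomic_read(&adev->vram_lost_counter);
        return 0;
}

With that, a context re-created after a reset reports a clean status from the start, which is what makes a dedicated re-init IOCTL look redundant.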
>> [ML] Sorry, that amdgpu_ctx_reinit() function is wrong; here is my original idea:
>>
>> MESA can create a new ctx based on an old one, like:
>>
>> Create gl-ctx1,
>>
>> Create BO-A under gl-ctx1 ...
>>
>> VRAM LOST
>>
>> Create gl-ctx2 from gl-ctx1 (share list; I'm not familiar with the UMD side, David Mao and Nicolai can give input)
>>
>> Create BO-B under gl-ctx2 ...
>>
>> Submit a job on gl-ctx2, but it can refer to BO-A.
>>
>> With our design, the kernel won't block the submit from context2 (from gl-ctx2) since its vram_lost_counter equals the latest adev copy.
>>
>> But gl-ctx2 is a clone of gl-ctx1; the only difference is that its vram_lost/gpu_reset counters are updated to the latest values.
>>
>> So logically we should also block submits from gl-ctx2 (mapping to kernel context2), and we fail to do so ...
>>
>> That's why I want to add a new "amdgpu_ctx_clone", which should work like:
>>
>> int amdgpu_ctx_clone(struct context *original, struct context *new)
>> {
>>         new->vram_lost_counter = original->vram_lost_counter;
>>         new->gpu_reset_counter = original->gpu_reset_counter;
>> }

> From the Mesa perspective, I don't think we would need to use that as long as we can query the device VRAM lost counter.
>
> Personally, I think it's better for the UMD to build its policy on lower-level primitives such as the VRAM lost counter query.
>
> Cheers,
> Nicolai
>
>> BR Monk
>>
>> *From:* Koenig, Christian
>> *Sent:* 2017-10-12 19:50
>> *To:* Liu, Monk <Monk.Liu at amd.com>; Haehnle, Nicolai <Nicolai.Haehnle at amd.com>; Michel Dänzer <michel at daenzer.net>; Olsak, Marek <Marek.Olsak at amd.com>; Deucher, Alexander <Alexander.Deucher at amd.com>; Zhou, David(ChunMing) <David1.Zhou at amd.com>; Mao, David <David.Mao at amd.com>
>> *Cc:* Ramirez, Alejandro <Alejandro.Ramirez at amd.com>; amd-gfx at lists.freedesktop.org; Filipas, Mario <Mario.Filipas at amd.com>; Ding, Pixel <Pixel.Ding at amd.com>; Li, Bingley <Bingley.Li at amd.com>; Jiang, Jerry (SW) <Jerry.Jiang at amd.com>
>> *Subject:* Re: TDR and VRAM lost handling in KMD (v2)
>>
>> On 12.10.2017 13:37, Liu, Monk wrote:
>>
>> Hi team
>>
>> Very good, most of the policy and implementation points are agreed on; it looks like we only have some arguments about amdgpu_ctx_query(). Well, I'm also confused by its current implementation :(
>>
>> First, I want to know if you guys agree that we *don't update ctx->reset_counter in amdgpu_ctx_query()*? Because I want to make the query result always consistent for a given context.
>>
>>
>> That sounds like a good idea to me, but I'm not sure if it won't break userspace (I don't think so). Nicolai or Marek need to comment.
>>
>>
>> Second, I want to say that the KERNEL shouldn't use terms from MESA or OGL or VULKAN, e.g. the kernel shouldn't use AMDGPU_INNOCENT_RESET to map to GL_INNOCENT_RESET_ARB, etc...
>>
>> Because that way the kernel becomes tied to certain UMDs, so I suggest we completely rename the context status values, and each UMD has its own way to map the kernel context's result to its gl-context/vk-context/etc ...
>>
>>
>> Yes, completely agree.
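A minimal sketch of the lower-level primitive Nicolai refers to: letting the UMD poll a device-wide VRAM-lost counter through the existing amdgpu INFO query path. The query id AMDGPU_INFO_VRAM_LOST_COUNTER is assumed here (it is not part of the uapi at the time of this thread); amdgpu_query_info() is the existing libdrm wrapper around DRM_AMDGPU_INFO.

#include <stdint.h>
#include <amdgpu.h>
#include <amdgpu_drm.h>

/* Returns the device-wide VRAM-lost counter; the UMD compares it against
 * the value it cached when its context or BOs were created. */
static uint32_t query_vram_lost_counter(amdgpu_device_handle dev)
{
        uint32_t counter = 0;

        /* AMDGPU_INFO_VRAM_LOST_COUNTER is an assumed query id. */
        amdgpu_query_info(dev, AMDGPU_INFO_VRAM_LOST_COUNTER,
                          sizeof(counter), &counter);
        return counter;
}

With such a query, Mesa can decide on its own whether a given context or BO predates the last VRAM loss, which is exactly the policy split argued for above.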
>> Kernel should only provide the below *flag bits* on a given context:
>>
>> - Define AMDGPU_CTX_STATE_GUILTY 0x1 // as soon as TDR detects a job hang, KMD marks the context behind that job as "guilty"
>>
>> - Define AMDGPU_CTX_STATE_VRAM_LOST 0x2 // as soon as a VRAM lost event hits after this context was created, we mark this context "VRAM_LOST", so the UMD can assume that all BOs used under this context may have lost their content; since the kernel keeps no relationship between contexts and BOs, it is the UMD's call to judge whether a BO is considered "VRAM lost" or not
>>
>> - Define AMDGPU_CTX_STATE_RESET 0x4 // as soon as a gpu reset occurs after context creation, this flag shall be set
>>
>>
>> That sounds sane, but unfortunately might not be possible with the existing IOCTL. Keep in mind that we need to keep backward compatibility here.
>>
>>
>> Sample code:
>>
>> int amdgpu_ctx_query(struct amdgpu_ctx_query_parm *out, ...)
>> {
>>         if (ctx->vram_lost_counter != adev->vram_lost_counter)
>>                 out->status |= AMDGPU_CTX_STATE_VRAM_LOST;
>>
>>         if (ctx->reset_counter != adev->reset_counter) {
>>                 out->status |= AMDGPU_CTX_STATE_RESET;
>>                 if (ctx->guilty == true)
>>                         out->status |= AMDGPU_CTX_STATE_GUILTY;
>>         }
>>
>>         return 0;
>> }
>>
>> For the UMD, "out.status == 0" means that no gpu reset occurred at all and the context is completely regular.
>>
>> - *A new IOCTL added for context:*
>>
>> void amdgpu_ctx_reinit()
>> {
>>         ctx->vram_lost_counter = adev->vram_lost_counter;
>>         ctx->reset_counter = adev->reset_counter;
>> }
>>
>>
>> Mhm, is there any advantage to just creating a new context?
>>
>> Regards,
>> Christian.
>>
>>
>> *If the UMD decides *not* to release the "guilty" context but to keep using it after it has acknowledged the GPU hang on a certain job/context, I suggest the UMD call "amdgpu_ctx_reinit()":*
>>
>> That way, after you re-init() this context, you get an updated result from "amdgpu_ctx_query", which will give you "out.status == 0" again as long as no gpu reset / vram lost has hit after the re-init().
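To illustrate the earlier point that each UMD should do its own mapping of the kernel result: with the flag bits sketched above, a GL driver could translate the context status into the ARB_robustness reset-status enums roughly as follows. The AMDGPU_CTX_STATE_* values are only the proposal from this mail, not an existing interface, and the helper is a sketch rather than Mesa code.

#include <stdint.h>
#include <GL/gl.h>
#include <GL/glext.h>

/* Flag bits as proposed in this thread (not in the uapi). */
#define AMDGPU_CTX_STATE_GUILTY    0x1
#define AMDGPU_CTX_STATE_VRAM_LOST 0x2
#define AMDGPU_CTX_STATE_RESET     0x4

static GLenum gl_reset_status_from_ctx_flags(uint32_t status)
{
        if (status & AMDGPU_CTX_STATE_GUILTY)
                return GL_GUILTY_CONTEXT_RESET_ARB;
        if (status & AMDGPU_CTX_STATE_VRAM_LOST)
                return GL_INNOCENT_CONTEXT_RESET_ARB;
        /* A reset may have occurred (AMDGPU_CTX_STATE_RESET), but this
         * context and its buffers were not affected, so GL does not
         * need to report anything to the application. */
        return GL_NO_ERROR;
}

A Vulkan driver would do the analogous mapping for its device-lost handling; the kernel flags stay API-agnostic.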
>> BR Monk
>>
>> -----Original Message-----
>> From: Koenig, Christian
>> Sent: 2017-10-12 18:13
>> To: Haehnle, Nicolai <Nicolai.Haehnle at amd.com>; Michel Dänzer <michel at daenzer.net>; Liu, Monk <Monk.Liu at amd.com>; Olsak, Marek <Marek.Olsak at amd.com>; Deucher, Alexander <Alexander.Deucher at amd.com>; Zhou, David(ChunMing) <David1.Zhou at amd.com>; Mao, David <David.Mao at amd.com>
>> Cc: Ramirez, Alejandro <Alejandro.Ramirez at amd.com>; amd-gfx at lists.freedesktop.org; Filipas, Mario <Mario.Filipas at amd.com>; Ding, Pixel <Pixel.Ding at amd.com>; Li, Bingley <Bingley.Li at amd.com>; Jiang, Jerry (SW) <Jerry.Jiang at amd.com>
>> Subject: Re: TDR and VRAM lost handling in KMD (v2)
>>
>> On 12.10.2017 11:44, Nicolai Hähnle wrote:
>> > On 12.10.2017 11:35, Michel Dänzer wrote:
>> >> On 12/10/17 11:23 AM, Christian König wrote:
>> >>> On 12.10.2017 11:10, Nicolai Hähnle wrote:
>> >>>> On 12.10.2017 10:49, Christian König wrote:
>> >>>>>> However, !guilty && ctx->reset_counter != adev->reset_counter does not imply that the context was lost.
>> >>>>>>
>> >>>>>> The way I understand it, we should return AMDGPU_CTX_INNOCENT_RESET if !guilty && ctx->vram_lost_counter != adev->vram_lost_counter.
>> >>>>>>
>> >>>>>> As far as I understand it, the case of !guilty && ctx->reset_counter != adev->reset_counter && ctx->vram_lost_counter == adev->vram_lost_counter should return AMDGPU_CTX_NO_RESET, because a GPU reset occurred, but it didn't affect our context.
>> >>>>> I disagree on that.
>> >>>>>
>> >>>>> AMDGPU_CTX_INNOCENT_RESET just means what it does currently: there was a reset, but we haven't been causing it.
>> >>>>>
>> >>>>> That the OpenGL extension is specified otherwise is unfortunate, but I think we shouldn't use that for the kernel interface here.
>> >>>> Two counterpoints:
>> >>>>
>> >>>> 1. Why should any application care that there was a reset while it was idle? The OpenGL behavior is what makes sense.
>> >>>
>> >>> The application is certainly not interested in whether a reset happened or not, but I thought that the driver stack might be.
>> >>>
>> >>>> 2. AMDGPU_CTX_INNOCENT_RESET doesn't actually mean anything today because we never return it :)
>> >>>
>> >>> Good point.
>> >>>
>> >>>> amdgpu_ctx_query only ever returns AMDGPU_CTX_UNKNOWN_RESET, which is in line with the OpenGL spec: we're conservatively returning that a reset happened because we don't know whether the context was affected, and we return UNKNOWN because we also don't know whether the context was guilty or not.
>> >>>>
>> >>>> Returning AMDGPU_CTX_NO_RESET in the case of !guilty && ctx->vram_lost_counter == adev->vram_lost_counter is simply a refinement and improvement of the current, overly conservative behavior.
>> >>>
>> >>> Ok, let's re-enumerate what I think the different return values should mean:
>> >>>
>> >>> * AMDGPU_CTX_GUILTY_RESET
>> >>>
>> >>> guilty is set to true for this context.
>> >>>
>> >>> * AMDGPU_CTX_INNOCENT_RESET
>> >>>
>> >>> guilty is not set and vram_lost_counter has changed.
>> >>>
>> >>> * AMDGPU_CTX_UNKNOWN_RESET
>> >>>
>> >>> guilty is not set and vram_lost_counter has not changed, but gpu_reset_counter has changed.
>> >>
>> >> I don't understand the distinction you're proposing between AMDGPU_CTX_INNOCENT_RESET and AMDGPU_CTX_UNKNOWN_RESET. I think both cases you're describing should return either AMDGPU_CTX_INNOCENT_RESET, if the value of guilty is reliable, or AMDGPU_CTX_UNKNOWN_RESET if it's not.
>> >
>> > I think it'd make more sense if it was called "AMDGPU_CTX_UNAFFECTED_RESET".
>> >
>> > So:
>> > - AMDGPU_CTX_GUILTY_RESET --> the context was affected by a reset, and we know that it's the context's fault
>> > - AMDGPU_CTX_INNOCENT_RESET --> the context was affected by a reset, and we know that it *wasn't* the context's fault (no context job active)
>> > - AMDGPU_CTX_UNKNOWN_RESET --> the context was affected by a reset, and we don't know who's responsible (this could be returned in the unlikely case where context A's gfx job has not yet finished, but context B's gfx job has already started; it could be the fault of A, it could be the fault of B -- which somehow manages to hang a part of the hardware that then prevents A's job from finishing -- or it could be both; but it's a bit academic)
>> > - AMDGPU_CTX_UNAFFECTED_RESET --> there was a reset, but this context wasn't affected
>> >
>> > This last value would currently just be discarded by Mesa (because we should only disturb applications when we have to), but perhaps somebody else could find it useful?
>>
>> Yes, that's what I had in mind as well.
>>
>> Cause otherwise we would return AMDGPU_CTX_NO_RESET while there actually was a reset, and that certainly doesn't sound correct to me.
>>
>> Regards,
>> Christian.
>>
>> > Cheers,
>> > Nicolai
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
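Pulling the discussion together, here is a minimal sketch of the reset-status decision the thread converges on. AMDGPU_CTX_GUILTY_RESET, AMDGPU_CTX_INNOCENT_RESET, AMDGPU_CTX_UNKNOWN_RESET and AMDGPU_CTX_NO_RESET already exist in amdgpu_drm.h; AMDGPU_CTX_UNAFFECTED_RESET is only proposed above, and the helper, field names and atomic counters are illustrative rather than the actual amdgpu_ctx_query() implementation.

/* Sketch only: maps the per-context guilty flag plus the two device
 * counters onto the reset-status values discussed above. */
static uint32_t ctx_reset_status(struct amdgpu_ctx *ctx,
                                 struct amdgpu_device *adev)
{
        bool reset     = ctx->reset_counter != atomic_read(&adev->gpu_reset_counter);
        bool vram_lost = ctx->vram_lost_counter != atomic_read(&adev->vram_lost_counter);

        if (!reset)
                return AMDGPU_CTX_NO_RESET;
        if (ctx->guilty)
                return AMDGPU_CTX_GUILTY_RESET;
        if (vram_lost)
                return AMDGPU_CTX_INNOCENT_RESET;
        /* A reset happened, but neither this context's jobs nor its VRAM
         * contents were affected; AMDGPU_CTX_UNKNOWN_RESET stays reserved
         * for the case where guilt genuinely cannot be determined. */
        return AMDGPU_CTX_UNAFFECTED_RESET;
}

Per Nicolai's note, Mesa would simply treat AMDGPU_CTX_UNAFFECTED_RESET like AMDGPU_CTX_NO_RESET, so applications are only disturbed when their own context was actually affected.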