Alright, if MESA can handle clone context's VRAM_LOST_COUNTER mismatch issue, no need to introduce one more U/K in kmd, So we have only one issue unresolved and need determined ASAP: How to modify amdgpu_ctx_query() ?? Current design won't work later with our discussion on the TDR (v2) right ? unless MESA stop calling this ctx_query() and totally depend on new introduced interface that get latest vram-lost-counter to judge Context healthy. BR Monk -----Original Message----- From: Haehnle, Nicolai Sent: 2017å¹´10æ??13æ?¥ 15:43 To: Liu, Monk <Monk.Liu at amd.com>; Koenig, Christian <Christian.Koenig at amd.com>; Michel Dänzer <michel at daenzer.net>; Olsak, Marek <Marek.Olsak at amd.com>; Deucher, Alexander <Alexander.Deucher at amd.com>; Zhou, David(ChunMing) <David1.Zhou at amd.com>; Mao, David <David.Mao at amd.com> Cc: Ramirez, Alejandro <Alejandro.Ramirez at amd.com>; amd-gfx at lists.freedesktop.org; Filipas, Mario <Mario.Filipas at amd.com>; Ding, Pixel <Pixel.Ding at amd.com>; Li, Bingley <Bingley.Li at amd.com>; Jiang, Jerry (SW) <Jerry.Jiang at amd.com> Subject: Re: TDR and VRAM lost handling in KMD (v2) On 13.10.2017 05:57, Liu, Monk wrote: >>That sounds sane, but unfortunately might not be possible with the existing IOCTL. Keep in mind that we need to keep backward compatibility here. > > unfortunately the current scheme on amdgpu_ctx_query() wonâ??t work with > TDR feature, which is aim to support vulkan/mesa/close-ogl/radv â?¦ > > Itâ??s enumeration is too limited to MESA implement â?¦ > > *Do you have good idea ? both keep the compatibility here and give > flexible ?* > > *looks like we need to add a new amdgpu_ctx_query_2() INTERFACE â?¦.* > > ·*A new IOCTL added for context:* > > Void amdgpu_ctx_reinit(){ > >        Ctxâ??vram_lost_counter = adevâ??vram_lost_counter; > >        Ctxâ??reset_counter = adevâ??reset_counter; > > } > > > Mhm, is there any advantage to just creating a new context? > > [ML] sorry, this function is wrong, here is my original idea: > > MESA can create a new ctx based on an old one,like: > > Create gl-ctx1, > > Create BO-A under gl-ctx1 â?¦ > > VRAM LOST > > Create gl-ctx2 from gl-ctx1 (share list, Iâ??m not familiar with UMD, > David Mao an Nicolai can input) > > Create BO-b UNDER gl-ctx2 â?¦ > > Submit job upon gl-ctx2, but it can refer to BO-A, > > With our design, kernel wonâ??t block submit from context2 (from > gl-ctx2) since its vram_lost_counter equals to latest adev copy > > But gl-ctx2 is a clone from gl-ctx1, the only difference is the > vram_lost/gpu_reset counter is updated to latest one > > So logically we should also block submit from gl-ctx2 (mapping to > kernel context2), and we failed do so â?¦ > > Thatâ??s why I want to add a new â??amdgpu_ctx_cloneâ??, which should work like: > > Int amdgpu_ctx_clone(struct context *original, struct context *new) { > >    New->vram_lost_counter = original->vram_lost_counter; > >    New->gpu_reset_counter = original->gpu_reset_counter; > > } From the Mesa perspective, I don't think we would need to use that as long as we can query the device VRAM lost counter. Personally, I think it's better for the UMD to build its policy on lower-level primitives such as the VRAM lost counter query. Cheers, Nicolai > > BR Monk > > *From:*Koenig, Christian > *Sent:* 2017å¹´10æ??12æ?¥19:50 > *To:* Liu, Monk <Monk.Liu at amd.com>; Haehnle, Nicolai > <Nicolai.Haehnle at amd.com>; Michel Dänzer <michel at daenzer.net>; Olsak, > Marek <Marek.Olsak at amd.com>; Deucher, Alexander > <Alexander.Deucher at amd.com>; Zhou, David(ChunMing) > <David1.Zhou at amd.com>; Mao, David <David.Mao at amd.com> > *Cc:* Ramirez, Alejandro <Alejandro.Ramirez at amd.com>; > amd-gfx at lists.freedesktop.org; Filipas, Mario <Mario.Filipas at amd.com>; > Ding, Pixel <Pixel.Ding at amd.com>; Li, Bingley <Bingley.Li at amd.com>; > Jiang, Jerry (SW) <Jerry.Jiang at amd.com> > *Subject:* Re: TDR and VRAM lost handling in KMD (v2) > > Am 12.10.2017 um 13:37 schrieb Liu, Monk: > > Hi team > > Very good, many policy and implement are agreed, looks we only have > some arguments in amdgpu_ctx_query(), well I also confused with the > current implement of it, â?¹ > > First, I want to know if you guys agree that we*don't update > ctx->reset_counter in amdgpu_ctx_query() *? because I want to make > the query result always consistent upon a given context, > > > That sounds like a good idea to me, but I'm not sure if it won't break > userspace (I don't think so). Nicolai or Marek need to comment. > > > Second, I want to say that for KERNEL, it shouldn't use the term > from MESA or OGL or VULKAN, e.g. kernel shouldn't use > AMDGPU_INNOCENT_RESET to map to GL_INNOCENT_RESET_ARB, etc... > > Because that way kernel will be very limited to certain UMD, so I > suggest we totally re-name the context status, and each UMD has its > own way to map the kernel context's result to > gl-context/vk-context/etcâ?¦ > > > Yes, completely agree. > > > Kernel should only provide below **FLAG bits** on a given context: > > ·Define AMDGPU_CTX_STATE_GUILTY 0x1            //as long as TDR > detects a job hang, KMD set the context behind this context as "guilty" > > ·Define AMDGPU_CTX_STATE_VRAMLOST        0x2     //as long as > there is a VRAM lost event hit after this context created, we mark > this context "VRAM_LOST", so UMD can say that all BO under this > context may lost their content, since kernel have no relationship > between context and BO so this is UMD's call to judge if a BO > considered "VRAM lost" or not. > > ·Define AMDGPU_CTX_STATE_RESET  0x3    //as long as there is a gpu > reset occurred after context creation, this flag shall be set > > > That sounds sane, but unfortunately might not be possible with the > existing IOCTL. Keep in mind that we need to keep backward > compatibility here. > > > Sample code: > > Int amdgpu_ctx_query(struct amdgpu_ctx_query_parm * out, â?¦..) { > >        if (ctx- >vram_lost_counter != adev->vram_lost_counter) > >                out- >status |= AMDGPU_CTX_STATE_VRAM_LOST; > > if (ctx- >reset_counter != adevâ??reset_counter) { > > out- >status |= AMDGPU_CTX_STATE_RESET; > > if (ctx- >guilty == TRUE) > >                        out- >status |= AMDGPU_CTX_STATE_GUILTY; > > } > > return 0; > > } > > For UMD if it found "out.status == 0" means there is no gpu reset > even occurred, the context is totally regular, > > ·*A new IOCTL added for context:* > > Void amdgpu_ctx_reinit(){ > >        Ctxâ??vram_lost_counter = adevâ??vram_lost_counter; > >        Ctxâ??reset_counter = adevâ??reset_counter; > > } > > > Mhm, is there any advantage to just creating a new context? > > Regards, > Christian. > > > *if UMD decide *not* to release the "guilty" context but continue > using it after UMD acknowledged GPU hang on certain job/context, I > suggest UMD call "amdgpu_ctx_reinit()":* > > That way after you re-init() this context, you can get updated > result from "amdgpu_ctx_query", which will probably give you > "out.status == 0" as long as no gpu reset/vram lost hit after re-init(). > > BR Monk > > -----Original Message----- > From: Koenig, Christian > Sent: 2017å¹´10æ??12æ?¥18:13 > To: Haehnle, Nicolai <Nicolai.Haehnle at amd.com> > <mailto:Nicolai.Haehnle at amd.com>; Michel Dänzer <michel at daenzer.net> > <mailto:michel at daenzer.net>; Liu, Monk <Monk.Liu at amd.com> > <mailto:Monk.Liu at amd.com>; Olsak, Marek <Marek.Olsak at amd.com> > <mailto:Marek.Olsak at amd.com>; Deucher, Alexander > <Alexander.Deucher at amd.com> <mailto:Alexander.Deucher at amd.com>; > Zhou, David(ChunMing) <David1.Zhou at amd.com> > <mailto:David1.Zhou at amd.com>; Mao, David <David.Mao at amd.com> > <mailto:David.Mao at amd.com> > Cc: Ramirez, Alejandro <Alejandro.Ramirez at amd.com> > <mailto:Alejandro.Ramirez at amd.com>; amd-gfx at lists.freedesktop.org > <mailto:amd-gfx at lists.freedesktop.org>; Filipas, Mario > <Mario.Filipas at amd.com> <mailto:Mario.Filipas at amd.com>; Ding, Pixel > <Pixel.Ding at amd.com> <mailto:Pixel.Ding at amd.com>; Li, Bingley > <Bingley.Li at amd.com> <mailto:Bingley.Li at amd.com>; Jiang, Jerry (SW) > <Jerry.Jiang at amd.com> <mailto:Jerry.Jiang at amd.com> > Subject: Re: TDR and VRAM lost handling in KMD (v2) > > Am 12.10.2017 um 11:44 schrieb Nicolai Hähnle: > > > On 12.10.2017 11:35, Michel Dänzer wrote: > > >> On 12/10/17 11:23 AM, Christian König wrote: > > >>> Am 12.10.2017 um 11:10 schrieb Nicolai Hähnle: > > >>>> On 12.10.2017 10:49, Christian König wrote: > > >>>>>> However, !guilty && ctx->reset_counter != > adev->reset_counter > > >>>>>> does not imply that the context was lost. > > >>>>>> > > >>>>>> The way I understand it, we should return > > >>>>>> AMDGPU_CTX_INNOCENT_RESET if !guilty && ctx->vram_lost_counter != adev->vram_lost_counter. > > >>>>>> > > >>>>>> As far as I understand it, the case of !guilty && > > >>>>>> ctx->reset_counter != adev->reset_counter && > > >>>>>> ctx->vram_lost_counter == > > >>>>>> adev->vram_lost_counter should return AMDGPU_CTX_NO_RESET, > > >>>>>> adev->because a > > >>>>>> GPU reset occurred, but it didn't affect our context. > > >>>>> I disagree on that. > > >>>>> > > >>>>> AMDGPU_CTX_INNOCENT_RESET just means what it does currently, > there > > >>>>> was a reset but we haven't been causing it. > > >>>>> > > >>>>> That the OpenGL extension is specified otherwise is > unfortunate, > > >>>>> but I think we shouldn't use that for the kernel interface here. > > >>>> Two counterpoints: > > >>>> > > >>>> 1. Why should any application care that there was a reset > while it > > >>>> was idle? The OpenGL behavior is what makes sense. > > >>> > > >>> The application is certainly not interest if a reset happened > or > > >>> not, but I though that the driver stack might be. > > >>> > > >>>> > > >>>> 2. AMDGPU_CTX_INNOCENT_RESET doesn't actually mean anything > today > > >>>> because we never return it :) > > >>>> > > >>> > > >>> Good point. > > >>> > > >>>> amdgpu_ctx_query only ever returns AMDGPU_CTX_UNKNOWN_RESET, > which > > >>>> is in line with the OpenGL spec: we're conservatively > returning > > >>>> that a reset happened because we don't know whether the > context was > > >>>> affected, and we return UNKNOWN because we also don't know > whether > > >>>> the context was guilty or not. > > >>>> > > >>>> Returning AMDGPU_CTX_NO_RESET in the case of !guilty && > > >>>> ctx->vram_lost_counter == adev->vram_lost_counter is simply a > > >>>> refinement and improvement of the current, overly > conservative > > >>>> behavior. > > >>> > > >>> Ok let's reenumerate what I think the different return values > should > > >>> mean: > > >>> > > >>> * AMDGPU_CTX_GUILTY_RESET > > >>> > > >>> guilty is set to true for this context. > > >>> > > >>> * AMDGPU_CTX_INNOCENT_RESET > > >>> > > >>> guilty is not set and vram_lost_counter has changed. > > >>> > > >>> * AMDGPU_CTX_UNKNOWN_RESET > > >>> > > >>> guilty is not set and vram_lost_counter has not changed, but > > >>> gpu_reset_counter has changed. > > >> > > >> I don't understand the distinction you're proposing between > > >> AMDGPU_CTX_INNOCENT_RESET and AMDGPU_CTX_UNKNOWN_RESET. I think > both > > >> cases you're describing should return either > > >> AMDGPU_CTX_INNOCENT_RESET, if the value of guilty is reliable, > or > > >> AMDGPU_CTX_UNKNOWN_RESET if it's not. > > > > > > I think it'd make more sense if it was called > > > "AMDGPU_CTX_UNAFFECTED_RESET". > > > > > > So: > > > - AMDGPU_CTX_GUILTY_RESET --> the context was affected by a > reset, and > > > we know that it's the context's fault > > > - AMDGPU_CTX_INNOCENT_RESET --> the context was affected by a > reset, > > > and we know that it *wasn't* the context's fault (no context job > > > active) > > > - AMDGPU_CTX_UNKNOWN_RESET --> the context was affected by a > reset, > > > and we don't know who's responsible (this could be returned in > the > > > unlikely case where context A's gfx job has not yet finished, > but > > > context B's gfx job has already started; it could be the fault > of A, > > > it could be the fault of B -- which somehow manages to hang a > part of > > > the hardware that then prevents A's job from finishing -- or it > could > > > be both; but it's a bit academic) > > > - AMDGPU_CTX_UNAFFECTED_RESET --> there was a reset, but this > context > > > wasn't affected > > > > > > This last value would currently just be discarded by Mesa > (because we > > > should only disturb applications when we have to), but perhaps > > > somebody else could find it useful? > > Yes, that's what I had in mind as well. > > Cause otherwise we would return AMDGPU_CTX_NO_RESET while there > actually was a reset and that certainly doesn't sound correct to me. > > Regards, > > Christian. > > > > > > Cheers, > > > Nicolai >