On 13.10.2017 14:04, Liu, Monk wrote:
> That's what I suggested, good to know it's agreed

Yeah, that's fine with me as well.

Cheers,
Nicolai

>
> BR Monk
>
> -----Original Message-----
> From: Koenig, Christian
> Sent: 13 October 2017 20:01
> To: Liu, Monk <Monk.Liu at amd.com>; Haehnle, Nicolai <Nicolai.Haehnle at amd.com>; Michel Dänzer <michel at daenzer.net>; Olsak, Marek <Marek.Olsak at amd.com>; Deucher, Alexander <Alexander.Deucher at amd.com>; Zhou, David(ChunMing) <David1.Zhou at amd.com>; Mao, David <David.Mao at amd.com>
> Cc: Ramirez, Alejandro <Alejandro.Ramirez at amd.com>; amd-gfx at lists.freedesktop.org; Filipas, Mario <Mario.Filipas at amd.com>; Ding, Pixel <Pixel.Ding at amd.com>; Li, Bingley <Bingley.Li at amd.com>; Jiang, Jerry (SW) <Jerry.Jiang at amd.com>
> Subject: Re: TDR and VRAM lost handling in KMD (v2)
>
> I think the best approach is to keep it as it is right now and not change a thing.
>
> And we add a new IOCTL with a bit more sane return values, e.g. guilty status and VRAM lost status as flags.
>
> Regards,
> Christian.
>
> On 13.10.2017 at 13:51, Liu, Monk wrote:
>> Ping Christian & Nicolai
>>
>> This ctx_query() is a little annoying to me
>>
>> BR Monk
>>
>> -----Original Message-----
>> From: amd-gfx [mailto:amd-gfx-bounces at lists.freedesktop.org] On Behalf Of Liu, Monk
>> Sent: 13 October 2017 17:19
>> To: Haehnle, Nicolai <Nicolai.Haehnle at amd.com>; Koenig, Christian <Christian.Koenig at amd.com>; Michel Dänzer <michel at daenzer.net>; Olsak, Marek <Marek.Olsak at amd.com>; Deucher, Alexander <Alexander.Deucher at amd.com>; Zhou, David(ChunMing) <David1.Zhou at amd.com>; Mao, David <David.Mao at amd.com>
>> Cc: Ramirez, Alejandro <Alejandro.Ramirez at amd.com>; amd-gfx at lists.freedesktop.org; Filipas, Mario <Mario.Filipas at amd.com>; Ding, Pixel <Pixel.Ding at amd.com>; Li, Bingley <Bingley.Li at amd.com>; Jiang, Jerry (SW) <Jerry.Jiang at amd.com>
>> Subject: RE: TDR and VRAM lost handling in KMD (v2)
>>
>> Alright, if MESA can handle the cloned context's VRAM_LOST_COUNTER mismatch issue, there is no need to introduce one more U/K interface in the KMD.
>>
>> So we have only one issue unresolved, which needs to be determined ASAP:
>>
>> How do we modify amdgpu_ctx_query()?
>>
>> The current design won't work later with our discussion on the TDR (v2), right? Unless MESA stops calling this ctx_query() and depends entirely on a newly introduced interface that gets the latest vram-lost-counter to judge whether a context is healthy.
>>
>> BR Monk
>>
>> -----Original Message-----
>> From: Haehnle, Nicolai
>> Sent: 13 October 2017 15:43
>> To: Liu, Monk <Monk.Liu at amd.com>; Koenig, Christian <Christian.Koenig at amd.com>; Michel Dänzer <michel at daenzer.net>; Olsak, Marek <Marek.Olsak at amd.com>; Deucher, Alexander <Alexander.Deucher at amd.com>; Zhou, David(ChunMing) <David1.Zhou at amd.com>; Mao, David <David.Mao at amd.com>
>> Cc: Ramirez, Alejandro <Alejandro.Ramirez at amd.com>; amd-gfx at lists.freedesktop.org; Filipas, Mario <Mario.Filipas at amd.com>; Ding, Pixel <Pixel.Ding at amd.com>; Li, Bingley <Bingley.Li at amd.com>; Jiang, Jerry (SW) <Jerry.Jiang at amd.com>
>> Subject: Re: TDR and VRAM lost handling in KMD (v2)
>>
>> On 13.10.2017 05:57, Liu, Monk wrote:
>>>> That sounds sane, but unfortunately might not be possible with the existing IOCTL. Keep in mind that we need to keep backward compatibility here.
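(A rough, purely illustrative sketch of the flag-based query Christian suggests above: the amdgpu_ctx_query2() entry point and the AMDGPU_CTX_STATE2_* flag names are invented for this example and are not existing amdgpu UAPI. The point is only how a UMD could consume such flags while the old IOCTL stays untouched for backward compatibility.)

    /* Illustrative only: amdgpu_ctx_query2() and the AMDGPU_CTX_STATE2_* flags
     * are hypothetical names, not existing amdgpu UAPI. */
    #include <stdint.h>
    #include <stdio.h>

    #define AMDGPU_CTX_STATE2_GUILTY    (1u << 0)  /* a job of this context caused the hang */
    #define AMDGPU_CTX_STATE2_VRAMLOST  (1u << 1)  /* VRAM content was lost after ctx creation */
    #define AMDGPU_CTX_STATE2_RESET     (1u << 2)  /* some GPU reset happened after ctx creation */

    /* Stub standing in for the hypothetical new query; returning -1 here models
     * an old kernel that only offers the existing amdgpu_ctx_query(). */
    static int amdgpu_ctx_query2(void *ctx, uint32_t *flags)
    {
        (void)ctx;
        *flags = 0;
        return -1;
    }

    static void umd_check_context(void *ctx)
    {
        uint32_t flags;

        if (amdgpu_ctx_query2(ctx, &flags) != 0)
            return; /* fall back to the existing amdgpu_ctx_query() path */

        if (flags == 0)
            return; /* no reset at all since this context was created */

        if (flags & AMDGPU_CTX_STATE2_VRAMLOST)
            printf("BOs referenced by this context may have lost their content\n");
        if (flags & AMDGPU_CTX_STATE2_GUILTY)
            printf("this context caused the hang and should be re-created\n");
    }

On kernels without such a new interface the UMD would simply keep using the existing query, which is what keeps the backward compatibility Christian asks for.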
>>> unfortunately the current scheme of amdgpu_ctx_query() won't work with the TDR feature, which is aimed at supporting Vulkan/Mesa/closed-OGL/radv ...
>>>
>>> Its enumeration is too limited for MESA to implement against ...
>>>
>>> Do you have a good idea that both keeps the compatibility here and stays flexible?
>>>
>>> Looks like we need to add a new amdgpu_ctx_query_2() INTERFACE ...
>>>
>>> · A new IOCTL added for context:
>>>
>>> void amdgpu_ctx_reinit() {
>>>     ctx->vram_lost_counter = adev->vram_lost_counter;
>>>     ctx->reset_counter = adev->reset_counter;
>>> }
>>>
>>> Mhm, is there any advantage to just creating a new context?
>>>
>>> [ML] sorry, this function is wrong, here is my original idea:
>>>
>>> MESA can create a new ctx based on an old one, like:
>>>
>>> Create gl-ctx1,
>>> Create BO-A under gl-ctx1 ...
>>> VRAM LOST
>>> Create gl-ctx2 from gl-ctx1 (share list; I'm not familiar with UMD, David Mao and Nicolai can give input)
>>> Create BO-B under gl-ctx2 ...
>>> Submit a job upon gl-ctx2, but it can refer to BO-A.
>>>
>>> With our design, the kernel won't block the submit from context2 (from gl-ctx2) since its vram_lost_counter equals the latest adev copy.
>>>
>>> But gl-ctx2 is a clone of gl-ctx1; the only difference is that the vram_lost/gpu_reset counters are updated to the latest ones.
>>>
>>> So logically we should also block the submit from gl-ctx2 (mapping to kernel context2), and we fail to do so ...
>>>
>>> That's why I want to add a new "amdgpu_ctx_clone", which should work like:
>>>
>>> int amdgpu_ctx_clone(struct context *original, struct context *new) {
>>>     new->vram_lost_counter = original->vram_lost_counter;
>>>     new->gpu_reset_counter = original->gpu_reset_counter;
>>> }
>>
>> From the Mesa perspective, I don't think we would need to use that as long as we can query the device VRAM lost counter.
>>
>> Personally, I think it's better for the UMD to build its policy on lower-level primitives such as the VRAM lost counter query.
>>
>> Cheers,
>> Nicolai
>>
>>> BR Monk
>>>
>>> From: Koenig, Christian
>>> Sent: 12 October 2017 19:50
>>> To: Liu, Monk <Monk.Liu at amd.com>; Haehnle, Nicolai <Nicolai.Haehnle at amd.com>; Michel Dänzer <michel at daenzer.net>; Olsak, Marek <Marek.Olsak at amd.com>; Deucher, Alexander <Alexander.Deucher at amd.com>; Zhou, David(ChunMing) <David1.Zhou at amd.com>; Mao, David <David.Mao at amd.com>
>>> Cc: Ramirez, Alejandro <Alejandro.Ramirez at amd.com>; amd-gfx at lists.freedesktop.org; Filipas, Mario <Mario.Filipas at amd.com>; Ding, Pixel <Pixel.Ding at amd.com>; Li, Bingley <Bingley.Li at amd.com>; Jiang, Jerry (SW) <Jerry.Jiang at amd.com>
>>> Subject: Re: TDR and VRAM lost handling in KMD (v2)
>>>
>>> On 12.10.2017 at 13:37, Liu, Monk wrote:
>>>
>>>     Hi team
>>>
>>>     Very good, many policies and implementation details are agreed; looks like we only have some arguments about amdgpu_ctx_query(). Well, I'm also confused by the current implementation of it ☹
>>>
>>>     First, I want to know if you guys agree that we *don't update ctx->reset_counter in amdgpu_ctx_query()*? Because I want to make the query result always consistent for a given context.
>>>
>>> That sounds like a good idea to me, but I'm not sure if it won't break userspace (I don't think so). Nicolai or Marek need to comment.
>>>
>>>     Second, I want to say that the KERNEL shouldn't use terms from MESA or OGL or VULKAN; e.g.
the kernel shouldn't use AMDGPU_INNOCENT_RESET to map to GL_INNOCENT_RESET_ARB, etc.
>>>
>>>     Because that way the kernel would be tied to a certain UMD, so I suggest we completely rename the context status, and each UMD has its own way to map the kernel context's result to its gl-context/vk-context/etc. ...
>>>
>>> Yes, completely agree.
>>>
>>>     The kernel should only provide the below **FLAG bits** on a given context:
>>>
>>>     · Define AMDGPU_CTX_STATE_GUILTY 0x1 // as soon as TDR detects a job hang, KMD marks the context behind that job as "guilty"
>>>     · Define AMDGPU_CTX_STATE_VRAMLOST 0x2 // as soon as a VRAM lost event hits after this context was created, we mark this context "VRAM_LOST", so the UMD can assume that all BOs under this context may have lost their content; since the kernel keeps no relationship between contexts and BOs, it is the UMD's call to judge whether a BO should be considered "VRAM lost" or not
>>>     · Define AMDGPU_CTX_STATE_RESET 0x3 // as soon as a GPU reset occurs after context creation, this flag shall be set
>>>
>>> That sounds sane, but unfortunately might not be possible with the existing IOCTL. Keep in mind that we need to keep backward compatibility here.
>>>
>>>     Sample code:
>>>
>>>     int amdgpu_ctx_query(struct amdgpu_ctx_query_parm *out, ...) {
>>>         if (ctx->vram_lost_counter != adev->vram_lost_counter)
>>>             out->status |= AMDGPU_CTX_STATE_VRAM_LOST;
>>>         if (ctx->reset_counter != adev->reset_counter) {
>>>             out->status |= AMDGPU_CTX_STATE_RESET;
>>>             if (ctx->guilty == TRUE)
>>>                 out->status |= AMDGPU_CTX_STATE_GUILTY;
>>>         }
>>>         return 0;
>>>     }
>>>
>>>     For the UMD, finding "out.status == 0" means no GPU reset ever occurred and the context is completely regular.
>>>
>>>     · A new IOCTL added for context:
>>>
>>>     void amdgpu_ctx_reinit() {
>>>         ctx->vram_lost_counter = adev->vram_lost_counter;
>>>         ctx->reset_counter = adev->reset_counter;
>>>     }
>>>
>>> Mhm, is there any advantage to just creating a new context?
>>>
>>> Regards,
>>> Christian.
>>>
>>>     If the UMD decides *not* to release the "guilty" context but to continue using it after the UMD acknowledged the GPU hang on a certain job/context, I suggest the UMD call "amdgpu_ctx_reinit()":
>>>
>>>     That way, after you re-init() this context, you can get an updated result from "amdgpu_ctx_query", which will probably give you "out.status == 0" as long as no GPU reset/VRAM lost has hit after the re-init().
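(Purely as an illustration of the "each UMD maps the kernel result itself" idea above: a GL UMD could translate Monk's proposed AMDGPU_CTX_STATE_* flag bits -- which are only a proposal in this thread, not existing UAPI -- into the ARB_robustness reset status enums roughly as sketched below. Whether an unaffected reset is reported to the application at all is the UMD's policy choice, as discussed later in the thread.)

    /* Illustration only: AMDGPU_CTX_STATE_* are Monk's proposed flag bits from
     * this thread, not existing UAPI; the GL values come from ARB_robustness. */
    #define AMDGPU_CTX_STATE_GUILTY    0x1
    #define AMDGPU_CTX_STATE_VRAMLOST  0x2
    #define AMDGPU_CTX_STATE_RESET     0x4  /* 0x3 in the proposal above; a power of two is used here so it ORs cleanly as a flag */

    #define GL_NO_ERROR                    0
    #define GL_GUILTY_CONTEXT_RESET_ARB    0x8253
    #define GL_INNOCENT_CONTEXT_RESET_ARB  0x8254
    #define GL_UNKNOWN_CONTEXT_RESET_ARB   0x8255

    /* One possible UMD policy: only report a reset to the application when the
     * context was actually affected (guilty, or its VRAM content is suspect). */
    static unsigned gl_reset_status_from_ctx_flags(unsigned flags)
    {
        if (flags & AMDGPU_CTX_STATE_GUILTY)
            return GL_GUILTY_CONTEXT_RESET_ARB;
        if (flags & AMDGPU_CTX_STATE_VRAMLOST)
            return GL_INNOCENT_CONTEXT_RESET_ARB;
        if (flags & AMDGPU_CTX_STATE_RESET)
            return GL_NO_ERROR; /* a reset happened, but this context was not affected */
        return GL_NO_ERROR;     /* no reset since this context was created */
    }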
>>>
>>> BR Monk
>>>
>>> -----Original Message-----
>>> From: Koenig, Christian
>>> Sent: 12 October 2017 18:13
>>> To: Haehnle, Nicolai <Nicolai.Haehnle at amd.com>; Michel Dänzer <michel at daenzer.net>; Liu, Monk <Monk.Liu at amd.com>; Olsak, Marek <Marek.Olsak at amd.com>; Deucher, Alexander <Alexander.Deucher at amd.com>; Zhou, David(ChunMing) <David1.Zhou at amd.com>; Mao, David <David.Mao at amd.com>
>>> Cc: Ramirez, Alejandro <Alejandro.Ramirez at amd.com>; amd-gfx at lists.freedesktop.org; Filipas, Mario <Mario.Filipas at amd.com>; Ding, Pixel <Pixel.Ding at amd.com>; Li, Bingley <Bingley.Li at amd.com>; Jiang, Jerry (SW) <Jerry.Jiang at amd.com>
>>> Subject: Re: TDR and VRAM lost handling in KMD (v2)
>>>
>>> On 12.10.2017 at 11:44, Nicolai Hähnle wrote:
>>> > On 12.10.2017 11:35, Michel Dänzer wrote:
>>> >> On 12/10/17 11:23 AM, Christian König wrote:
>>> >>> On 12.10.2017 at 11:10, Nicolai Hähnle wrote:
>>> >>>> On 12.10.2017 10:49, Christian König wrote:
>>> >>>>>> However, !guilty && ctx->reset_counter != adev->reset_counter does not imply that the context was lost.
>>> >>>>>>
>>> >>>>>> The way I understand it, we should return AMDGPU_CTX_INNOCENT_RESET if !guilty && ctx->vram_lost_counter != adev->vram_lost_counter.
>>> >>>>>>
>>> >>>>>> As far as I understand it, the case of !guilty && ctx->reset_counter != adev->reset_counter && ctx->vram_lost_counter == adev->vram_lost_counter should return AMDGPU_CTX_NO_RESET, because a GPU reset occurred, but it didn't affect our context.
>>> >>>>> I disagree on that.
>>> >>>>>
>>> >>>>> AMDGPU_CTX_INNOCENT_RESET just means what it does currently: there was a reset but we haven't been causing it.
>>> >>>>>
>>> >>>>> That the OpenGL extension is specified otherwise is unfortunate, but I think we shouldn't use that for the kernel interface here.
>>> >>>> Two counterpoints:
>>> >>>>
>>> >>>> 1. Why should any application care that there was a reset while it was idle? The OpenGL behavior is what makes sense.
>>> >>>
>>> >>> The application is certainly not interested in whether a reset happened or not, but I thought that the driver stack might be.
>>> >>>
>>> >>>> 2. AMDGPU_CTX_INNOCENT_RESET doesn't actually mean anything today because we never return it :)
>>> >>>
>>> >>> Good point.
>>> >>>
>>> >>>> amdgpu_ctx_query only ever returns AMDGPU_CTX_UNKNOWN_RESET, which is in line with the OpenGL spec: we're conservatively returning that a reset happened because we don't know whether the context was affected, and we return UNKNOWN because we also don't know whether the context was guilty or not.
>>> >>>>
>>> >>>> Returning AMDGPU_CTX_NO_RESET in the case of !guilty && ctx->vram_lost_counter == adev->vram_lost_counter is simply a refinement and improvement of the current, overly conservative behavior.
>>> >>>
>>> >>> Ok, let's re-enumerate what I think the different return values should mean:
>>> >>>
>>> >>> * AMDGPU_CTX_GUILTY_RESET
>>> >>>   guilty is set to true for this context.
>>> >>>
>>> >>> * AMDGPU_CTX_INNOCENT_RESET
>>> >>>   guilty is not set and vram_lost_counter has changed.
>>> >>>
>>> >>> * AMDGPU_CTX_UNKNOWN_RESET
>>> >>>   guilty is not set and vram_lost_counter has not changed, but gpu_reset_counter has changed.
>>> >>
>>> >> I don't understand the distinction you're proposing between AMDGPU_CTX_INNOCENT_RESET and AMDGPU_CTX_UNKNOWN_RESET. I think both cases you're describing should return either AMDGPU_CTX_INNOCENT_RESET, if the value of guilty is reliable, or AMDGPU_CTX_UNKNOWN_RESET if it's not.
>>> >
>>> > I think it'd make more sense if it was called "AMDGPU_CTX_UNAFFECTED_RESET".
>>> >
>>> > So:
>>> > - AMDGPU_CTX_GUILTY_RESET --> the context was affected by a reset, and we know that it's the context's fault
>>> > - AMDGPU_CTX_INNOCENT_RESET --> the context was affected by a reset, and we know that it *wasn't* the context's fault (no context job active)
>>> > - AMDGPU_CTX_UNKNOWN_RESET --> the context was affected by a reset, and we don't know who's responsible (this could be returned in the unlikely case where context A's gfx job has not yet finished, but context B's gfx job has already started; it could be the fault of A, it could be the fault of B -- which somehow manages to hang a part of the hardware that then prevents A's job from finishing -- or it could be both; but it's a bit academic)
>>> > - AMDGPU_CTX_UNAFFECTED_RESET --> there was a reset, but this context wasn't affected
>>> >
>>> > This last value would currently just be discarded by Mesa (because we should only disturb applications when we have to), but perhaps somebody else could find it useful?
>>>
>>> Yes, that's what I had in mind as well.
>>>
>>> Cause otherwise we would return AMDGPU_CTX_NO_RESET while there actually was a reset, and that certainly doesn't sound correct to me.
>>>
>>> Regards,
>>> Christian.
>>>
>>> >
>>> > Cheers,
>>> > Nicolai
>>
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx at lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
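(To summarize the classification the thread converges on, a hedged sketch follows: the AMDGPU_CTX_NO_RESET/GUILTY_RESET/INNOCENT_RESET/UNKNOWN_RESET values exist in the amdgpu UAPI, but AMDGPU_CTX_UNAFFECTED_RESET and the helper below only reflect Nicolai's proposal as captured above, not merged code. The "don't know who's responsible" case is left out, as the thread itself calls it academic.)

    #include <stdbool.h>

    /* Existing UAPI values (include/uapi/drm/amdgpu_drm.h). */
    #define AMDGPU_CTX_NO_RESET         0
    #define AMDGPU_CTX_GUILTY_RESET     1
    #define AMDGPU_CTX_INNOCENT_RESET   2
    #define AMDGPU_CTX_UNKNOWN_RESET    3
    /* Proposed in this thread only; not part of the UAPI. */
    #define AMDGPU_CTX_UNAFFECTED_RESET 4

    /* Sketch of the per-context classification discussed above, based on the
     * guilty flag and the reset/vram_lost counters snapshotted at ctx creation. */
    static unsigned classify_ctx(bool guilty,
                                 unsigned ctx_reset_counter,
                                 unsigned adev_reset_counter,
                                 unsigned ctx_vram_lost_counter,
                                 unsigned adev_vram_lost_counter)
    {
        if (ctx_reset_counter == adev_reset_counter)
            return AMDGPU_CTX_NO_RESET;        /* no reset since this context was created */
        if (guilty)
            return AMDGPU_CTX_GUILTY_RESET;    /* a job of this context caused the hang */
        if (ctx_vram_lost_counter != adev_vram_lost_counter)
            return AMDGPU_CTX_INNOCENT_RESET;  /* affected (VRAM lost) but not at fault */
        return AMDGPU_CTX_UNAFFECTED_RESET;    /* a reset happened, but this context was untouched */
    }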