On 12.10.2017 13:49, Christian König wrote: > Am 12.10.2017 um 13:37 schrieb Liu, Monk: >> Hi team >> Very good, many policy and implement are agreed, looks we only have >> some arguments in amdgpu_ctx_query(), well I also confused with the >> current implement of it, â?¹ >> First, I want to know if you guys agree that we*don't update >> ctx->reset_counter in amdgpu_ctx_query() *? because I want to make the >> query result always consistent upon a given context, > > That sounds like a good idea to me, but I'm not sure if it won't break > userspace (I don't think so). Nicolai or Marek need to comment. That should be fine for Mesa. >> Second, I want to say that for KERNEL, it shouldn't use the term from >> MESA or OGL or VULKAN, e.g. kernel shouldn't use AMDGPU_INNOCENT_RESET >> to map to GL_INNOCENT_RESET_ARB, etc... >> Because that way kernel will be very limited to certain UMD, so I >> suggest we totally re-name the context status, and each UMD has its >> own way to map the kernel context's result to gl-context/vk-context/etcâ?¦ > > Yes, completely agree. > >> Kernel should only provide below **FLAG bits** on a given context: >> >> * Define AMDGPU_CTX_STATE_GUILTY 0x1            //as long as TDR >> detects a job hang, KMD set the context behind this context as >> "guilty" >> * Define AMDGPU_CTX_STATE_VRAMLOST        0x2     //as long as >> there is a VRAM lost event hit after this context created, we mark >> this context "VRAM_LOST", so UMD can say that all BO under this >> context may lost their content, since kernel have no relationship >> between context and BO so this is UMD's call to judge if a BO >> considered "VRAM lost" or not. >> * Define AMDGPU_CTX_STATE_RESET  0x3    //as long as there is a >> gpu reset occurred after context creation, this flag shall be set >> > > That sounds sane, but unfortunately might not be possible with the > existing IOCTL. Keep in mind that we need to keep backward compatibility > here. Agreed. FWIW, when Mesa sees an unknown (new) value being returned from amdgpu_ctx_query, it treats it the same as AMDGPU_CTX_NO_RESET. Cheers, Nicolai >> Sample code: >> Int amdgpu_ctx_query(struct amdgpu_ctx_query_parm * out, â?¦..) { >>        if (ctx- >vram_lost_counter != adev->vram_lost_counter) >>                out- >status |= AMDGPU_CTX_STATE_VRAM_LOST; >> if (ctx- >reset_counter != adevâ??reset_counter){ >> out- >status |= AMDGPU_CTX_STATE_RESET; >> if (ctx- >guilty == TRUE) >>                        out- >status |= AMDGPU_CTX_STATE_GUILTY; >> } >> return 0; >> } >> For UMD if it found "out.status == 0" means there is no gpu reset even >> occurred, the context is totally regular, >> >> * *A****new IOCTL added for context:* >> >> Void amdgpu_ctx_reinit(){ >>        Ctxâ??vram_lost_counter = adevâ??vram_lost_counter; >>        Ctxâ??reset_counter = adevâ??reset_counter; >> } > > Mhm, is there any advantage to just creating a new context? > > Regards, > Christian. > >> *if**UMD decide *not* to release the "guilty" context but continue >> using it **after**UMD acknowledged GPU hang **on certain job/context, >> I suggest **UMD call "amdgpu_ctx_reinit()":* >> That way after you re-init() this context, you can get updated result >> from "amdgpu_ctx_query", which will probably give you "out.status == >> 0" as long as no gpu reset/vram lost hit after re-init(). >> BR Monk >> -----Original Message----- >> From: Koenig, Christian >> Sent: 2017å¹´10æ??12æ?¥ 18:13 >> To: Haehnle, Nicolai <Nicolai.Haehnle at amd.com>; Michel Dänzer >> <michel at daenzer.net>; Liu, Monk <Monk.Liu at amd.com>; Olsak, Marek >> <Marek.Olsak at amd.com>; Deucher, Alexander <Alexander.Deucher at amd.com>; >> Zhou, David(ChunMing) <David1.Zhou at amd.com>; Mao, David >> <David.Mao at amd.com> >> Cc: Ramirez, Alejandro <Alejandro.Ramirez at amd.com>; >> amd-gfx at lists.freedesktop.org; Filipas, Mario <Mario.Filipas at amd.com>; >> Ding, Pixel <Pixel.Ding at amd.com>; Li, Bingley <Bingley.Li at amd.com>; >> Jiang, Jerry (SW) <Jerry.Jiang at amd.com> >> Subject: Re: TDR and VRAM lost handling in KMD (v2) >> Am 12.10.2017 um 11:44 schrieb Nicolai Hähnle: >> > On 12.10.2017 11:35, Michel Dänzer wrote: >> >> On 12/10/17 11:23 AM, Christian König wrote: >> >>> Am 12.10.2017 um 11:10 schrieb Nicolai Hähnle: >> >>>> On 12.10.2017 10:49, Christian König wrote: >> >>>>>> However, !guilty && ctx->reset_counter != adev->reset_counter >> >>>>>> does not imply that the context was lost. >> >>>>>> >> >>>>>> The way I understand it, we should return >> >>>>>> AMDGPU_CTX_INNOCENT_RESET if !guilty && ctx->vram_lost_counter != adev->vram_lost_counter. >> >>>>>> >> >>>>>> As far as I understand it, the case of !guilty && >> >>>>>> ctx->reset_counter != adev->reset_counter && >> >>>>>> ctx->vram_lost_counter == >> >>>>>> adev->vram_lost_counter should return AMDGPU_CTX_NO_RESET, >> >>>>>> adev->because a >> >>>>>> GPU reset occurred, but it didn't affect our context. >> >>>>> I disagree on that. >> >>>>> >> >>>>> AMDGPU_CTX_INNOCENT_RESET just means what it does currently, there >> >>>>> was a reset but we haven't been causing it. >> >>>>> >> >>>>> That the OpenGL extension is specified otherwise is unfortunate, >> >>>>> but I think we shouldn't use that for the kernel interface here. >> >>>> Two counterpoints: >> >>>> >> >>>> 1. Why should any application care that there was a reset while it >> >>>> was idle? The OpenGL behavior is what makes sense. >> >>> >> >>> The application is certainly not interest if a reset happened or >> >>> not, but I though that the driver stack might be. >> >>> >> >>>> >> >>>> 2. AMDGPU_CTX_INNOCENT_RESET doesn't actually mean anything today >> >>>> because we never return it :) >> >>>> >> >>> >> >>> Good point. >> >>> >> >>>> amdgpu_ctx_query only ever returns AMDGPU_CTX_UNKNOWN_RESET, which >> >>>> is in line with the OpenGL spec: we're conservatively returning >> >>>> that a reset happened because we don't know whether the context was >> >>>> affected, and we return UNKNOWN because we also don't know whether >> >>>> the context was guilty or not. >> >>>> >> >>>> Returning AMDGPU_CTX_NO_RESET in the case of !guilty && >> >>>> ctx->vram_lost_counter == adev->vram_lost_counter is simply a >> >>>> refinement and improvement of the current, overly conservative >> >>>> behavior. >> >>> >> >>> Ok let's reenumerate what I think the different return values should >> >>> mean: >> >>> >> >>> * AMDGPU_CTX_GUILTY_RESET >> >>> >> >>> guilty is set to true for this context. >> >>> >> >>> * AMDGPU_CTX_INNOCENT_RESET >> >>> >> >>> guilty is not set and vram_lost_counter has changed. >> >>> >> >>> * AMDGPU_CTX_UNKNOWN_RESET >> >>> >> >>> guilty is not set and vram_lost_counter has not changed, but >> >>> gpu_reset_counter has changed. >> >> >> >> I don't understand the distinction you're proposing between >> >> AMDGPU_CTX_INNOCENT_RESET and AMDGPU_CTX_UNKNOWN_RESET. I think both >> >> cases you're describing should return either >> >> AMDGPU_CTX_INNOCENT_RESET, if the value of guilty is reliable, or >> >> AMDGPU_CTX_UNKNOWN_RESET if it's not. >> > >> > I think it'd make more sense if it was called >> > "AMDGPU_CTX_UNAFFECTED_RESET". >> > >> > So: >> > - AMDGPU_CTX_GUILTY_RESET --> the context was affected by a reset, and >> > we know that it's the context's fault >> > - AMDGPU_CTX_INNOCENT_RESET --> the context was affected by a reset, >> > and we know that it *wasn't* the context's fault (no context job >> > active) >> > - AMDGPU_CTX_UNKNOWN_RESET --> the context was affected by a reset, >> > and we don't know who's responsible (this could be returned in the >> > unlikely case where context A's gfx job has not yet finished, but >> > context B's gfx job has already started; it could be the fault of A, >> > it could be the fault of B -- which somehow manages to hang a part of >> > the hardware that then prevents A's job from finishing -- or it could >> > be both; but it's a bit academic) >> > - AMDGPU_CTX_UNAFFECTED_RESET --> there was a reset, but this context >> > wasn't affected >> > >> > This last value would currently just be discarded by Mesa (because we >> > should only disturb applications when we have to), but perhaps >> > somebody else could find it useful? >> Yes, that's what I had in mind as well. >> Cause otherwise we would return AMDGPU_CTX_NO_RESET while there >> actually was a reset and that certainly doesn't sound correct to me. >> Regards, >> Christian. >> > >> > Cheers, >> > Nicolai > >