On 12.10.2017 10:49, Christian König wrote:
>> However, !guilty && ctx->reset_counter != adev->reset_counter does not
>> imply that the context was lost.
>>
>> The way I understand it, we should return AMDGPU_CTX_INNOCENT_RESET if
>> !guilty && ctx->vram_lost_counter != adev->vram_lost_counter.
>>
>> As far as I understand it, the case of !guilty && ctx->reset_counter
>> != adev->reset_counter && ctx->vram_lost_counter ==
>> adev->vram_lost_counter should return AMDGPU_CTX_NO_RESET, because a
>> GPU reset occurred, but it didn't affect our context.
> I disagree on that.
>
> AMDGPU_CTX_INNOCENT_RESET just means what it does currently: there was
> a reset, but we weren't the ones causing it.
>
> That the OpenGL extension is specified otherwise is unfortunate, but I
> think we shouldn't use that for the kernel interface here.

Two counterpoints:

1. Why should any application care that there was a reset while it was
idle? The OpenGL behavior is what makes sense.

2. AMDGPU_CTX_INNOCENT_RESET doesn't actually mean anything today,
because we never return it :)

amdgpu_ctx_query only ever returns AMDGPU_CTX_UNKNOWN_RESET, which is in
line with the OpenGL spec: we conservatively report that a reset
happened because we don't know whether the context was affected, and we
return UNKNOWN because we also don't know whether the context was guilty
or not.

Returning AMDGPU_CTX_NO_RESET in the case of !guilty &&
ctx->vram_lost_counter == adev->vram_lost_counter is simply a refinement
and improvement of the current, overly conservative behavior.

Cheers,
Nicolai

>
> Regards,
> Christian.
>
> On 12.10.2017 at 10:44, Nicolai Hähnle wrote:
>> I think we should stick to the plan where kernel contexts stay "stuck"
>> after a GPU reset. This is the most robust behavior for the kernel.
>>
>> Even if the OpenGL spec says that an OpenGL context can be re-used
>> without destroying and re-creating it, the UMD can take care of
>> re-creating the kernel context.
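For illustration, the refinement argued for above could look roughly like the sketch below. This is a minimal user-space model of the proposed decision logic, not actual kernel code; the enum values, struct layouts, and the helper name query_reset_status() are assumptions chosen to mirror the discussion, not the real amdgpu headers.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical sketch of the proposed amdgpu_ctx_query refinement.
 * Names and types are illustrative, not the real kernel interface. */
enum ctx_reset_status {
    CTX_NO_RESET,
    CTX_GUILTY_RESET,
    CTX_INNOCENT_RESET,
    CTX_UNKNOWN_RESET,
};

struct ctx_state {
    bool guilty;
    unsigned int reset_counter;     /* snapshot taken at context creation */
    unsigned int vram_lost_counter; /* snapshot taken at context creation */
};

struct dev_state {
    unsigned int reset_counter;     /* bumped on every GPU reset */
    unsigned int vram_lost_counter; /* bumped only when VRAM content is lost */
};

static enum ctx_reset_status
query_reset_status(const struct ctx_state *ctx, const struct dev_state *adev)
{
    if (ctx->guilty)
        return CTX_GUILTY_RESET;

    /* The context is actually lost only if VRAM content was lost under it. */
    if (ctx->vram_lost_counter != adev->vram_lost_counter)
        return CTX_INNOCENT_RESET;

    /* A reset may have happened (reset_counter can differ), but this
     * context's VRAM survived, so from the application's point of view
     * nothing was lost. */
    return CTX_NO_RESET;
}
```

The key point the sketch captures: reset_counter alone never decides that a context was lost; only the guilty flag and the vram_lost_counter comparison do.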
>>
>> This means amdgpu_ctx_query should *not* reset ctx->reset_counter.
>>
>> Cheers,
>> Nicolai
>>
>>
>> On 12.10.2017 10:41, Nicolai Hähnle wrote:
>>> Hi Monk,
>>>
>>> Thanks for the summary. Most of it looks good to me, though I can't
>>> speak to all the kernel internals.
>>>
>>> Just some comments:
>>>
>>> On 12.10.2017 10:03, Liu, Monk wrote:
>>>> * For the cs_submit() IOCTL:
>>>>
>>>> 1. Check if the current ctx has been marked "guilty" and return
>>>> "ECANCELED" if so.
>>>>
>>>> 2. Set job->vram_lost_counter from adev->vram_lost_counter, and
>>>> return "ECANCELED" if ctx->vram_lost_counter !=
>>>> job->vram_lost_counter (Christian already submitted this patch).
>>>>
>>>> a) Discussion: can we return "ENODEV" if vram_lost_counter
>>>> mismatches? That way UMD knows this context is under "device lost".
>>>
>>> My plan for UMD is to always query the VRAM lost counter when any
>>> kind of context lost situation is detected. So cs_submit() should
>>> return an error in this situation, but it could just be ECANCELED. We
>>> don't need to distinguish between different types of errors here.
>>>
>>>
>>>> * Introduce a new IOCTL to let UMD query the latest
>>>> adev->vram_lost_counter:
>>>
>>> Christian already sent a patch for this.
>>>
>>>
>>>> * For amdgpu_ctx_query():
>>>>
>>>> - Don't update ctx->reset_counter when querying this function,
>>>> otherwise the query result is not consistent.
>>>
>>> Hmm. I misremembered part of the spec, see below.
>>>
>>>
>>>> - Set out->state.reset_status to "AMDGPU_CTX_GUILTY_RESET" if the
>>>> ctx is "guilty"; no need to check "ctx->reset_counter".
>>>
>>> Agreed.
>>>
>>>
>>>> - Set out->state.reset_status to "AMDGPU_CTX_INNOCENT_RESET" if the
>>>> ctx isn't "guilty" && ctx->reset_counter != adev->reset_counter.
>>>
>>> I disagree. The meaning of AMDGPU_CTX_*_RESET should reflect the
>>> corresponding enums in user space APIs.
>>> I don't know how it works in
>>> Vulkan, but in OpenGL, returning GL_INNOCENT_CONTEXT_RESET_ARB means
>>> that the context was lost.
>>>
>>> However, !guilty && ctx->reset_counter != adev->reset_counter does
>>> not imply that the context was lost.
>>>
>>> The way I understand it, we should return AMDGPU_CTX_INNOCENT_RESET
>>> if !guilty && ctx->vram_lost_counter != adev->vram_lost_counter.
>>>
>>> As far as I understand it, the case of !guilty && ctx->reset_counter
>>> != adev->reset_counter && ctx->vram_lost_counter ==
>>> adev->vram_lost_counter should return AMDGPU_CTX_NO_RESET, because a
>>> GPU reset occurred, but it didn't affect our context.
>>>
>>> I unfortunately noticed another subtlety while re-reading the OpenGL
>>> spec. OpenGL says that the OpenGL context itself does *not* have to
>>> be re-created in order to recover from the reset. Re-creating all
>>> objects in the context is sufficient.
>>>
>>> I believe this is the original motivation for why amdgpu_ctx_query()
>>> will reset ctx->reset_counter.
>>>
>>> For Mesa, it's still okay if the kernel keeps blocking submissions,
>>> as we can just recreate the kernel context. But OrcaGL is also
>>> affected.
>>>
>>> Does anybody know off-hand where the relevant parts of the Vulkan
>>> spec are? I didn't actually find anything in a quick search.
>>>
>>>
>>> [snip]
>>>> For UMD behavior we still have something to consider:
>>>>
>>>> If Mesa creates a new context from an old context (share list? I'm
>>>> not familiar with UMD; David Mao shall discuss it with Nicolai),
>>>> the newly created context's vram_lost_counter
>>>> and reset_counter shall both be ported over from that old context,
>>>> otherwise CS_SUBMIT will not block it, which isn't correct.
>>>
>>> The kernel doesn't have to do anything for this; it is entirely the
>>> UMD's responsibility. All UMD needs from KMD is the function for
>>> querying the vram_lost_counter.
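The share-list point above boils down to carrying the cached counters over when a context is derived from another one. A minimal user-space sketch, assuming hypothetical names (struct umd_ctx, create_shared_ctx(), ctx_submissions_allowed() are illustrative, not real Mesa or libdrm API):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical UMD-side model: a context caches the vram_lost_counter
 * it was created under, and a context created from a share list inherits
 * the old context's cached value instead of re-querying, so it is
 * blocked from submitting exactly when the old one would be. */
struct umd_ctx {
    unsigned int vram_lost_counter; /* cached at context creation */
};

static bool ctx_submissions_allowed(const struct umd_ctx *ctx,
                                    unsigned int current_vram_lost)
{
    /* A differing counter means VRAM was lost under this context. */
    return ctx->vram_lost_counter == current_vram_lost;
}

static struct umd_ctx create_shared_ctx(const struct umd_ctx *old)
{
    /* Port the counter from the old context rather than taking a fresh
     * snapshot; otherwise a context sharing lost resources would be
     * allowed to submit. */
    struct umd_ctx new_ctx = { .vram_lost_counter = old->vram_lost_counter };
    return new_ctx;
}
```

This is purely a UMD bookkeeping concern, matching the reply above: the kernel only needs to provide the query for the current counter value.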
>>>
>>> Cheers,
>>> Nicolai
>>>
>>>
>>>>
>>>> Need your feedback, thx
>>>>
>>>> *From:* amd-gfx [mailto:amd-gfx-bounces at lists.freedesktop.org]
>>>> *On Behalf Of* Liu, Monk
>>>> *Sent:* 2017-10-11 13:34
>>>> *To:* Koenig, Christian <Christian.Koenig at amd.com>; Haehnle, Nicolai
>>>> <Nicolai.Haehnle at amd.com>; Olsak, Marek <Marek.Olsak at amd.com>;
>>>> Deucher, Alexander <Alexander.Deucher at amd.com>
>>>> *Cc:* Ramirez, Alejandro <Alejandro.Ramirez at amd.com>;
>>>> amd-gfx at lists.freedesktop.org; Filipas, Mario
>>>> <Mario.Filipas at amd.com>; Ding, Pixel <Pixel.Ding at amd.com>; Li,
>>>> Bingley <Bingley.Li at amd.com>; Jiang, Jerry (SW) <Jerry.Jiang at amd.com>
>>>> *Subject:* TDR and VRAM lost handling in KMD:
>>>>
>>>> Hi Christian & Nicolai,
>>>>
>>>> We need to reach agreement on what MESA/UMD should do and what KMD
>>>> should do. *Please give your comments with "okay" or "no" and your
>>>> ideas on the items below.*
>>>>
>>>> * When a job times out (set from the lockup_timeout kernel
>>>> parameter), what KMD should do in the TDR routine:
>>>>
>>>> 1. Update adev->gpu_reset_counter and stop the scheduler first
>>>> (gpu_reset_counter is used to force a VM flush after GPU reset;
>>>> that's out of this thread's scope, so no more discussion on it).
>>>>
>>>> 2. Set the job's fence error status to "ETIME".
>>>>
>>>> 3. Find the entity/ctx behind this job, and mark this ctx as
>>>> "guilty".
>>>>
>>>> 4. Kick this job out of the scheduler's mirror list, so it won't
>>>> get re-scheduled to the ring anymore.
>>>>
>>>> 5. Kick out all jobs in this "guilty" ctx's KFIFO queue, and set
>>>> all their fence status to "ECANCELED".
>>>>
>>>> 6. Force-signal all fences that were kicked out by the above two
>>>> steps, *otherwise UMD will block forever if waiting on those
>>>> fences*.
>>>>
>>>> 7. Do the GPU reset, which can be a set of callbacks so that
>>>> bare-metal and SR-IOV can each implement it in their own style.
>>>>
>>>> 8. After reset, KMD needs to be aware of whether VRAM loss happened
>>>> or not. Bare-metal can implement some function to judge, while for
>>>> SR-IOV I prefer to read it from the GIM side (for the initial
>>>> version we consider it always VRAM lost, until the GIM side change
>>>> is aligned).
>>>>
>>>> 9. If VRAM loss was not hit, continue; otherwise:
>>>>
>>>> a) Update adev->vram_lost_counter.
>>>>
>>>> b) Iterate over all living ctx, and mark all of them as "guilty",
>>>> since VRAM loss actually ruins all VRAM contents.
>>>>
>>>> c) Kick out all jobs in all ctx's KFIFO queues, and set all their
>>>> fence status to "ECANCELED".
>>>>
>>>> 10. Do GTT recovery and VRAM page tables/entries recovery
>>>> (optional; do we need it?).
>>>>
>>>> 11. Re-schedule all jobs remaining in the mirror list to the ring
>>>> again and restart the scheduler (for the VRAM lost case, no job
>>>> will be re-scheduled).
>>>>
>>>> * For the cs_wait() IOCTL:
>>>>
>>>> After it finds the fence signaled, it should check with
>>>> "dma_fence_get_status" to see if there is an error there,
>>>> and return the error status of the fence.
>>>>
>>>> * For the cs_wait_fences() IOCTL:
>>>>
>>>> Similar to the above approach.
>>>>
>>>> * For the cs_submit() IOCTL:
>>>>
>>>> It needs to check if the current ctx has been marked "guilty" and
>>>> return "ECANCELED" if so.
>>>>
>>>> * Introduce a new IOCTL to let UMD query vram_lost_counter:
>>>>
>>>> This way, UMD can also block apps from submitting. Like @Nicolai
>>>> mentioned, we can cache one copy of vram_lost_counter when
>>>> enumerating the physical device, and deny all
>>>> gl-contexts from submitting if the queried counter is bigger than
>>>> the one cached in the physical device.
>>>> (Looks a little overkill to me, but easy to implement.)
>>>>
>>>> UMD can also return an error to the app when creating a gl-context
>>>> if the currently queried vram_lost_counter is bigger than the one
>>>> cached in the physical device.
>>>>
>>>> BTW: I realized that a gl-context is a little different from the
>>>> kernel's context, because for the kernel a BO is not related to a
>>>> context but only to an FD, while in UMD a BO has a backing
>>>> gl-context. So blocking submission in the UMD layer is also needed,
>>>> although KMD will do its job as the bottom line.
>>>>
>>>> * Basically, "vram_lost_counter" is exposed by the kernel to let
>>>> UMD take control of the robustness extension feature. It will be
>>>> UMD's call to make; KMD only denies "guilty" contexts from
>>>> submitting.
>>>>
>>>> Need your feedback, thx
>>>>
>>>> We'd better make the TDR feature land ASAP.
>>>>
>>>> BR Monk
>>>>
>>>
>>
>
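The UMD-side gating described above (cache the counter at physical-device enumeration, then refuse submissions and new gl-context creation once a fresh query returns a larger value) can be sketched as follows. This is a hypothetical model; struct phys_device and device_still_usable() are illustrative names, and the fresh query stands in for the new KMD IOCTL being discussed.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical UMD policy sketch: vram_lost_counter is cached once when
 * the physical device is enumerated; a larger value from a later query
 * means VRAM was lost since enumeration, so the device is treated as
 * lost for both submission and gl-context creation. */
struct phys_device {
    unsigned int cached_vram_lost_counter; /* snapshot at enumeration */
};

static bool device_still_usable(const struct phys_device *dev,
                                unsigned int queried_vram_lost)
{
    /* Counters only ever increase, so "bigger than cached" is the
     * "VRAM was lost after enumeration" condition from the mail. */
    return queried_vram_lost <= dev->cached_vram_lost_counter;
}
```

With this gate in the UMD layer, the KMD check remains the bottom line, as the mail puts it: even if a stale UMD lets a submission through, the kernel still rejects it via the guilty/counter checks.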