On 12.10.2017 11:10, Nicolai Hähnle wrote:
> On 12.10.2017 10:49, Christian König wrote:
>>> However, !guilty && ctx->reset_counter != adev->reset_counter does
>>> not imply that the context was lost.
>>>
>>> The way I understand it, we should return AMDGPU_CTX_INNOCENT_RESET
>>> if !guilty && ctx->vram_lost_counter != adev->vram_lost_counter.
>>>
>>> As far as I understand it, the case of !guilty && ctx->reset_counter
>>> != adev->reset_counter && ctx->vram_lost_counter ==
>>> adev->vram_lost_counter should return AMDGPU_CTX_NO_RESET, because a
>>> GPU reset occurred, but it didn't affect our context.
>> I disagree on that.
>>
>> AMDGPU_CTX_INNOCENT_RESET just means what it does currently: there
>> was a reset, but we weren't the cause of it.
>>
>> That the OpenGL extension is specified otherwise is unfortunate, but
>> I think we shouldn't use that for the kernel interface here.
> Two counterpoints:
>
> 1. Why should any application care that there was a reset while it
> was idle? The OpenGL behavior is what makes sense.

The application is certainly not interested in whether a reset happened
or not, but I thought that the driver stack might be.

> 2. AMDGPU_CTX_INNOCENT_RESET doesn't actually mean anything today
> because we never return it :)

Good point.

> amdgpu_ctx_query only ever returns AMDGPU_CTX_UNKNOWN_RESET, which is
> in line with the OpenGL spec: we're conservatively returning that a
> reset happened because we don't know whether the context was affected,
> and we return UNKNOWN because we also don't know whether the context
> was guilty or not.
>
> Returning AMDGPU_CTX_NO_RESET in the case of !guilty &&
> ctx->vram_lost_counter == adev->vram_lost_counter is simply a
> refinement and improvement of the current, overly conservative
> behavior.

OK, let's re-enumerate what I think the different return values should
mean:

* AMDGPU_CTX_GUILTY_RESET: guilty is set to true for this context.

* AMDGPU_CTX_INNOCENT_RESET: guilty is not set and vram_lost_counter
has changed.

* AMDGPU_CTX_UNKNOWN_RESET: guilty is not set and vram_lost_counter has
not changed, but gpu_reset_counter has changed.

* AMDGPU_CTX_NO_RESET: there wasn't a reset, i.e. neither guilty is
set, nor has gpu_reset_counter or vram_lost_counter changed.

If we get to the point where we are certain that a GPU reset without
losing VRAM is harmless, we can drop using gpu_reset_counter and also
return AMDGPU_CTX_NO_RESET when we are certain that the application
didn't do anything nasty. But until then I would rather say we should
keep this.
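
In code, the query would then boil down to something like the
following. Completely untested sketch, just to illustrate the idea; I'm
assuming a ctx->guilty flag and atomic counters here, the actual field
names and types may well end up looking different:

    static uint32_t amdgpu_ctx_reset_status(struct amdgpu_device *adev,
                                            struct amdgpu_ctx *ctx)
    {
            /* Guilty always wins, this context caused the reset. */
            if (ctx->guilty)
                    return AMDGPU_CTX_GUILTY_RESET;

            /* VRAM contents are gone, so the context is lost even
             * though it is innocent. */
            if (ctx->vram_lost_counter !=
                atomic_read(&adev->vram_lost_counter))
                    return AMDGPU_CTX_INNOCENT_RESET;

            /* A reset happened, but we can't tell whether it
             * affected this context. */
            if (ctx->reset_counter !=
                atomic_read(&adev->gpu_reset_counter))
                    return AMDGPU_CTX_UNKNOWN_RESET;

            return AMDGPU_CTX_NO_RESET;
    }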
Regards,
Christian.

> Cheers,
> Nicolai
>
>> Regards,
>> Christian.
>>
>> On 12.10.2017 10:44, Nicolai Hähnle wrote:
>>> I think we should stick to the plan where kernel contexts stay
>>> "stuck" after a GPU reset. This is the most robust behavior for the
>>> kernel.
>>>
>>> Even if the OpenGL spec says that an OpenGL context can be re-used
>>> without destroying and re-creating it, the UMD can take care of
>>> re-creating the kernel context.
>>>
>>> This means amdgpu_ctx_query should *not* reset ctx->reset_counter.
>>>
>>> Cheers,
>>> Nicolai
>>>
>>> On 12.10.2017 10:41, Nicolai Hähnle wrote:
>>>> Hi Monk,
>>>>
>>>> Thanks for the summary. Most of it looks good to me, though I
>>>> can't speak to all the kernel internals.
>>>>
>>>> Just some comments:
>>>>
>>>> On 12.10.2017 10:03, Liu, Monk wrote:
>>>>> - For the cs_submit() IOCTL:
>>>>>
>>>>> 1. Check if the current ctx has been marked "guilty" and return
>>>>> "ECANCELED" if so.
>>>>>
>>>>> 2. Set job->vram_lost_counter from adev->vram_lost_counter, and
>>>>> return "ECANCELED" if ctx->vram_lost_counter !=
>>>>> job->vram_lost_counter (Christian already submitted this patch).
>>>>>
>>>>> a) Discussion: can we return "ENODEV" if vram_lost_counter
>>>>> mismatches? That way UMD knows this context is under "device
>>>>> lost".
>>>>
>>>> My plan for UMD is to always query the VRAM lost counter when any
>>>> kind of context lost situation is detected. So cs_submit() should
>>>> return an error in this situation, but it could just be ECANCELED.
>>>> We don't need to distinguish between different types of errors
>>>> here.
>>>>
>>>>> - Introduce a new IOCTL to let UMD query the latest
>>>>> adev->vram_lost_counter:
>>>>
>>>> Christian already sent a patch for this.
>>>>
>>>>> - For amdgpu_ctx_query():
>>>>>
>>>>> * Don't update ctx->reset_counter when querying this function,
>>>>> otherwise the query result is not consistent.
>>>>
>>>> Hmm. I misremembered part of the spec, see below.
>>>>
>>>>> * Set out->state.reset_status to "AMDGPU_CTX_GUILTY_RESET" if the
>>>>> ctx is "guilty"; no need to check "ctx->reset_counter".
>>>>
>>>> Agreed.
>>>>
>>>>> * Set out->state.reset_status to "AMDGPU_CTX_INNOCENT_RESET" if
>>>>> the ctx isn't "guilty" && ctx->reset_counter !=
>>>>> adev->reset_counter.
>>>>
>>>> I disagree. The meaning of AMDGPU_CTX_*_RESET should reflect the
>>>> corresponding enums in user space APIs. I don't know how it works
>>>> in Vulkan, but in OpenGL, returning GL_INNOCENT_CONTEXT_RESET_ARB
>>>> means that the context was lost.
>>>>
>>>> However, !guilty && ctx->reset_counter != adev->reset_counter does
>>>> not imply that the context was lost.
>>>>
>>>> The way I understand it, we should return
>>>> AMDGPU_CTX_INNOCENT_RESET if !guilty && ctx->vram_lost_counter !=
>>>> adev->vram_lost_counter.
>>>>
>>>> As far as I understand it, the case of !guilty &&
>>>> ctx->reset_counter != adev->reset_counter &&
>>>> ctx->vram_lost_counter == adev->vram_lost_counter should return
>>>> AMDGPU_CTX_NO_RESET, because a GPU reset occurred, but it didn't
>>>> affect our context.
>>>>
>>>> I unfortunately noticed another subtlety while re-reading the
>>>> OpenGL spec. OpenGL says that the OpenGL context itself does *not*
>>>> have to be re-created in order to recover from the reset.
>>>> Re-creating all objects in the context is sufficient.
>>>>
>>>> I believe this is the original motivation for why
>>>> amdgpu_ctx_query() will reset the ctx->reset_counter.
>>>>
>>>> For Mesa, it's still okay if the kernel keeps blocking submissions,
>>>> as we can just recreate the kernel context. But OrcaGL is also
>>>> affected.
>>>>
>>>> Does anybody know off-hand where the relevant parts of the Vulkan
>>>> spec are? I didn't actually find anything in a quick search.
>>>>
>>>> [snip]
>>>>> For UMD behavior we still have something to consider:
>>>>>
>>>>> If MESA creates a new context from an old context (share list?
>>>>> I'm not familiar with UMD; David Mao shall have some discussion
>>>>> on it with Nicolai), the newly created context's vram_lost_counter
>>>>> and reset_counter shall all be ported over from that old context,
>>>>> otherwise CS_SUBMIT will not block it, which isn't correct.
>>>>
>>>> The kernel doesn't have to do anything for this, it is entirely
>>>> the UMD's responsibility. All UMD needs from KMD is the function
>>>> for querying the vram_lost_counter.
>>>>
>>>> Cheers,
>>>> Nicolai
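
As an illustration of that UMD side, the check could look roughly like
the sketch below. This is hypothetical: it assumes the new INFO query
ends up being called AMDGPU_INFO_VRAM_LOST_COUNTER and is reached
through libdrm's amdgpu_query_info(); the umd_ctx struct and the helper
are made up for illustration:

    #include <stdbool.h>
    #include <stdint.h>
    #include <amdgpu.h>
    #include <amdgpu_drm.h>

    struct umd_ctx {
            amdgpu_context_handle handle;
            /* Sampled at context creation, or inherited from the
             * share-list context this one was created from. */
            uint32_t vram_lost_counter;
    };

    static bool umd_ctx_usable(amdgpu_device_handle dev,
                               struct umd_ctx *ctx)
    {
            uint32_t now = 0;

            if (amdgpu_query_info(dev, AMDGPU_INFO_VRAM_LOST_COUNTER,
                                  sizeof(now), &now))
                    return false; /* be conservative if the query fails */

            return ctx->vram_lost_counter == now;
    }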
>>>>>
>>>>> Need your feedback, thx
>>>>>
>>>>> *From:* amd-gfx [mailto:amd-gfx-bounces at lists.freedesktop.org]
>>>>> *On Behalf Of* Liu, Monk
>>>>> *Sent:* 2017-10-11 13:34
>>>>> *To:* Koenig, Christian <Christian.Koenig at amd.com>; Haehnle,
>>>>> Nicolai <Nicolai.Haehnle at amd.com>; Olsak, Marek
>>>>> <Marek.Olsak at amd.com>; Deucher, Alexander
>>>>> <Alexander.Deucher at amd.com>
>>>>> *Cc:* Ramirez, Alejandro <Alejandro.Ramirez at amd.com>;
>>>>> amd-gfx at lists.freedesktop.org; Filipas, Mario
>>>>> <Mario.Filipas at amd.com>; Ding, Pixel <Pixel.Ding at amd.com>; Li,
>>>>> Bingley <Bingley.Li at amd.com>; Jiang, Jerry (SW)
>>>>> <Jerry.Jiang at amd.com>
>>>>> *Subject:* TDR and VRAM lost handling in KMD:
>>>>>
>>>>> Hi Christian & Nicolai,
>>>>>
>>>>> We need to reach agreement on what MESA/UMD should do and what
>>>>> KMD should do. Please give your comments with "okay" or "no", and
>>>>> your ideas, on the items below.
>>>>>
>>>>> - When a job times out (set from the lockup_timeout kernel
>>>>> parameter), what KMD should do in the TDR routine:
>>>>>
>>>>> 1. Update adev->gpu_reset_counter and stop the scheduler first
>>>>> (gpu_reset_counter is used to force a VM flush after GPU reset;
>>>>> it is out of this thread's scope, so no more discussion on it).
>>>>>
>>>>> 2. Set its fence error status to "ETIME".
>>>>>
>>>>> 3. Find the entity/ctx behind this job, and mark this ctx as
>>>>> "guilty".
>>>>>
>>>>> 4. Kick this job out of the scheduler's mirror list, so it won't
>>>>> get re-scheduled to the ring anymore.
>>>>>
>>>>> 5. Kick out all jobs in this "guilty" ctx's KFIFO queue, and set
>>>>> all their fence status to "ECANCELED".
>>>>>
>>>>> 6. Force-signal all fences that got kicked out by the above two
>>>>> steps, otherwise UMD will block forever if it waits on those
>>>>> fences.
>>>>>
>>>>> 7. Do the GPU reset, which can be some callbacks so that
>>>>> bare-metal and SR-IOV can each implement it in their favored
>>>>> style.
>>>>>
>>>>> 8. After the reset, KMD needs to be aware of whether VRAM was
>>>>> lost or not. Bare-metal can implement some function to judge
>>>>> this, while for SR-IOV I prefer to read it from the GIM side (for
>>>>> the initial version we consider VRAM always lost, until the GIM
>>>>> side change is aligned).
>>>>>
>>>>> 9. If VRAM loss was not hit, continue; otherwise:
>>>>>
>>>>> a) Update adev->vram_lost_counter.
>>>>>
>>>>> b) Iterate over all living ctx, and mark them all as "guilty",
>>>>> since VRAM loss actually ruins all VRAM contents.
>>>>>
>>>>> c) Kick out all jobs in all ctx's KFIFO queues, and set all their
>>>>> fence status to "ECANCELED".
>>>>>
>>>>> 10. Do GTT recovery and VRAM page tables/entries recovery
>>>>> (optional, do we need it?).
>>>>>
>>>>> 11. Re-schedule all jobs remaining in the mirror list to the ring
>>>>> again and restart the scheduler (for the VRAM lost case, no job
>>>>> will be re-scheduled).
>>>>>
>>>>> - For the cs_wait() IOCTL:
>>>>>
>>>>> After it finds the fence signaled, it should check with
>>>>> "dma_fence_get_status" to see if there is an error there, and
>>>>> return the error status of the fence.
>>>>>
>>>>> - For the cs_wait_fences() IOCTL:
>>>>>
>>>>> Similar to the above approach.
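
For what it's worth, the cs_wait() change described above could look
roughly like the following untested sketch. dma_fence_wait_timeout()
and dma_fence_get_status() are the existing fence APIs; the wrapper
name and the surrounding code are made up:

    #include <linux/dma-fence.h>

    static long wait_and_check_fence(struct dma_fence *fence,
                                     long timeout)
    {
            long r = dma_fence_wait_timeout(fence, true, timeout);

            if (r > 0) {
                    /* The fence signaled; check whether an error such
                     * as -ETIME or -ECANCELED was set on it by the TDR
                     * code and propagate that instead of success. */
                    int status = dma_fence_get_status(fence);

                    if (status < 0)
                            r = status;
            }

            return r;
    }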
>>>>>
>>>>> - For the cs_submit() IOCTL:
>>>>>
>>>>> It needs to check whether the current ctx has been marked
>>>>> "guilty" and return "ECANCELED" if so.
>>>>>
>>>>> - Introduce a new IOCTL to let UMD query vram_lost_counter:
>>>>>
>>>>> This way UMD can also block the app from submitting. Like
>>>>> @Nicolai mentioned, we can cache one copy of vram_lost_counter
>>>>> when enumerating the physical device, and deny all gl-contexts
>>>>> from submitting if the queried counter is bigger than the one
>>>>> cached in the physical device (looks a little overkill to me, but
>>>>> easy to implement).
>>>>>
>>>>> UMD can also return an error to the app when creating a
>>>>> gl-context if the currently queried vram_lost_counter is bigger
>>>>> than the one cached in the physical device.
>>>>>
>>>>> BTW: I realized that a gl-context is a little different from the
>>>>> kernel's context, because for the kernel a BO is not related to a
>>>>> context but only to an FD, while in UMD a BO has a backing
>>>>> gl-context. So blocking submission in the UMD layer is also
>>>>> needed, although KMD will do its job as the bottom line.
>>>>>
>>>>> - Basically, "vram_lost_counter" is exposed by the kernel to let
>>>>> UMD take control of the robustness extension feature. It will be
>>>>> UMD's call to make; KMD only denies "guilty" contexts from
>>>>> submitting.
>>>>>
>>>>> Need your feedback, thx
>>>>>
>>>>> We'd better get the TDR feature landed ASAP.
>>>>>
>>>>> BR Monk
>>>>>
>>>>
>>>
>>
>