I think we should stick to the plan where kernel contexts stay "stuck"
after a GPU reset. This is the most robust behavior for the kernel. Even
if the OpenGL spec says that an OpenGL context can be re-used without
destroying and re-creating it, the UMD can take care of re-creating the
kernel context. This means amdgpu_ctx_query should *not* reset
ctx->reset_counter.

Cheers,
Nicolai

On 12.10.2017 10:41, Nicolai Hähnle wrote:
> Hi Monk,
>
> Thanks for the summary. Most of it looks good to me, though I can't
> speak to all the kernel internals.
>
> Just some comments:
>
> On 12.10.2017 10:03, Liu, Monk wrote:
>> * For the cs_submit() IOCTL:
>>
>> 1. Check whether the current ctx has been marked "*guilty*" and
>> return "*ECANCELED*" if so.
>>
>> 2. Set job->*vram_lost_counter* to adev->*vram_lost_counter*, and
>> return "*ECANCELED*" if ctx->*vram_lost_counter* !=
>> job->*vram_lost_counter* (Christian already submitted this patch).
>>
>> a) Discussion: can we return "ENODEV" if vram_lost_counter
>> mismatches? That way the UMD knows this context is in a "device
>> lost" state.
>
> My plan for the UMD is to always query the VRAM lost counter when any
> kind of context lost situation is detected. So cs_submit() should
> return an error in this situation, but it could just be ECANCELED. We
> don't need to distinguish between different types of errors here.
>
>
>> * Introduce a new IOCTL to let the UMD query the latest
>> adev->*vram_lost_counter*:
>
> Christian already sent a patch for this.
>
>
>> * For amdgpu_ctx_query():
>>
>> - *Don't update ctx->reset_counter when querying this function,
>> otherwise the query result is not consistent.*
>
> Hmm. I misremembered part of the spec, see below.
>
>
>> - Set out->state.reset_status to "AMDGPU_CTX_GUILTY_RESET" if the
>> ctx is "*guilty*"; no need to check "ctx->reset_counter".
>
> Agreed.
>
>
>> - Set out->state.reset_status to "AMDGPU_CTX_INNOCENT_RESET" *if the
>> ctx isn't "guilty" && ctx->reset_counter != adev->reset_counter.*
>
> I disagree. The meaning of AMDGPU_CTX_*_RESET should reflect the
> corresponding enums in user space APIs. I don't know how it works in
> Vulkan, but in OpenGL, returning GL_INNOCENT_CONTEXT_RESET_ARB means
> that the context was lost.
>
> However, !guilty && ctx->reset_counter != adev->reset_counter does not
> imply that the context was lost.
>
> The way I understand it, we should return AMDGPU_CTX_INNOCENT_RESET if
> !guilty && ctx->vram_lost_counter != adev->vram_lost_counter.
>
> As far as I understand it, the case of !guilty && ctx->reset_counter !=
> adev->reset_counter && ctx->vram_lost_counter == adev->vram_lost_counter
> should return AMDGPU_CTX_NO_RESET, because a GPU reset occurred, but it
> didn't affect our context.
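>
> To make that concrete, the decision logic I have in mind is roughly
> the following sketch. The "guilty" flag stands for whatever
> per-context marker we end up storing; it is not an existing field
> today:
>
>     if (ctx->guilty)
>             out->state.reset_status = AMDGPU_CTX_GUILTY_RESET;
>     else if (ctx->vram_lost_counter != adev->vram_lost_counter)
>             out->state.reset_status = AMDGPU_CTX_INNOCENT_RESET;
>     else
>             /* a reset may have happened, but it didn't affect us */
>             out->state.reset_status = AMDGPU_CTX_NO_RESET;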
>
> I unfortunately noticed another subtlety while re-reading the OpenGL
> spec. OpenGL says that the OpenGL context itself does *not* have to be
> re-created in order to recover from the reset. Re-creating all objects
> in the context is sufficient.
>
> I believe this is the original motivation for why amdgpu_ctx_query()
> will reset ctx->reset_counter.
>
> For Mesa, it's still okay if the kernel keeps blocking submissions, as
> we can just recreate the kernel context. But OrcaGL is also affected.
>
> Does anybody know off-hand where the relevant parts of the Vulkan spec
> are? I didn't actually find anything in a quick search.
>
>
> [snip]
>> For UMD behavior, we still have something to consider:
>>
>> If Mesa creates a new context from an old context (share list? I'm
>> not familiar with the UMD; David Mao shall have some discussion on it
>> with Nicolai), the newly created context's vram_lost_counter and
>> reset_counter shall both be ported over from that old context,
>> otherwise CS_SUBMIT will not block it, which isn't correct.
>
> The kernel doesn't have to do anything for this, it is entirely the
> UMD's responsibility. All the UMD needs from the KMD is the function
> for querying the vram_lost_counter.
>
> Cheers,
> Nicolai
>
>
>>
>> Need your feedback, thx
>>
>> *From:* amd-gfx [mailto:amd-gfx-bounces at lists.freedesktop.org] *On
>> Behalf Of* Liu, Monk
>> *Sent:* October 11, 2017 13:34
>> *To:* Koenig, Christian <Christian.Koenig at amd.com>; Haehnle, Nicolai
>> <Nicolai.Haehnle at amd.com>; Olsak, Marek <Marek.Olsak at amd.com>;
>> Deucher, Alexander <Alexander.Deucher at amd.com>
>> *Cc:* Ramirez, Alejandro <Alejandro.Ramirez at amd.com>;
>> amd-gfx at lists.freedesktop.org; Filipas, Mario <Mario.Filipas at amd.com>;
>> Ding, Pixel <Pixel.Ding at amd.com>; Li, Bingley <Bingley.Li at amd.com>;
>> Jiang, Jerry (SW) <Jerry.Jiang at amd.com>
>> *Subject:* TDR and VRAM lost handling in KMD:
>>
>> Hi Christian & Nicolai,
>>
>> We need to reach agreement on what MESA/UMD should do and what the
>> KMD should do. *Please give your comments with "okay" or "No" and
>> your ideas on the items below.*
>>
>> * When a job times out (set from the lockup_timeout kernel
>> parameter), what the KMD should do in the TDR routine:
>>
>> 1. Update adev->*gpu_reset_counter* and stop the scheduler first
>> (*gpu_reset_counter* is used to force a VM flush after GPU reset;
>> that is out of this thread's scope, so no more discussion on it).
>>
>> 2. Set the job's fence error status to "*ETIME*".
>>
>> 3. Find the entity/ctx behind this job, and mark this ctx as
>> "*guilty*".
>>
>> 4. Kick this job out of the scheduler's mirror list, so it won't get
>> re-scheduled to the ring anymore.
>>
>> 5. Kick out all jobs in this "guilty" ctx's KFIFO queue, and set all
>> their fence statuses to "*ECANCELED*".
>>
>> *6.* Force-signal all fences that got kicked out by the above two
>> steps, *otherwise the UMD will block forever if waiting on those
>> fences.*
>>
>> 7. Do the GPU reset, which can be some callbacks so that bare-metal
>> and SR-IOV can each implement it in their preferred style.
>>
>> 8. After the reset, the KMD needs to know whether VRAM loss happened
>> or not. Bare-metal can implement some function to judge, while for
>> SR-IOV I prefer to read it from the GIM side (for the initial version
>> we consider it always VRAM lost, till the GIM side change is aligned).
>>
>> 9. If VRAM loss was not hit, continue; otherwise (see the sketch
>> after this list):
>>
>> a) Update adev->*vram_lost_counter*.
>>
>> b) Iterate over all living ctx, and mark all of them as "*guilty*",
>> since VRAM loss actually ruins all VRAM contents.
>>
>> c) Kick out all jobs in all ctx's KFIFO queues, and set all their
>> fence statuses to "*ECANCELED*".
>>
>> 10. Do GTT recovery and VRAM page tables/entries recovery (optional;
>> do we need it?).
>>
>> 11. Re-schedule all jobs remaining in the mirror list to the ring
>> again and restart the scheduler (for the VRAM lost case, no job will
>> be re-scheduled).
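>>
>> In pseudo code, the VRAM lost branch (steps 8/9) would look roughly
>> like below. Note that amdgpu_vram_lost(), adev->ctx_list and
>> amdgpu_ctx_cancel_jobs() are illustrative names for this sketch, not
>> existing amdgpu functions, and I assume vram_lost_counter is an
>> atomic counter:
>>
>>     struct amdgpu_ctx *ctx;
>>
>>     /* after the actual GPU reset (step 7) */
>>     if (amdgpu_vram_lost(adev)) {  /* bare-metal check or GIM query */
>>             atomic_inc(&adev->vram_lost_counter);            /* 9a */
>>             list_for_each_entry(ctx, &adev->ctx_list, list) {
>>                     ctx->guilty = true;                      /* 9b */
>>                     amdgpu_ctx_cancel_jobs(ctx, -ECANCELED); /* 9c */
>>             }
>>     }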
>>
>> * For the cs_wait() IOCTL:
>>
>> After it finds the fence signaled, it should check with
>> *"dma_fence_get_status"* to see if there is an error there, and
>> return the error status of the fence.
>>
>> * For the cs_wait_fences() IOCTL:
>>
>> Similar to the above approach.
>>
>> * For the cs_submit() IOCTL:
>>
>> It needs to check whether the current ctx has been marked "*guilty*"
>> and return "*ECANCELED*" if so.
>>
>> * Introduce a new IOCTL to let the UMD query *vram_lost_counter*:
>>
>> This way, the UMD can also block the app from submitting. Like
>> @Nicolai mentioned, we can cache one copy of *vram_lost_counter* when
>> enumerating the physical device, and deny all gl-contexts from
>> submitting if the queried counter is bigger than the one cached in
>> the physical device. (Looks a little overkill to me, but easy to
>> implement.)
>>
>> The UMD can also return an error to the app when creating a
>> gl-context if the currently queried *vram_lost_counter* is bigger
>> than the one cached in the physical device.
>>
>> BTW: I realized that a gl-context is a little different from the
>> kernel's context, because for the kernel a BO is related not to a
>> context but only to an FD, while in the UMD a BO has a backing
>> gl-context. So blocking submission in the UMD layer is also needed,
>> although the KMD will do its job as the bottom line.
>>
>> * Basically, "vram_lost_counter" is exposed by the kernel to let the
>> UMD take control of the robustness extension feature. It will be the
>> UMD's call to make; the KMD only denies "guilty" contexts from
>> submitting.
>>
>> Need your feedback, thx
>>
>> We'd better make the TDR feature land ASAP.
>>
>> BR Monk
>
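
P.S.: For the UMD side of this, the vram_lost_counter usage we have been
discussing boils down to something like the sketch below. I'm assuming
the counter is exposed through the existing INFO ioctl, e.g. as an
AMDGPU_INFO_VRAM_LOST_COUNTER query like in Christian's patch (the exact
name may differ), and "dev" is the libdrm amdgpu device handle; error
handling is omitted:

    /* at screen/physical-device creation: cache the current counter */
    uint64_t baseline, current;
    amdgpu_query_info(dev, AMDGPU_INFO_VRAM_LOST_COUNTER,
                      sizeof(baseline), &baseline);

    /* later, whenever a context-lost condition is suspected */
    amdgpu_query_info(dev, AMDGPU_INFO_VRAM_LOST_COUNTER,
                      sizeof(current), &current);
    if (current != baseline) {
            /* VRAM contents are gone: report the context as lost to
             * the app, then recreate the kernel context and all
             * buffer objects before allowing new submissions */
    }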