Yeah, that sounds like a plan to me. I'm going to commit the patches I've already done in a minute; they still seem to implement quite a bit of what we want to do here.

Regards,
Christian.

On 12.10.2017 at 10:03, Liu, Monk wrote:
>
> V2 summary
>
> Hi team,
>
> *please give your comments*
>
> * When a job times out (set from the lockup_timeout kernel parameter), what KMD should do in the TDR routine:
>
> 1. Update adev->*gpu_reset_counter* and stop the scheduler first.
>
> 2. Set its fence error status to "*ECANCELED*".
>
> 3. Find the *context* behind this job and mark this *context* as "*guilty*" (there will be a new member field in the context structure: *bool guilty*).
>
> a) There will be a "*bool \*guilty*" in the entity structure which points to its parent context's "*bool guilty*" member when the context is initialized, so no matter whether we hold the context or the entity, we always know if it is "guilty".
>
> b) For the kernel entity used for VM updates there is no context behind it, so the kernel entity's "bool \*guilty" is always NULL.
>
> c) The idea of skipping the whole context is for consistency, because we'll fake-signal the hung job in job_run(), so all jobs in its context must be dropped, otherwise we get either bad drawing/computing results or more GPU hangs.
>
> 4. Do the GPU reset, which can be implemented via callbacks so that bare-metal and SR-IOV can each implement it in their preferred style.
>
> 5. After the reset, KMD needs to know whether VRAM loss happened or not. Bare-metal can implement some function to judge this, while for SR-IOV I prefer to read it from the GIM side (for the initial version we consider it always VRAM lost, until the GIM-side change is aligned).
>
> 6. If VRAM loss is hit, update adev->*vram_lost_counter*.
>
> 7. Do GTT recovery and shadow buffer recovery.
>
> 8. Re-schedule all jobs in the mirror list and restart the scheduler.
>
> * For the GPU scheduler function job_run():
>
> 1. Before scheduling a job to the ring, check if job->*vram_lost_counter* == adev->*vram_lost_counter*, and drop this job on a mismatch.
>
> 2. Before scheduling a job to the ring, check whether job->entity->*guilty* is NULL or not, *and drop this job if (guilty != NULL && \*guilty == TRUE)*.
>
> 3. If a job is dropped:
>
> a) Set the job's sched_fence status to "*ECANCELED*".
>
> b) Fake/force-signal the job's hw fence (no need to set the hw fence's status).
>
> * For the cs_wait() IOCTL:
>
> After it finds the fence signaled, it should check whether there is an error on this fence and return the error status of this fence.
>
> * For the cs_wait_fences() IOCTL:
>
> Similar to the above approach.
>
> * For the cs_submit() IOCTL:
>
> 1. Check if the current ctx has been marked "*guilty*" and return "*ECANCELED*" if so.
>
> 2. Set job->*vram_lost_counter* to adev->*vram_lost_counter*, and return "*ECANCELED*" if ctx->*vram_lost_counter* != job->*vram_lost_counter* (Christian already submitted this patch).
>
> a) Discussion: can we return "ENODEV" on a vram_lost_counter mismatch? That way UMD knows this context is in a "device lost" state.
>
> * Introduce a new IOCTL to let UMD query the latest adev->*vram_lost_counter*.
>
> * For amdgpu_ctx_query():
>
>   - *Don't update ctx->reset_counter when querying in this function, otherwise the query result is not consistent.*
>
>   - Set out->state.reset_status to "AMDGPU_CTX_GUILTY_RESET" if the ctx is "*guilty*"; no need to check "ctx->reset_counter".
>
>   - Set out->state.reset_status to "AMDGPU_CTX_INNOCENT_RESET" *if the ctx isn't "guilty" && ctx->reset_counter != adev->reset_counter*.
>
>   - Set out->state.reset_status to "AMDGPU_CTX_NO_RESET" if ctx->reset_counter == adev->reset_counter.
>
>   - Set out->state.flags to "AMDGPU_CTX_FLAG_VRAM_LOST"
if ctx->vram_lost_counter != adev->vram_lost_counter.
>
>   - Discussion: can we return "ENODEV" from amdgpu_ctx_query() if ctx->vram_lost_counter != adev->vram_lost_counter? That way UMD knows this context is in a "device lost" state.
>
>   - UMD shall release this context if its status is AMDGPU_CTX_GUILTY_RESET or its flags contain "AMDGPU_CTX_FLAG_VRAM_LOST".
>
> For UMD behavior we still have something to consider:
>
> If MESA creates a new context from an old context (share list? I'm not familiar with UMD; David Mao shall discuss it with Nicolai), the newly created context's vram_lost_counter and reset_counter shall both be carried over from that old context, otherwise CS_SUBMIT will not block it, which isn't correct.
>
> Need your feedback, thx
>
> *From:* amd-gfx [mailto:amd-gfx-bounces at lists.freedesktop.org] *On Behalf Of* Liu, Monk
> *Sent:* October 11, 2017, 13:34
> *To:* Koenig, Christian <Christian.Koenig at amd.com>; Haehnle, Nicolai <Nicolai.Haehnle at amd.com>; Olsak, Marek <Marek.Olsak at amd.com>; Deucher, Alexander <Alexander.Deucher at amd.com>
> *Cc:* Ramirez, Alejandro <Alejandro.Ramirez at amd.com>; amd-gfx at lists.freedesktop.org; Filipas, Mario <Mario.Filipas at amd.com>; Ding, Pixel <Pixel.Ding at amd.com>; Li, Bingley <Bingley.Li at amd.com>; Jiang, Jerry (SW) <Jerry.Jiang at amd.com>
> *Subject:* TDR and VRAM lost handling in KMD:
>
> Hi Christian & Nicolai,
>
> We need to reach some agreement on what MESA/UMD should do and what KMD should do. *Please give your comments with "okay" or "no" and your ideas on the items below.*
>
> * When a job times out (set from the lockup_timeout kernel parameter), what KMD should do in the TDR routine:
>
> 1. Update adev->*gpu_reset_counter* and stop the scheduler first (*gpu_reset_counter* is used to force a VM flush after GPU reset; that is out of this thread's scope, so no more discussion on it).
>
> 2. Set its fence error status to "*ETIME*".
>
> 3. Find the entity/ctx
behind this job and mark this ctx as "*guilty*".
>
> 4. Kick this job out of the scheduler's mirror list, so this job won't get re-scheduled to the ring anymore.
>
> 5. Kick out all jobs in this "guilty" ctx's KFIFO queue, and set all their fence statuses to "*ECANCELED*".
>
> *6.* Force-signal all fences that got kicked out by the above two steps, *otherwise UMD will block forever when waiting on those fences*.
>
> 7. Do the GPU reset, which can be implemented via callbacks so that bare-metal and SR-IOV can each implement it in their preferred style.
>
> 8. After the reset, KMD needs to know whether VRAM loss happened or not. Bare-metal can implement some function to judge this, while for SR-IOV I prefer to read it from the GIM side (for the initial version we consider it always VRAM lost, until the GIM-side change is aligned).
>
> 9. If VRAM loss is not hit, continue; otherwise:
>
> a) Update adev->*vram_lost_counter*.
>
> b) Iterate over all living ctxs and mark them all as "*guilty*", since VRAM loss actually ruins all VRAM contents.
>
> c) Kick out all jobs in all ctxs' KFIFO queues, and set all their fence statuses to "*ECANCELED*".
>
> 10. Do GTT recovery and VRAM page tables/entries recovery (optional, do we need it?).
>
> 11. Re-schedule all jobs remaining in the mirror list to the ring again and restart the scheduler (for the VRAM lost case, no job will be re-scheduled).
>
> * For the cs_wait() IOCTL:
>
> After it finds the fence signaled, it should check with *"dma_fence_get_status"*
to see if there is an error there, and return the error status of the fence.
>
> * For the cs_wait_fences() IOCTL:
>
> Similar to the above approach.
>
> * For the cs_submit() IOCTL:
>
> It needs to check whether the current ctx has been marked "*guilty*" and return "*ECANCELED*" if so.
>
> * Introduce a new IOCTL to let UMD query *vram_lost_counter*:
>
> This way UMD can also block the app from submitting. As @Nicolai mentioned, we can cache one copy of *vram_lost_counter* when enumerating the physical device, and deny all GL contexts from submitting if the queried counter is bigger than the one cached in the physical device (looks a little overkill to me, but easy to implement).
>
> UMD can also return an error to the app when creating a GL context if the currently queried *vram_lost_counter* is bigger than the one cached in the physical device.
>
> BTW: I realized that a GL context is a little different from the kernel's context, because for the kernel a BO is not related to a context but only to an FD, while in UMD a BO has a backing GL context. So blocking submission in the UMD layer is also needed, although KMD will still do its job as the bottom line.
>
> * Basically, "vram_lost_counter" is exposed by the kernel to let UMD take control of the robustness extension feature; it will be UMD's call to make, and KMD only denies "guilty" contexts from submitting.
>
> Need your feedback, thx
>
> We'd better get the TDR feature landed ASAP.
>
> BR Monk