Hi David,

Agreed. I'd also like to use the first variant (UMD compares the current
vram_lost count with the share list's vram_lost count) for Mesa.

Cheers,
Nicolai

On 12.10.2017 10:43, Mao, David wrote:
> Thanks Monk for the summary!
>
> Hi Nicolai,
> In order to block a new context from referencing the old allocations,
> I think we need to do something in UMD so that KMD doesn't need to monitor
> the resource list.
> I want to make sure we are on the same page.
> If you agree, then there are two options to do that in UMD (you can do
> whatever you want, I just want to elaborate the idea a little bit to
> facilitate the discussion):
> - If the share list is valid, the driver needs to compare the current
> vram_lost count with the share list's vram_lost count; context creation
> fails if the share list was created before the reset.
> - Or, we can copy the vram_lost count from the share list; the kernel will
> fail the submission if that vram_lost count is smaller than the current one.
> I personally want to go with the first option for OrcaGL.
>
> Thanks.
> Best Regards,
> David
> -
>> On 12 Oct 2017, at 4:03 PM, Liu, Monk <Monk.Liu at amd.com
>> <mailto:Monk.Liu at amd.com>> wrote:
>>
>> V2 summary
>>
>> Hi team
>>
>> *please give your comments*
>>
>> - When a job times out (set from the lockup_timeout kernel parameter),
>> what KMD should do in the TDR routine:
>>
>> 1. Update adev->*gpu_reset_counter*, and stop the scheduler first
>> 2. Set its fence error status to "*ECANCELED*"
>> 3. Find the *context* behind this job, and mark this *context* as
>> "*guilty*" (there will be a new member field in the context structure:
>> "*bool guilty*")
>> a) There will be a "bool *guilty" in the entity structure, which points to
>> its parent context's member "bool guilty" when the context is initialized,
>> so no matter whether we hold the context or the entity, we always know
>> whether it is "guilty"
>> b) For the kernel entity used for VM updates, there is no context behind
>> it, so the kernel entity's "bool *guilty" is always NULL.
>> c) The idea of skipping the whole context is for consistency: because we
>> fake-signal the hung job in job_run(), all jobs in its context shall be
>> dropped as well, otherwise we get either bad drawing/computing results or
>> another GPU hang.
>> 4. Do the GPU reset, which can be a set of callbacks so that bare-metal
>> and SR-IOV can each implement it in their preferred style
>> 5. After the reset, KMD needs to know whether VRAM loss happened;
>> bare-metal can implement some function to judge that, while for SR-IOV I
>> prefer to read it from the GIM side (for the initial version we consider
>> it always VRAM lost, until the GIM side change is aligned)
>> 6. If VRAM loss is hit, update adev->*vram_lost_counter*.
>> 7. Do GTT recovery and shadow buffer recovery.
>> 8. Re-schedule all jobs in the mirror list and restart the scheduler
>>
>> - For the GPU scheduler function --- job_run():
>>
>> 1. Before scheduling a job to the ring, check whether
>> job->*vram_lost_counter* == adev->*vram_lost_counter*, and drop the job
>> on a mismatch
>> 2. Before scheduling a job to the ring, check whether
>> job->entity->*guilty* is NULL or not, and drop the job if
>> (guilty != NULL && *guilty == TRUE)
>> 3. If a job is dropped:
>> a) set the job's sched_fence status to "*ECANCELED*"
>> b) fake/force-signal the job's hw fence (no need to set the hw fence's
>> status)
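As a rough illustration of the two drop checks in job_run() described above,
something along these lines; the struct and field names here are placeholders,
not the actual amdgpu/scheduler code:

    /* Illustrative sketch only: struct and field names are placeholders,
     * not the real amdgpu/scheduler definitions. */
    #include <stdbool.h>

    struct sketch_job {
        unsigned int vram_lost_counter; /* snapshot taken at cs_submit time */
        bool        *guilty;            /* points at the owning context's flag;
                                         * NULL for kernel entities (VM updates) */
    };

    static bool sketch_job_should_drop(const struct sketch_job *job,
                                       unsigned int adev_vram_lost_counter)
    {
        /* Drop if VRAM was lost after this job was submitted. */
        if (job->vram_lost_counter != adev_vram_lost_counter)
            return true;

        /* Drop if the owning context was marked guilty by the TDR routine. */
        if (job->guilty && *job->guilty)
            return true;

        return false;
    }

Keeping "guilty" as a pointer into the context means the kernel-entity case
(no context behind it) falls out naturally as a NULL check, per items a) and b).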
>>
>> - For the cs_wait() IOCTL:
>> After it finds the fence signaled, it should check whether there is an
>> error on this fence and return the error status of this fence
>>
>> - For the cs_wait_fences() IOCTL:
>> Similar to the above approach
>>
>> - For the cs_submit() IOCTL:
>> 1. Check whether the current ctx has been marked "*guilty*" and return
>> "*ECANCELED*" if so.
>> 2. Set job->*vram_lost_counter* from adev->*vram_lost_counter*, and
>> return "*ECANCELED*" if ctx->*vram_lost_counter* !=
>> job->*vram_lost_counter* (Christian already submitted this patch)
>> a) discussion: can we return "ENODEV" if vram_lost_counter mismatches?
>> That way UMD knows this context is under "device lost"
>>
>> - Introduce a new IOCTL to let UMD query the latest
>> adev->*vram_lost_counter*
>>
>> - For amdgpu_ctx_query():
>>   - *Don't update ctx->reset_counter when querying this function,
>>     otherwise the query result is not consistent*
>>   - Set out->state.reset_status to "AMDGPU_CTX_GUILTY_RESET" if the ctx
>>     is "*guilty*"; no need to check "ctx->reset_counter"
>>   - Set out->state.reset_status to "AMDGPU_CTX_INNOCENT_RESET" if the ctx
>>     isn't "guilty" && ctx->reset_counter != adev->reset_counter
>>   - Set out->state.reset_status to "AMDGPU_CTX_NO_RESET" if
>>     ctx->reset_counter == adev->reset_counter
>>   - Set out->state.flags to "AMDGPU_CTX_FLAG_VRAM_LOST" if
>>     ctx->vram_lost_counter != adev->vram_lost_counter
>>     - discussion: can we return "ENODEV" for amdgpu_ctx_query() if
>>       ctx->vram_lost_counter != adev->vram_lost_counter? That way UMD
>>       knows this context is under "device lost"
>>   - UMD shall release this context if it is AMDGPU_CTX_GUILTY_RESET or
>>     its flags include "AMDGPU_CTX_FLAG_VRAM_LOST"
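A minimal sketch of the amdgpu_ctx_query() status logic listed above; all
structure names, and the AMDGPU_CTX_FLAG_VRAM_LOST flag itself, are
placeholders following this proposal, not the existing UAPI:

    /* Illustrative sketch of the proposed amdgpu_ctx_query() behaviour.
     * Names are placeholders taken from the discussion, not driver code. */
    #include <stdbool.h>
    #include <stdint.h>

    enum {
        AMDGPU_CTX_NO_RESET,
        AMDGPU_CTX_GUILTY_RESET,
        AMDGPU_CTX_INNOCENT_RESET,
    };
    #define AMDGPU_CTX_FLAG_VRAM_LOST (1u << 0) /* proposed new flag */

    struct sketch_ctx  { bool guilty; uint32_t reset_counter; uint32_t vram_lost_counter; };
    struct sketch_adev { uint32_t gpu_reset_counter; uint32_t vram_lost_counter; };
    struct sketch_out  { uint32_t reset_status; uint32_t flags; };

    static void sketch_ctx_query(const struct sketch_ctx *ctx,
                                 const struct sketch_adev *adev,
                                 struct sketch_out *out /* zero-initialized by caller */)
    {
        /* ctx->reset_counter is deliberately NOT updated here, so repeated
         * queries stay consistent. */
        if (ctx->guilty)
            out->reset_status = AMDGPU_CTX_GUILTY_RESET;
        else if (ctx->reset_counter != adev->gpu_reset_counter)
            out->reset_status = AMDGPU_CTX_INNOCENT_RESET;
        else
            out->reset_status = AMDGPU_CTX_NO_RESET;

        if (ctx->vram_lost_counter != adev->vram_lost_counter)
            out->flags |= AMDGPU_CTX_FLAG_VRAM_LOST;
    }

Whether a guilty or VRAM-lost context should instead make the query (or
cs_submit) return "ENODEV" is exactly the open question raised in the two
"discussion" points above.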
>>
>> For UMD behavior we still have something to consider:
>> If MESA creates a new context from an old context (share list? I'm not
>> familiar with UMD; David Mao shall discuss it with Nicolai), the newly
>> created context's vram_lost_counter and reset_counter shall both be
>> copied over from that old context, otherwise CS_SUBMIT will not block it,
>> which isn't correct
>>
>> Need your feedback, thx
>>
>> *From:* amd-gfx [mailto:amd-gfx-bounces at lists.freedesktop.org] *On
>> Behalf Of* Liu, Monk
>> *Sent:* 2017-10-11 13:34
>> *To:* Koenig, Christian <Christian.Koenig at amd.com
>> <mailto:Christian.Koenig at amd.com>>; Haehnle, Nicolai
>> <Nicolai.Haehnle at amd.com <mailto:Nicolai.Haehnle at amd.com>>; Olsak,
>> Marek <Marek.Olsak at amd.com <mailto:Marek.Olsak at amd.com>>; Deucher,
>> Alexander <Alexander.Deucher at amd.com <mailto:Alexander.Deucher at amd.com>>
>> *Cc:* Ramirez, Alejandro <Alejandro.Ramirez at amd.com
>> <mailto:Alejandro.Ramirez at amd.com>>; amd-gfx at lists.freedesktop.org
>> <mailto:amd-gfx at lists.freedesktop.org>; Filipas, Mario
>> <Mario.Filipas at amd.com <mailto:Mario.Filipas at amd.com>>; Ding, Pixel
>> <Pixel.Ding at amd.com <mailto:Pixel.Ding at amd.com>>; Li, Bingley
>> <Bingley.Li at amd.com <mailto:Bingley.Li at amd.com>>; Jiang, Jerry (SW)
>> <Jerry.Jiang at amd.com <mailto:Jerry.Jiang at amd.com>>
>> *Subject:* TDR and VRAM lost handling in KMD:
>>
>> Hi Christian & Nicolai,
>>
>> We need to reach agreement on what MESA/UMD should do and what KMD
>> should do. *Please give your comments with "okay" or "No" and your ideas
>> on the items below.*
>>
>> - When a job times out (set from the lockup_timeout kernel parameter),
>> what KMD should do in the TDR routine:
>>
>> 1. Update adev->*gpu_reset_counter*, and stop the scheduler first
>> (*gpu_reset_counter* is used to force a VM flush after GPU reset; that is
>> out of this thread's scope, so no more discussion on it)
>> 2. Set its fence error status to "*ETIME*"
>> 3. Find the entity/ctx behind this job, and mark this ctx as "*guilty*"
>> 4. Kick this job out of the scheduler's mirror list, so it won't get
>> re-scheduled to the ring anymore.
>> 5. Kick out all jobs in this "guilty" ctx's KFIFO queue, and set all
>> their fence status to "*ECANCELED*"
>> 6. *Force-signal all fences that were kicked out by the above two steps,
>> otherwise UMD will block forever if waiting on those fences*
>> 7. Do the GPU reset, which can be a set of callbacks so that bare-metal
>> and SR-IOV can each implement it in their preferred style
>> 8. After the reset, KMD needs to know whether VRAM loss happened;
>> bare-metal can implement some function to judge that, while for SR-IOV I
>> prefer to read it from the GIM side (for the initial version we consider
>> it always VRAM lost, until the GIM side change is aligned)
>> 9. If VRAM loss is not hit, continue; otherwise:
>> a) Update adev->*vram_lost_counter*
>> b) Iterate over all living ctx, and mark all of them as "*guilty*", since
>> VRAM loss actually ruins all VRAM contents
>> c) Kick out all jobs in all ctx's KFIFO queues, and set all their fence
>> status to "*ECANCELED*"
>> 10. Do GTT recovery and VRAM page tables/entries recovery (optional; do
>> we need it?)
>> 11. Re-schedule all jobs remaining in the mirror list to the ring again
>> and restart the scheduler (for the VRAM lost case, no job will be
>> re-scheduled)
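For steps 5 and 6 above (setting the kicked-out jobs' fence status and
force-signalling them so UMD waiters don't block forever), a minimal sketch
using the generic dma_fence helpers could look like this; how the fences are
collected into an array is illustrative only, not the actual scheduler code:

    /* Illustrative sketch: the way the kicked-out fences are gathered is
     * invented here; only the dma_fence calls reflect the intent of steps 5-6. */
    #include <linux/dma-fence.h>
    #include <linux/errno.h>

    static void sketch_cancel_kicked_out_fences(struct dma_fence **fences,
                                                unsigned int count)
    {
        unsigned int i;

        for (i = 0; i < count; i++) {
            /* Record why the fence completed... */
            dma_fence_set_error(fences[i], -ECANCELED);
            /* ...then force-signal it so cs_wait()/cs_wait_fences() callers
             * wake up and can read the error via dma_fence_get_status(). */
            dma_fence_signal(fences[i]);
        }
    }

Note that dma_fence_set_error() has to be called before the signal; once a
fence is signalled its error field is not supposed to change.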
>>
>> - For the cs_wait() IOCTL:
>> After it finds the fence signaled, it should check with
>> *dma_fence_get_status* whether there is an error there,
>> and return the error status of the fence
>>
>> - For the cs_wait_fences() IOCTL:
>> Similar to the above approach
>>
>> - For the cs_submit() IOCTL:
>> It needs to check whether the current ctx has been marked as "*guilty*"
>> and return "*ECANCELED*" if so
>>
>> - Introduce a new IOCTL to let UMD query *vram_lost_counter*:
>> This way, UMD can also block the app from submitting. Like @Nicolai
>> mentioned, we can cache one copy of *vram_lost_counter* when enumerating
>> the physical device, and deny all gl-contexts from submitting if the
>> queried counter is bigger than the one cached in the physical device
>> (looks a little overkill to me, but easy to implement).
>> UMD can also return an error to the app when creating a gl-context if the
>> currently queried *vram_lost_counter* is bigger than the one cached in
>> the physical device.
>> BTW: I realized that a gl-context is a little different from the kernel's
>> context, because for the kernel a BO is not tied to a context but only to
>> an FD, while in UMD a BO has a backing gl-context, so blocking submission
>> in the UMD layer is also needed, although KMD will do its job as the
>> bottom line.
>>
>> - Basically, "vram_lost_counter" is exposed by the kernel to let UMD take
>> control of the robustness extension feature; it is UMD's call to make,
>> and KMD only denies "guilty" contexts from submitting
>>
>> Need your feedback, thx
>>
>> We'd better get the TDR feature landed ASAP
>>
>> BR Monk
>
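Tying this back to the first variant David describes at the top (the one I'd
prefer for Mesa): a hypothetical UMD-side sketch of the share-list check at
context creation. All names are invented for illustration; the current counter
value would come from the new query IOCTL proposed above (or from the copy
cached per physical device):

    /* Hypothetical UMD-side sketch of the "first variant": fail context
     * creation when the share list was created before the last VRAM loss. */
    #include <stdint.h>
    #include <stddef.h>

    struct umd_context {
        uint32_t vram_lost_counter; /* kernel counter value at creation time */
    };

    static int umd_create_context(const struct umd_context *share_list,
                                  uint32_t current_vram_lost,
                                  struct umd_context *out_ctx)
    {
        if (share_list && share_list->vram_lost_counter != current_vram_lost)
            return -1; /* share-list allocations predate the reset: report
                        * context lost / device lost to the application */

        out_ctx->vram_lost_counter = current_vram_lost;
        return 0;
    }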