Yeah, that was exactly my thinking as well. Christian. Am 11.10.2017 um 15:59 schrieb Liu, Monk: > But if we keep counter in entity, there is one issue I suddenly though of : > > For regular user context, after vram lost UMD will aware this context is LOST since we have a counter copy in context, so user space can close it and re-create one > But for kernel entity, since no U/K interface, so it is kernel's responsibility to recover this kernel entity to work, that make things complicated .... > > Emm, I agree that keep a copy in context and in job is a good move > > BR Monk > > -----Original Message----- > From: amd-gfx [mailto:amd-gfx-bounces at lists.freedesktop.org] On Behalf Of Liu, Monk > Sent: Wednesday, October 11, 2017 9:51 PM > To: Koenig, Christian <Christian.Koenig at amd.com>; Zhou, David(ChunMing) <David1.Zhou at amd.com>; Haehnle, Nicolai <Nicolai.Haehnle at amd.com>; Olsak, Marek <Marek.Olsak at amd.com>; Deucher, Alexander <Alexander.Deucher at amd.com> > Cc: Ramirez, Alejandro <Alejandro.Ramirez at amd.com>; amd-gfx at lists.freedesktop.org; Filipas, Mario <Mario.Filipas at amd.com>; Ding, Pixel <Pixel.Ding at amd.com>; Li, Bingley <Bingley.Li at amd.com>; Jiang, Jerry (SW) <Jerry.Jiang at amd.com> > Subject: RE: TDR and VRAM lost handling in KMD: > >> Some jobs don't have a context (VM updates, clears, buffer moves). > What? I remember even the VM update job is with a kernel entity, (no context is true), and if entity can keep a counter copy That can solve your concerns > > > > -----Original Message----- > From: Koenig, Christian > Sent: Wednesday, October 11, 2017 9:39 PM > To: Liu, Monk <Monk.Liu at amd.com>; Zhou, David(ChunMing) <David1.Zhou at amd.com>; Haehnle, Nicolai <Nicolai.Haehnle at amd.com>; Olsak, Marek <Marek.Olsak at amd.com>; Deucher, Alexander <Alexander.Deucher at amd.com> > Cc: Ramirez, Alejandro <Alejandro.Ramirez at amd.com>; amd-gfx at lists.freedesktop.org; Filipas, Mario <Mario.Filipas at amd.com>; Ding, Pixel <Pixel.Ding at amd.com>; Li, Bingley <Bingley.Li at amd.com>; Jiang, Jerry (SW) <Jerry.Jiang at amd.com> > Subject: Re: TDR and VRAM lost handling in KMD: > > Some jobs don't have a context (VM updates, clears, buffer moves). > > I would still like to abort those when they where issued before a losing VRAM content, but keep the entity usable. > > So I think we should just keep a copy of the VRAM lost counter in the job. That also removes us from the burden of figuring out the context during job run. > > Regards, > Christian. > > Am 11.10.2017 um 15:35 schrieb Liu, Monk: >> I think just compare the copy from context/entity with current counter >> is enough, don't see how it's better to keep another copy in JOB >> >> >> -----Original Message----- >> From: Koenig, Christian >> Sent: Wednesday, October 11, 2017 6:40 PM >> To: Zhou, David(ChunMing) <David1.Zhou at amd.com>; Liu, Monk >> <Monk.Liu at amd.com>; Haehnle, Nicolai <Nicolai.Haehnle at amd.com>; Olsak, >> Marek <Marek.Olsak at amd.com>; Deucher, Alexander >> <Alexander.Deucher at amd.com> >> Cc: Ramirez, Alejandro <Alejandro.Ramirez at amd.com>; >> amd-gfx at lists.freedesktop.org; Filipas, Mario <Mario.Filipas at amd.com>; >> Ding, Pixel <Pixel.Ding at amd.com>; Li, Bingley <Bingley.Li at amd.com>; >> Jiang, Jerry (SW) <Jerry.Jiang at amd.com> >> Subject: Re: TDR and VRAM lost handling in KMD: >> >> I've already posted a patch for this on the mailing list. >> >> Basically we just copy the vram lost counter into the job and when we try to run the job we can mark it as canceled. >> >> Regards, >> Christian. >> >> Am 11.10.2017 um 12:14 schrieb Chunming Zhou: >>> Your summary lacks the below issue: >>> >>> How about the job already pushed in scheduler queue when vram is lost? >>> >>> >>> Regards, >>> David Zhou >>> On 2017å¹´10æ??11æ?¥ 17:41, Liu, Monk wrote: >>>> Okay, let me summary our whole idea together and see if it works: >>>> >>>> 1, For cs_submit, always check vram-lost_counter first and reject >>>> the submit (return -ECANCLED to UMD) if ctx->vram_lost_counter != >>>> adev->vram_lost_counter. That way the vram lost issue can be handled >>>> >>>> 2, for cs_submit we still need to check if the incoming context is >>>> "AMDGPU_CTX_GUILTY_RESET" or not even if we found >>>> ctx->vram_lost_counter == adev->vram_lost_counter, and we can reject >>>> the submit >>>> If it is "AMDGPU_CTX_GUILTY_RESET", correct ? >>>> >>>> 3, in gpu_reset() routine, we only mark the hang job's entity as >>>> guilty (so we need to add new member in entity structure), and not >>>> kick it out in gpu_reset() stage, but we need to set the context >>>> behind this entity as " AMDGPU_CTX_GUILTY_RESET" >>>> And if reset introduces VRAM LOST, we just update >>>> adev->vram_lost_counter, but *don't* change all entity to guilty, so >>>> still only the hang job's entity is "guilty" >>>> After some entity marked as "guilty", we find a way to set the >>>> context behind it as AMDGPU_CTX_GUILTY_RESET, because this is U/K >>>> interface, we need let UMD can know that this context is wrong. >>>> >>>> 4, in gpu scheduler's run_job() routine, since it only reads entity, >>>> so we skip job scheduling once found the entity is "guilty" >>>> >>>> >>>> Does above sounds good ? >>>> >>>> >>>> >>>> -----Original Message----- >>>> From: Haehnle, Nicolai >>>> Sent: Wednesday, October 11, 2017 5:26 PM >>>> To: Liu, Monk <Monk.Liu at amd.com>; Koenig, Christian >>>> <Christian.Koenig at amd.com>; Olsak, Marek <Marek.Olsak at amd.com>; >>>> Deucher, Alexander <Alexander.Deucher at amd.com> >>>> Cc: amd-gfx at lists.freedesktop.org; Ding, Pixel <Pixel.Ding at amd.com>; >>>> Jiang, Jerry (SW) <Jerry.Jiang at amd.com>; Li, Bingley >>>> <Bingley.Li at amd.com>; Ramirez, Alejandro >>>> <Alejandro.Ramirez at amd.com>; Filipas, Mario <Mario.Filipas at amd.com> >>>> Subject: Re: TDR and VRAM lost handling in KMD: >>>> >>>> On 11.10.2017 11:18, Liu, Monk wrote: >>>>> Let's talk it simple, When vram lost hit, what's the action for >>>>> amdgpu_ctx_query()/AMDGPU_CTX_OP_QUERY_STATE on other contexts (not >>>>> the one trigger gpu hang) after vram lost ? do you mean we return >>>>> -ENODEV to UMD ? >>>> It should successfully return AMDGPU_CTX_INNOCENT_RESET. >>>> >>>> >>>>> In cs_submit, with vram lost hit, if we don't mark all contexts as >>>>> "guilty", how we block its from submitting ? can you show some >>>>> implement way >>>> if (ctx->vram_lost_counter != atomic_read(&adev->vram_lost_counter)) >>>> return -ECANCELED; >>>> >>>> (where ctx->vram_lost_counter is initialized at context creation >>>> time and never changed afterwards) >>>> >>>> >>>>> BTW: the "guilty" here is a new member I want to add to context, it >>>>> is not related with AMDGPU_CTX_OP_QUERY_STATE UK interface, Looks I >>>>> need to unify them and only one place to mark guilty or not >>>> Right, the AMDGPU_CTX_OP_QUERY_STATE handling needs to be made >>>> consistent with the rest. >>>> >>>> Cheers, >>>> Nicolai >>>> >>>> >>>>> BR Monk >>>>> >>>>> -----Original Message----- >>>>> From: Haehnle, Nicolai >>>>> Sent: Wednesday, October 11, 2017 5:00 PM >>>>> To: Liu, Monk <Monk.Liu at amd.com>; Koenig, Christian >>>>> <Christian.Koenig at amd.com>; Olsak, Marek <Marek.Olsak at amd.com>; >>>>> Deucher, Alexander <Alexander.Deucher at amd.com> >>>>> Cc: amd-gfx at lists.freedesktop.org; Ding, Pixel >>>>> <Pixel.Ding at amd.com>; Jiang, Jerry (SW) <Jerry.Jiang at amd.com>; Li, >>>>> Bingley <Bingley.Li at amd.com>; Ramirez, Alejandro >>>>> <Alejandro.Ramirez at amd.com>; Filipas, Mario <Mario.Filipas at amd.com> >>>>> Subject: Re: TDR and VRAM lost handling in KMD: >>>>> >>>>> On 11.10.2017 10:48, Liu, Monk wrote: >>>>>> On "guilty": "guilty" is a term that's used by APIs (e.g. OpenGL), >>>>>> so it's reasonable to use it. However, it /does not/ make sense to >>>>>> mark idle contexts as "guilty" just because VRAM is lost. VRAM >>>>>> lost is a perfect example where the driver should report context >>>>>> lost to applications with the "innocent" flag for contexts that >>>>>> were idle at the time of reset. The only context(s) that should be >>>>>> reported as "guilty" >>>>>> (or perhaps "unknown" in some cases) are the ones that were >>>>>> executing at the time of reset. >>>>>> >>>>>> ML: KMD mark all contexts as guilty is because that way we can >>>>>> unify our IOCTL behavior: e.g. for IOCTL only block >>>>>> â??guiltyâ??context , no need to worry about vram-lost-counter >>>>>> anymore, thatâ??s a implementation style. I donâ??t think it is >>>>>> related with UMD layer, >>>>>> >>>>>> For UMD the gl-context isnâ??t aware of by KMD, so UMD can implement >>>>>> it own â??guiltyâ?? gl-context if you want. >>>>> Well, to some extent this is just semantics, but it helps to keep >>>>> the terminology consistent. >>>>> >>>>> Most importantly, please keep the AMDGPU_CTX_OP_QUERY_STATE uapi in >>>>> mind: this returns one of >>>>> AMDGPU_CTX_{GUILTY,INNOCENT,UNKNOWN}_RECENT, >>>>> and it must return "innocent" for contexts that are only lost due >>>>> to VRAM lost without being otherwise involved in the timeout that >>>>> lead to the reset. >>>>> >>>>> The point is that in the places where you used "guilty" it would be >>>>> better to use "context lost", and then further differentiate >>>>> between guilty/innocent context lost based on the details of what happened. >>>>> >>>>> >>>>>> If KMD doesnâ??t mark all ctx as guilty after VRAM lost, can you >>>>>> illustrate what rule KMD should obey to check in KMS IOCTL like >>>>>> cs_sumbit ?? letâ??s see which way better >>>>> if (ctx->vram_lost_counter != atomic_read(&adev->vram_lost_counter)) >>>>> return -ECANCELED; >>>>> >>>>> Plus similar logic for AMDGPU_CTX_OP_QUERY_STATE. >>>>> >>>>> Yes, it's one additional check in cs_submit. If you're worried >>>>> about that (and Christian's concerns about possible issues with >>>>> walking over all contexts are addressed), I suppose you could just >>>>> store a per-context >>>>> >>>>> unsigned context_reset_status; >>>>> >>>>> instead of a `bool guilty`. Its value would start out as 0 >>>>> (AMDGPU_CTX_NO_RESET) and would be set to the correct value during >>>>> reset. >>>>> >>>>> Cheers, >>>>> Nicolai >>>>> >>>>> >>>>>> *From:*Haehnle, Nicolai >>>>>> *Sent:* Wednesday, October 11, 2017 4:41 PM >>>>>> *To:* Liu, Monk <Monk.Liu at amd.com>; Koenig, Christian >>>>>> <Christian.Koenig at amd.com>; Olsak, Marek <Marek.Olsak at amd.com>; >>>>>> Deucher, Alexander <Alexander.Deucher at amd.com> >>>>>> *Cc:* amd-gfx at lists.freedesktop.org; Ding, Pixel >>>>>> <Pixel.Ding at amd.com>; Jiang, Jerry (SW) <Jerry.Jiang at amd.com>; Li, >>>>>> Bingley <Bingley.Li at amd.com>; Ramirez, Alejandro >>>>>> <Alejandro.Ramirez at amd.com>; Filipas, Mario >>>>>> <Mario.Filipas at amd.com> >>>>>> *Subject:* Re: TDR and VRAM lost handling in KMD: >>>>>> >>>>>> From a Mesa perspective, this almost all sounds reasonable to me. >>>>>> >>>>>> On "guilty": "guilty" is a term that's used by APIs (e.g. OpenGL), >>>>>> so it's reasonable to use it. However, it /does not/ make sense to >>>>>> mark idle contexts as "guilty" just because VRAM is lost. VRAM >>>>>> lost is a perfect example where the driver should report context >>>>>> lost to applications with the "innocent" flag for contexts that >>>>>> were idle at the time of reset. The only context(s) that should be >>>>>> reported as "guilty" >>>>>> (or perhaps "unknown" in some cases) are the ones that were >>>>>> executing at the time of reset. >>>>>> >>>>>> On whether the whole context is marked as guilty from a user space >>>>>> perspective, it would simply be nice for user space to get >>>>>> consistent answers. It would be a bit odd if we could e.g. succeed >>>>>> in submitting an SDMA job after a GFX job was rejected. This would >>>>>> point in favor of marking the entire context as guilty (although >>>>>> that could happen lazily instead of at reset time). On the other >>>>>> hand, if that's too big a burden for the kernel implementation I'm >>>>>> sure we can live without it. >>>>>> >>>>>> Cheers, >>>>>> >>>>>> Nicolai >>>>>> >>>>>> ------------------------------------------------------------------ >>>>>> - >>>>>> --- >>>>>> -- >>>>>> >>>>>> *From:*Liu, Monk >>>>>> *Sent:* Wednesday, October 11, 2017 10:15:40 AM >>>>>> *To:* Koenig, Christian; Haehnle, Nicolai; Olsak, Marek; Deucher, >>>>>> Alexander >>>>>> *Cc:* amd-gfx at lists.freedesktop.org >>>>>> <mailto:amd-gfx at lists.freedesktop.org>; Ding, Pixel; Jiang, Jerry >>>>>> (SW); Li, Bingley; Ramirez, Alejandro; Filipas, Mario >>>>>> *Subject:* RE: TDR and VRAM lost handling in KMD: >>>>>> >>>>>> 1.Set its fence error status to â??*ETIME*â??, >>>>>> >>>>>> No, as I already explained ETIME is for synchronous operation. >>>>>> >>>>>> In other words when we return ETIME from the wait IOCTL it would >>>>>> mean that the waiting has somehow timed out, but not the job we waited for. >>>>>> >>>>>> Please use ECANCELED as well or some other error code when we find >>>>>> that we need to distinct the timedout job from the canceled ones >>>>>> (probably a good idea, but I'm not sure). >>>>>> >>>>>> [ML] Iâ??m okay if you insist not to use ETIME >>>>>> >>>>>> 1.Find the entity/ctx behind this job, and set this ctx as â??*guilty*â?? >>>>>> >>>>>> Not sure. Do we want to set the whole context as guilty or just >>>>>> the entity? >>>>>> >>>>>> Setting the whole contexts as guilty sounds racy to me. >>>>>> >>>>>> BTW: We should use a different name than "guilty", maybe just >>>>>> "bool canceled;" ? >>>>>> >>>>>> [ML] I think context is better than entity, because for example if >>>>>> you only block entity_0 of context and allow entity_N run, that >>>>>> means the dependency between entities are broken (e.g. page table >>>>>> updates in >>>>>> >>>>>> Sdma entity pass but gfx submit in GFX entity blocked, not make >>>>>> sense to me) >>>>>> >>>>>> Weâ??d better either block the whole context or let notâ?¦ >>>>>> >>>>>> 1.Kick out all jobs in this â??guiltyâ?? ctxâ??s KFIFO queue, and set >>>>>> all their fence status to â??*ECANCELED*â?? >>>>>> >>>>>> Setting ECANCELED should be ok. But I think we should do this when >>>>>> we try to run the jobs and not during GPU reset. >>>>>> >>>>>> [ML] without deep thought and expritment, Iâ??m not sure the >>>>>> difference between them, but kick it out in gpu_reset routine is >>>>>> more efficient, >>>>>> >>>>>> Otherwise you need to check context/entity guilty flag in run_job >>>>>> routine â?¦and you need to it for every context/entity, I donâ??t see >>>>>> why >>>>>> >>>>>> We donâ??t just kickout all of them in gpu_reset stage â?¦. >>>>>> >>>>>> a)Iterate over all living ctx, and set all ctx as â??*guilty*â?? since >>>>>> VRAM lost actually ruins all VRAM contents >>>>>> >>>>>> No, that shouldn't be done by comparing the counters. Iterating >>>>>> over all contexts is way to much overhead. >>>>>> >>>>>> [ML] because I want to make KMS IOCTL rules clean, like they donâ??t >>>>>> need to differentiate VRAM lost or not, they only interested in if >>>>>> the context is guilty or not, and block >>>>>> >>>>>> Submit for guilty ones. >>>>>> >>>>>> *Can you give more details of your idea? And better the detail >>>>>> implement in cs_submit, I want to see how you want to block submit >>>>>> without checking context guilty flag* >>>>>> >>>>>> a)Kick out all jobs in all ctxâ??s KFIFO queue, and set all their >>>>>> fence status to â??*ECANCELDED*â?? >>>>>> >>>>>> Yes and no, that should be done when we try to run the jobs and >>>>>> not during GPU reset. >>>>>> >>>>>> [ML] again, kicking out them in gpu reset routine is high >>>>>> efficient, otherwise you need check on every job in run_job() >>>>>> >>>>>> Besides, can you illustrate the detail implementation ? >>>>>> >>>>>> Yes and no, dma_fence_get_status() is some specific handling for >>>>>> sync_file debugging (no idea why that made it into the common >>>>>> fence code). >>>>>> >>>>>> It was replaced by putting the error code directly into the fence, >>>>>> so just reading that one after waiting should be ok. >>>>>> >>>>>> Maybe we should fix dma_fence_get_status() to do the right thing >>>>>> for this? >>>>>> >>>>>> [ML] yeah, thatâ??s too confusing, the name sound really the one I >>>>>> want to use, we should change itâ?¦ >>>>>> >>>>>> *But look into the implement, I don**â??t see why we cannot use it ? >>>>>> it also finally return the fence->error * >>>>>> >>>>>> *From:*Koenig, Christian >>>>>> *Sent:* Wednesday, October 11, 2017 3:21 PM >>>>>> *To:* Liu, Monk <Monk.Liu at amd.com <mailto:Monk.Liu at amd.com>>; >>>>>> Haehnle, Nicolai <Nicolai.Haehnle at amd.com >>>>>> <mailto:Nicolai.Haehnle at amd.com>>; >>>>>> Olsak, Marek <Marek.Olsak at amd.com <mailto:Marek.Olsak at amd.com>>; >>>>>> Deucher, Alexander <Alexander.Deucher at amd.com >>>>>> <mailto:Alexander.Deucher at amd.com>> >>>>>> *Cc:* amd-gfx at lists.freedesktop.org >>>>>> <mailto:amd-gfx at lists.freedesktop.org>; Ding, Pixel >>>>>> <Pixel.Ding at amd.com <mailto:Pixel.Ding at amd.com>>; Jiang, Jerry >>>>>> (SW) <Jerry.Jiang at amd.com <mailto:Jerry.Jiang at amd.com>>; Li, >>>>>> Bingley <Bingley.Li at amd.com <mailto:Bingley.Li at amd.com>>; Ramirez, >>>>>> Alejandro <Alejandro.Ramirez at amd.com >>>>>> <mailto:Alejandro.Ramirez at amd.com>>; >>>>>> Filipas, Mario <Mario.Filipas at amd.com >>>>>> <mailto:Mario.Filipas at amd.com>> >>>>>> *Subject:* Re: TDR and VRAM lost handling in KMD: >>>>>> >>>>>> See inline: >>>>>> >>>>>> Am 11.10.2017 um 07:33 schrieb Liu, Monk: >>>>>> >>>>>> Hi Christian & Nicolai, >>>>>> >>>>>> We need to achieve some agreements on what should MESA/UMD >>>>>> do and >>>>>> what should KMD do, *please give your comments with â??okayâ?? >>>>>> or â??Noâ?? >>>>>> and your idea on below items,* >>>>>> >>>>>> ?When a job timed out (set from lockup_timeout kernel >>>>>> parameter), >>>>>> What KMD should do in TDR routine : >>>>>> >>>>>> 1.Update adev->*gpu_reset_counter*, and stop scheduler first, >>>>>> (*gpu_reset_counter* is used to force vm flush after GPU >>>>>> reset, out >>>>>> of this threadâ??s scope so no more discussion on it) >>>>>> >>>>>> Okay. >>>>>> >>>>>> 2.Set its fence error status to â??*ETIME*â??, >>>>>> >>>>>> No, as I already explained ETIME is for synchronous operation. >>>>>> >>>>>> In other words when we return ETIME from the wait IOCTL it would >>>>>> mean that the waiting has somehow timed out, but not the job we waited for. >>>>>> >>>>>> Please use ECANCELED as well or some other error code when we find >>>>>> that we need to distinct the timedout job from the canceled ones >>>>>> (probably a good idea, but I'm not sure). >>>>>> >>>>>> 3.Find the entity/ctx behind this job, and set this ctx as >>>>>> â??*guilty*â?? >>>>>> >>>>>> Not sure. Do we want to set the whole context as guilty or just >>>>>> the entity? >>>>>> >>>>>> Setting the whole contexts as guilty sounds racy to me. >>>>>> >>>>>> BTW: We should use a different name than "guilty", maybe just >>>>>> "bool canceled;" ? >>>>>> >>>>>> 4.Kick out this job from schedulerâ??s mirror list, so this >>>>>> job wonâ??t >>>>>> get re-scheduled to ring anymore. >>>>>> >>>>>> Okay. >>>>>> >>>>>> 5.Kick out all jobs in this â??guiltyâ?? ctxâ??s KFIFO queue, and >>>>>> set all >>>>>> their fence status to â??*ECANCELED*â?? >>>>>> >>>>>> Setting ECANCELED should be ok. But I think we should do this when >>>>>> we try to run the jobs and not during GPU reset. >>>>>> >>>>>> 6.Force signal all fences that get kicked out by above two >>>>>> steps,*otherwise UMD will block forever if waiting on those >>>>>> fences* >>>>>> >>>>>> Okay. >>>>>> >>>>>> 7.Do gpu reset, which is can be some callbacks to let >>>>>> bare-metal and >>>>>> SR-IOV implement with their favor style >>>>>> >>>>>> Okay. >>>>>> >>>>>> 8.After reset, KMD need to aware if the VRAM lost happens >>>>>> or not, >>>>>> bare-metal can implement some function to judge, while for >>>>>> SR-IOV I >>>>>> prefer to read it from GIM side (for initial version we consider >>>>>> itâ??s always VRAM lost, till GIM side change aligned) >>>>>> >>>>>> Okay. >>>>>> >>>>>> 9.If VRAM lost not hit, continue, otherwise: >>>>>> >>>>>> a)Update adev->*vram_lost_counter*, >>>>>> >>>>>> Okay. >>>>>> >>>>>> b)Iterate over all living ctx, and set all ctx as â??*guilty*â?? >>>>>> since >>>>>> VRAM lost actually ruins all VRAM contents >>>>>> >>>>>> No, that shouldn't be done by comparing the counters. Iterating >>>>>> over all contexts is way to much overhead. >>>>>> >>>>>> c)Kick out all jobs in all ctxâ??s KFIFO queue, and set all their >>>>>> fence status to â??*ECANCELDED*â?? >>>>>> >>>>>> Yes and no, that should be done when we try to run the jobs and >>>>>> not during GPU reset. >>>>>> >>>>>> 10.Do GTT recovery and VRAM page tables/entries recovery >>>>>> (optional, >>>>>> do we need it ???) >>>>>> >>>>>> Yes, that is still needed. As Nicolai explained we can't be sure >>>>>> that VRAM is still 100% correct even when it isn't cleared. >>>>>> >>>>>> 11.Re-schedule all JOBs remains in mirror list to ring again and >>>>>> restart scheduler (for VRAM lost case, no JOB will >>>>>> re-scheduled) >>>>>> >>>>>> Okay. >>>>>> >>>>>> ?For cs_wait() IOCTL: >>>>>> >>>>>> After it found fence signaled, it should check with >>>>>> *â??dma_fence_get_statusâ?? *to see if there is error there, >>>>>> >>>>>> And return the error status of fence >>>>>> >>>>>> Yes and no, dma_fence_get_status() is some specific handling for >>>>>> sync_file debugging (no idea why that made it into the common >>>>>> fence code). >>>>>> >>>>>> It was replaced by putting the error code directly into the fence, >>>>>> so just reading that one after waiting should be ok. >>>>>> >>>>>> Maybe we should fix dma_fence_get_status() to do the right thing >>>>>> for this? >>>>>> >>>>>> ?For cs_wait_fences() IOCTL: >>>>>> >>>>>> Similar with above approach >>>>>> >>>>>> ?For cs_submit() IOCTL: >>>>>> >>>>>> It need to check if current ctx been marked as â??*guilty*â?? >>>>>> and return >>>>>> â??*ECANCELED*â?? if so >>>>>> >>>>>> ?Introduce a new IOCTL to let UMD query *vram_lost_counter*: >>>>>> >>>>>> This way, UMD can also block app from submitting, like @Nicolai >>>>>> mentioned, we can cache one copy of *vram_lost_counter* when >>>>>> enumerate physical device, and deny all >>>>>> >>>>>> gl-context from submitting if the counter queried bigger >>>>>> than that >>>>>> one cached in physical device. (looks a little overkill to >>>>>> me, but >>>>>> easy to implement ) >>>>>> >>>>>> UMD can also return error to APP when creating gl-context >>>>>> if found >>>>>> current queried*vram_lost_counter *bigger than that one >>>>>> cached in >>>>>> physical device. >>>>>> >>>>>> Okay. Already have a patch for this, please review that one if you >>>>>> haven't already done so. >>>>>> >>>>>> Regards, >>>>>> Christian. >>>>>> >>>>>> BTW: I realized that gl-context is a little different with >>>>>> kernelâ??s >>>>>> context. Because for kernel. BO is not related with context >>>>>> but only >>>>>> with FD, while in UMD, BO have a backend >>>>>> >>>>>> gl-context, so block submitting in UMD layer is also needed >>>>>> although >>>>>> KMD will do its job as bottom line >>>>>> >>>>>> ?Basically â??vram_lost_counterâ?? is exposure by kernel to let >>>>>> UMD take >>>>>> the control of robust extension feature, it will be UMDâ??s >>>>>> call to >>>>>> move, KMD only deny â??guiltyâ?? context from submitting >>>>>> >>>>>> Need your feedback, thx >>>>>> >>>>>> Weâ??d better make TDR feature landed ASAP >>>>>> >>>>>> BR Monk >>>>>> >>>> _______________________________________________ >>>> amd-gfx mailing list >>>> amd-gfx at lists.freedesktop.org >>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx > > _______________________________________________ > amd-gfx mailing list > amd-gfx at lists.freedesktop.org > https://lists.freedesktop.org/mailman/listinfo/amd-gfx > _______________________________________________ > amd-gfx mailing list > amd-gfx at lists.freedesktop.org > https://lists.freedesktop.org/mailman/listinfo/amd-gfx