According to the initial summary, this situation is already considered: When vram lost hit, all context marked as guilty, and all jobs in guilty context's KFIFO queue will be kicked out Now if we move the kick out from gpu_reset to run_job, then I think your question can be answered by: in run_job(), before each job scheduling, check current vram_lost_counter, compare it with the copy cached during entity init (or context init), skip the job if not match BR Monk -----Original Message----- From: Zhou, David(ChunMing) Sent: Wednesday, October 11, 2017 6:15 PM To: Liu, Monk <Monk.Liu at amd.com>; Haehnle, Nicolai <Nicolai.Haehnle at amd.com>; Koenig, Christian <Christian.Koenig at amd.com>; Olsak, Marek <Marek.Olsak at amd.com>; Deucher, Alexander <Alexander.Deucher at amd.com> Cc: Ramirez, Alejandro <Alejandro.Ramirez at amd.com>; amd-gfx at lists.freedesktop.org; Filipas, Mario <Mario.Filipas at amd.com>; Ding, Pixel <Pixel.Ding at amd.com>; Li, Bingley <Bingley.Li at amd.com>; Jiang, Jerry (SW) <Jerry.Jiang at amd.com> Subject: Re: TDR and VRAM lost handling in KMD: Your summary lacks the below issue: How about the job already pushed in scheduler queue when vram is lost? Regards, David Zhou On 2017å¹´10æ??11æ?¥ 17:41, Liu, Monk wrote: > Okay, let me summary our whole idea together and see if it works: > > 1, For cs_submit, always check vram-lost_counter first and reject the > submit (return -ECANCLED to UMD) if ctx->vram_lost_counter != > adev->vram_lost_counter. That way the vram lost issue can be handled > > 2, for cs_submit we still need to check if the incoming context is > "AMDGPU_CTX_GUILTY_RESET" or not even if we found ctx->vram_lost_counter == adev->vram_lost_counter, and we can reject the submit If it is "AMDGPU_CTX_GUILTY_RESET", correct ? > > 3, in gpu_reset() routine, we only mark the hang job's entity as guilty (so we need to add new member in entity structure), and not kick it out in gpu_reset() stage, but we need to set the context behind this entity as " AMDGPU_CTX_GUILTY_RESET" > And if reset introduces VRAM LOST, we just update adev->vram_lost_counter, but *don't* change all entity to guilty, so still only the hang job's entity is "guilty" > After some entity marked as "guilty", we find a way to set the context behind it as AMDGPU_CTX_GUILTY_RESET, because this is U/K interface, we need let UMD can know that this context is wrong. > > 4, in gpu scheduler's run_job() routine, since it only reads entity, so we skip job scheduling once found the entity is "guilty" > > > Does above sounds good ? > > > > -----Original Message----- > From: Haehnle, Nicolai > Sent: Wednesday, October 11, 2017 5:26 PM > To: Liu, Monk <Monk.Liu at amd.com>; Koenig, Christian > <Christian.Koenig at amd.com>; Olsak, Marek <Marek.Olsak at amd.com>; > Deucher, Alexander <Alexander.Deucher at amd.com> > Cc: amd-gfx at lists.freedesktop.org; Ding, Pixel <Pixel.Ding at amd.com>; > Jiang, Jerry (SW) <Jerry.Jiang at amd.com>; Li, Bingley > <Bingley.Li at amd.com>; Ramirez, Alejandro <Alejandro.Ramirez at amd.com>; > Filipas, Mario <Mario.Filipas at amd.com> > Subject: Re: TDR and VRAM lost handling in KMD: > > On 11.10.2017 11:18, Liu, Monk wrote: >> Let's talk it simple, When vram lost hit, what's the action for amdgpu_ctx_query()/AMDGPU_CTX_OP_QUERY_STATE on other contexts (not the one trigger gpu hang) after vram lost ? do you mean we return -ENODEV to UMD ? > It should successfully return AMDGPU_CTX_INNOCENT_RESET. > > >> In cs_submit, with vram lost hit, if we don't mark all contexts as >> "guilty", how we block its from submitting ? can you show some >> implement way > if (ctx->vram_lost_counter != atomic_read(&adev->vram_lost_counter)) > return -ECANCELED; > > (where ctx->vram_lost_counter is initialized at context creation time > and never changed afterwards) > > >> BTW: the "guilty" here is a new member I want to add to context, it >> is not related with AMDGPU_CTX_OP_QUERY_STATE UK interface, Looks I >> need to unify them and only one place to mark guilty or not > Right, the AMDGPU_CTX_OP_QUERY_STATE handling needs to be made > consistent with the rest. > > Cheers, > Nicolai > > >> >> BR Monk >> >> -----Original Message----- >> From: Haehnle, Nicolai >> Sent: Wednesday, October 11, 2017 5:00 PM >> To: Liu, Monk <Monk.Liu at amd.com>; Koenig, Christian >> <Christian.Koenig at amd.com>; Olsak, Marek <Marek.Olsak at amd.com>; >> Deucher, Alexander <Alexander.Deucher at amd.com> >> Cc: amd-gfx at lists.freedesktop.org; Ding, Pixel <Pixel.Ding at amd.com>; >> Jiang, Jerry (SW) <Jerry.Jiang at amd.com>; Li, Bingley >> <Bingley.Li at amd.com>; Ramirez, Alejandro <Alejandro.Ramirez at amd.com>; >> Filipas, Mario <Mario.Filipas at amd.com> >> Subject: Re: TDR and VRAM lost handling in KMD: >> >> On 11.10.2017 10:48, Liu, Monk wrote: >>> On "guilty": "guilty" is a term that's used by APIs (e.g. OpenGL), >>> so it's reasonable to use it. However, it /does not/ make sense to >>> mark idle contexts as "guilty" just because VRAM is lost. VRAM lost >>> is a perfect example where the driver should report context lost to >>> applications with the "innocent" flag for contexts that were idle at >>> the time of reset. The only context(s) that should be reported as "guilty" >>> (or perhaps "unknown" in some cases) are the ones that were >>> executing at the time of reset. >>> >>> ML: KMD mark all contexts as guilty is because that way we can unify >>> our IOCTL behavior: e.g. for IOCTL only block â??guiltyâ??context , no >>> need to worry about vram-lost-counter anymore, thatâ??s a >>> implementation style. I donâ??t think it is related with UMD layer, >>> >>> For UMD the gl-context isnâ??t aware of by KMD, so UMD can implement >>> it own â??guiltyâ?? gl-context if you want. >> Well, to some extent this is just semantics, but it helps to keep the terminology consistent. >> >> Most importantly, please keep the AMDGPU_CTX_OP_QUERY_STATE uapi in >> mind: this returns one of >> AMDGPU_CTX_{GUILTY,INNOCENT,UNKNOWN}_RECENT, >> and it must return "innocent" for contexts that are only lost due to VRAM lost without being otherwise involved in the timeout that lead to the reset. >> >> The point is that in the places where you used "guilty" it would be better to use "context lost", and then further differentiate between guilty/innocent context lost based on the details of what happened. >> >> >>> If KMD doesnâ??t mark all ctx as guilty after VRAM lost, can you >>> illustrate what rule KMD should obey to check in KMS IOCTL like >>> cs_sumbit ?? letâ??s see which way better >> if (ctx->vram_lost_counter != atomic_read(&adev->vram_lost_counter)) >> return -ECANCELED; >> >> Plus similar logic for AMDGPU_CTX_OP_QUERY_STATE. >> >> Yes, it's one additional check in cs_submit. If you're worried about >> that (and Christian's concerns about possible issues with walking >> over all contexts are addressed), I suppose you could just store a >> per-context >> >> unsigned context_reset_status; >> >> instead of a `bool guilty`. Its value would start out as 0 >> (AMDGPU_CTX_NO_RESET) and would be set to the correct value during reset. >> >> Cheers, >> Nicolai >> >> >>> *From:*Haehnle, Nicolai >>> *Sent:* Wednesday, October 11, 2017 4:41 PM >>> *To:* Liu, Monk <Monk.Liu at amd.com>; Koenig, Christian >>> <Christian.Koenig at amd.com>; Olsak, Marek <Marek.Olsak at amd.com>; >>> Deucher, Alexander <Alexander.Deucher at amd.com> >>> *Cc:* amd-gfx at lists.freedesktop.org; Ding, Pixel >>> <Pixel.Ding at amd.com>; Jiang, Jerry (SW) <Jerry.Jiang at amd.com>; Li, >>> Bingley <Bingley.Li at amd.com>; Ramirez, Alejandro >>> <Alejandro.Ramirez at amd.com>; Filipas, Mario <Mario.Filipas at amd.com> >>> *Subject:* Re: TDR and VRAM lost handling in KMD: >>> >>> From a Mesa perspective, this almost all sounds reasonable to me. >>> >>> On "guilty": "guilty" is a term that's used by APIs (e.g. OpenGL), >>> so it's reasonable to use it. However, it /does not/ make sense to >>> mark idle contexts as "guilty" just because VRAM is lost. VRAM lost >>> is a perfect example where the driver should report context lost to >>> applications with the "innocent" flag for contexts that were idle at >>> the time of reset. The only context(s) that should be reported as "guilty" >>> (or perhaps "unknown" in some cases) are the ones that were >>> executing at the time of reset. >>> >>> On whether the whole context is marked as guilty from a user space >>> perspective, it would simply be nice for user space to get >>> consistent answers. It would be a bit odd if we could e.g. succeed >>> in submitting an SDMA job after a GFX job was rejected. This would >>> point in favor of marking the entire context as guilty (although >>> that could happen lazily instead of at reset time). On the other >>> hand, if that's too big a burden for the kernel implementation I'm sure we can live without it. >>> >>> Cheers, >>> >>> Nicolai >>> >>> -------------------------------------------------------------------- >>> -- >>> -- >>> >>> *From:*Liu, Monk >>> *Sent:* Wednesday, October 11, 2017 10:15:40 AM >>> *To:* Koenig, Christian; Haehnle, Nicolai; Olsak, Marek; Deucher, >>> Alexander >>> *Cc:* amd-gfx at lists.freedesktop.org >>> <mailto:amd-gfx at lists.freedesktop.org>; Ding, Pixel; Jiang, Jerry >>> (SW); Li, Bingley; Ramirez, Alejandro; Filipas, Mario >>> *Subject:* RE: TDR and VRAM lost handling in KMD: >>> >>> 1.Set its fence error status to â??*ETIME*â??, >>> >>> No, as I already explained ETIME is for synchronous operation. >>> >>> In other words when we return ETIME from the wait IOCTL it would >>> mean that the waiting has somehow timed out, but not the job we waited for. >>> >>> Please use ECANCELED as well or some other error code when we find >>> that we need to distinct the timedout job from the canceled ones >>> (probably a good idea, but I'm not sure). >>> >>> [ML] Iâ??m okay if you insist not to use ETIME >>> >>> 1.Find the entity/ctx behind this job, and set this ctx as â??*guilty*â?? >>> >>> Not sure. Do we want to set the whole context as guilty or just the entity? >>> >>> Setting the whole contexts as guilty sounds racy to me. >>> >>> BTW: We should use a different name than "guilty", maybe just "bool >>> canceled;" ? >>> >>> [ML] I think context is better than entity, because for example if >>> you only block entity_0 of context and allow entity_N run, that >>> means the dependency between entities are broken (e.g. page table >>> updates in >>> >>> Sdma entity pass but gfx submit in GFX entity blocked, not make >>> sense to me) >>> >>> Weâ??d better either block the whole context or let notâ?¦ >>> >>> 1.Kick out all jobs in this â??guiltyâ?? ctxâ??s KFIFO queue, and set all >>> their fence status to â??*ECANCELED*â?? >>> >>> Setting ECANCELED should be ok. But I think we should do this when >>> we try to run the jobs and not during GPU reset. >>> >>> [ML] without deep thought and expritment, Iâ??m not sure the >>> difference between them, but kick it out in gpu_reset routine is >>> more efficient, >>> >>> Otherwise you need to check context/entity guilty flag in run_job >>> routine â?¦and you need to it for every context/entity, I donâ??t see >>> why >>> >>> We donâ??t just kickout all of them in gpu_reset stage â?¦. >>> >>> a)Iterate over all living ctx, and set all ctx as â??*guilty*â?? since >>> VRAM lost actually ruins all VRAM contents >>> >>> No, that shouldn't be done by comparing the counters. Iterating over >>> all contexts is way to much overhead. >>> >>> [ML] because I want to make KMS IOCTL rules clean, like they donâ??t >>> need to differentiate VRAM lost or not, they only interested in if >>> the context is guilty or not, and block >>> >>> Submit for guilty ones. >>> >>> *Can you give more details of your idea? And better the detail >>> implement in cs_submit, I want to see how you want to block submit >>> without checking context guilty flag* >>> >>> a)Kick out all jobs in all ctxâ??s KFIFO queue, and set all their >>> fence status to â??*ECANCELDED*â?? >>> >>> Yes and no, that should be done when we try to run the jobs and not >>> during GPU reset. >>> >>> [ML] again, kicking out them in gpu reset routine is high efficient, >>> otherwise you need check on every job in run_job() >>> >>> Besides, can you illustrate the detail implementation ? >>> >>> Yes and no, dma_fence_get_status() is some specific handling for >>> sync_file debugging (no idea why that made it into the common fence code). >>> >>> It was replaced by putting the error code directly into the fence, >>> so just reading that one after waiting should be ok. >>> >>> Maybe we should fix dma_fence_get_status() to do the right thing for this? >>> >>> [ML] yeah, thatâ??s too confusing, the name sound really the one I >>> want to use, we should change itâ?¦ >>> >>> *But look into the implement, I don**â??t see why we cannot use it ? >>> it also finally return the fence->error * >>> >>> *From:*Koenig, Christian >>> *Sent:* Wednesday, October 11, 2017 3:21 PM >>> *To:* Liu, Monk <Monk.Liu at amd.com <mailto:Monk.Liu at amd.com>>; >>> Haehnle, Nicolai <Nicolai.Haehnle at amd.com >>> <mailto:Nicolai.Haehnle at amd.com>>; >>> Olsak, Marek <Marek.Olsak at amd.com <mailto:Marek.Olsak at amd.com>>; >>> Deucher, Alexander <Alexander.Deucher at amd.com >>> <mailto:Alexander.Deucher at amd.com>> >>> *Cc:* amd-gfx at lists.freedesktop.org >>> <mailto:amd-gfx at lists.freedesktop.org>; Ding, Pixel >>> <Pixel.Ding at amd.com <mailto:Pixel.Ding at amd.com>>; Jiang, Jerry (SW) >>> <Jerry.Jiang at amd.com <mailto:Jerry.Jiang at amd.com>>; Li, Bingley >>> <Bingley.Li at amd.com <mailto:Bingley.Li at amd.com>>; Ramirez, Alejandro >>> <Alejandro.Ramirez at amd.com <mailto:Alejandro.Ramirez at amd.com>>; >>> Filipas, Mario <Mario.Filipas at amd.com >>> <mailto:Mario.Filipas at amd.com>> >>> *Subject:* Re: TDR and VRAM lost handling in KMD: >>> >>> See inline: >>> >>> Am 11.10.2017 um 07:33 schrieb Liu, Monk: >>> >>> Hi Christian & Nicolai, >>> >>> We need to achieve some agreements on what should MESA/UMD do and >>> what should KMD do, *please give your comments with â??okayâ?? or â??Noâ?? >>> and your idea on below items,* >>> >>> ?When a job timed out (set from lockup_timeout kernel parameter), >>> What KMD should do in TDR routine : >>> >>> 1.Update adev->*gpu_reset_counter*, and stop scheduler first, >>> (*gpu_reset_counter* is used to force vm flush after GPU reset, out >>> of this threadâ??s scope so no more discussion on it) >>> >>> Okay. >>> >>> 2.Set its fence error status to â??*ETIME*â??, >>> >>> No, as I already explained ETIME is for synchronous operation. >>> >>> In other words when we return ETIME from the wait IOCTL it would >>> mean that the waiting has somehow timed out, but not the job we waited for. >>> >>> Please use ECANCELED as well or some other error code when we find >>> that we need to distinct the timedout job from the canceled ones >>> (probably a good idea, but I'm not sure). >>> >>> 3.Find the entity/ctx behind this job, and set this ctx as â??*guilty*â?? >>> >>> Not sure. Do we want to set the whole context as guilty or just the entity? >>> >>> Setting the whole contexts as guilty sounds racy to me. >>> >>> BTW: We should use a different name than "guilty", maybe just "bool >>> canceled;" ? >>> >>> 4.Kick out this job from schedulerâ??s mirror list, so this job wonâ??t >>> get re-scheduled to ring anymore. >>> >>> Okay. >>> >>> 5.Kick out all jobs in this â??guiltyâ?? ctxâ??s KFIFO queue, and set all >>> their fence status to â??*ECANCELED*â?? >>> >>> Setting ECANCELED should be ok. But I think we should do this when >>> we try to run the jobs and not during GPU reset. >>> >>> 6.Force signal all fences that get kicked out by above two >>> steps,*otherwise UMD will block forever if waiting on those >>> fences* >>> >>> Okay. >>> >>> 7.Do gpu reset, which is can be some callbacks to let bare-metal and >>> SR-IOV implement with their favor style >>> >>> Okay. >>> >>> 8.After reset, KMD need to aware if the VRAM lost happens or not, >>> bare-metal can implement some function to judge, while for SR-IOV I >>> prefer to read it from GIM side (for initial version we consider >>> itâ??s always VRAM lost, till GIM side change aligned) >>> >>> Okay. >>> >>> 9.If VRAM lost not hit, continue, otherwise: >>> >>> a)Update adev->*vram_lost_counter*, >>> >>> Okay. >>> >>> b)Iterate over all living ctx, and set all ctx as â??*guilty*â?? since >>> VRAM lost actually ruins all VRAM contents >>> >>> No, that shouldn't be done by comparing the counters. Iterating over >>> all contexts is way to much overhead. >>> >>> c)Kick out all jobs in all ctxâ??s KFIFO queue, and set all their >>> fence status to â??*ECANCELDED*â?? >>> >>> Yes and no, that should be done when we try to run the jobs and not >>> during GPU reset. >>> >>> 10.Do GTT recovery and VRAM page tables/entries recovery (optional, >>> do we need it ???) >>> >>> Yes, that is still needed. As Nicolai explained we can't be sure >>> that VRAM is still 100% correct even when it isn't cleared. >>> >>> 11.Re-schedule all JOBs remains in mirror list to ring again and >>> restart scheduler (for VRAM lost case, no JOB will >>> re-scheduled) >>> >>> Okay. >>> >>> ?For cs_wait() IOCTL: >>> >>> After it found fence signaled, it should check with >>> *â??dma_fence_get_statusâ?? *to see if there is error there, >>> >>> And return the error status of fence >>> >>> Yes and no, dma_fence_get_status() is some specific handling for >>> sync_file debugging (no idea why that made it into the common fence code). >>> >>> It was replaced by putting the error code directly into the fence, >>> so just reading that one after waiting should be ok. >>> >>> Maybe we should fix dma_fence_get_status() to do the right thing for this? >>> >>> ?For cs_wait_fences() IOCTL: >>> >>> Similar with above approach >>> >>> ?For cs_submit() IOCTL: >>> >>> It need to check if current ctx been marked as â??*guilty*â?? and return >>> â??*ECANCELED*â?? if so >>> >>> ?Introduce a new IOCTL to let UMD query *vram_lost_counter*: >>> >>> This way, UMD can also block app from submitting, like @Nicolai >>> mentioned, we can cache one copy of *vram_lost_counter* when >>> enumerate physical device, and deny all >>> >>> gl-context from submitting if the counter queried bigger than that >>> one cached in physical device. (looks a little overkill to me, but >>> easy to implement ) >>> >>> UMD can also return error to APP when creating gl-context if found >>> current queried*vram_lost_counter *bigger than that one cached in >>> physical device. >>> >>> Okay. Already have a patch for this, please review that one if you >>> haven't already done so. >>> >>> Regards, >>> Christian. >>> >>> BTW: I realized that gl-context is a little different with kernelâ??s >>> context. Because for kernel. BO is not related with context but only >>> with FD, while in UMD, BO have a backend >>> >>> gl-context, so block submitting in UMD layer is also needed although >>> KMD will do its job as bottom line >>> >>> ?Basically â??vram_lost_counterâ?? is exposure by kernel to let UMD take >>> the control of robust extension feature, it will be UMDâ??s call to >>> move, KMD only deny â??guiltyâ?? context from submitting >>> >>> Need your feedback, thx >>> >>> Weâ??d better make TDR feature landed ASAP >>> >>> BR Monk >>> > _______________________________________________ > amd-gfx mailing list > amd-gfx at lists.freedesktop.org > https://lists.freedesktop.org/mailman/listinfo/amd-gfx