Okay, let me summarize our whole idea together and see if it works:

1. For cs_submit, always check the vram_lost_counter first and reject the submit (return -ECANCELED to UMD) if ctx->vram_lost_counter != adev->vram_lost_counter. That way the VRAM lost issue is handled.

2. For cs_submit we still need to check whether the incoming context is AMDGPU_CTX_GUILTY_RESET, even if we found ctx->vram_lost_counter == adev->vram_lost_counter, and we reject the submit if it is AMDGPU_CTX_GUILTY_RESET. Correct?

3. In the gpu_reset() routine, we only mark the hanging job's entity as guilty (so we need to add a new member to the entity structure), and we do not kick it out in the gpu_reset() stage, but we need to set the context behind this entity to AMDGPU_CTX_GUILTY_RESET. And if the reset introduces VRAM lost, we just update adev->vram_lost_counter but *don't* mark all entities as guilty, so still only the hanging job's entity is "guilty". After an entity is marked as "guilty", we find a way to set the context behind it to AMDGPU_CTX_GUILTY_RESET; because this is the U/K interface, we need to let UMD know that this context is broken.

4. In the GPU scheduler's run_job() routine, since it only reads the entity, we skip job scheduling once we find the entity is "guilty".

Does the above sound good?
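The two cs_submit checks in points 1 and 2 above could be sketched as a small userspace model. This is not the actual driver code: amdgpu_device/amdgpu_ctx here are simplified stand-ins for the real kernel structures, and C11 atomic_load stands in for the kernel's atomic_read.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>

/* Values mirror the AMDGPU_CTX_*_RESET uapi constants. */
#define AMDGPU_CTX_NO_RESET     0u
#define AMDGPU_CTX_GUILTY_RESET 1u

struct amdgpu_device {
    atomic_uint vram_lost_counter; /* bumped by gpu_reset() on VRAM loss */
};

struct amdgpu_ctx {
    unsigned int vram_lost_counter; /* snapshot taken at context creation */
    unsigned int reset_status;      /* set to GUILTY by the TDR routine */
};

/* Check 1: reject submits from contexts that predate a VRAM loss.
 * Check 2: reject submits from contexts marked guilty by TDR. */
static int cs_submit_check(struct amdgpu_device *adev, struct amdgpu_ctx *ctx)
{
    if (ctx->vram_lost_counter != atomic_load(&adev->vram_lost_counter))
        return -ECANCELED;
    if (ctx->reset_status == AMDGPU_CTX_GUILTY_RESET)
        return -ECANCELED;
    return 0;
}
```

Note that the guilty check only matters when the counters match: a context that survived a VRAM loss is rejected by the first test regardless of guilt.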
-----Original Message-----
From: Haehnle, Nicolai
Sent: Wednesday, October 11, 2017 5:26 PM
To: Liu, Monk <Monk.Liu at amd.com>; Koenig, Christian <Christian.Koenig at amd.com>; Olsak, Marek <Marek.Olsak at amd.com>; Deucher, Alexander <Alexander.Deucher at amd.com>
Cc: amd-gfx at lists.freedesktop.org; Ding, Pixel <Pixel.Ding at amd.com>; Jiang, Jerry (SW) <Jerry.Jiang at amd.com>; Li, Bingley <Bingley.Li at amd.com>; Ramirez, Alejandro <Alejandro.Ramirez at amd.com>; Filipas, Mario <Mario.Filipas at amd.com>
Subject: Re: TDR and VRAM lost handling in KMD:

On 11.10.2017 11:18, Liu, Monk wrote:
> Let's keep it simple: when VRAM lost hits, what is the action for
> amdgpu_ctx_query()/AMDGPU_CTX_OP_QUERY_STATE on other contexts (not the
> one that triggered the GPU hang)? Do you mean we return -ENODEV to UMD?

It should successfully return AMDGPU_CTX_INNOCENT_RESET.

> In cs_submit, with VRAM lost hit, if we don't mark all contexts as
> "guilty", how do we block them from submitting? Can you show an
> implementation?

if (ctx->vram_lost_counter != atomic_read(&adev->vram_lost_counter))
        return -ECANCELED;

(where ctx->vram_lost_counter is initialized at context creation time and never changed afterwards)

> BTW: the "guilty" here is a new member I want to add to the context; it is not related to the AMDGPU_CTX_OP_QUERY_STATE UK interface.
> Looks like I need to unify them so there is only one place to mark guilty or not.

Right, the AMDGPU_CTX_OP_QUERY_STATE handling needs to be made consistent with the rest.
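A consistent AMDGPU_CTX_OP_QUERY_STATE along these lines could look roughly like the following sketch. It is a simplified userspace model, not the actual driver: the constants mirror the amdgpu uapi, but dev_state/ctx_state and ctx_query_state are illustrative names.

```c
#include <assert.h>
#include <stdatomic.h>

/* Values mirror the AMDGPU_CTX_*_RESET uapi constants. */
#define AMDGPU_CTX_NO_RESET       0u
#define AMDGPU_CTX_GUILTY_RESET   1u
#define AMDGPU_CTX_INNOCENT_RESET 2u
#define AMDGPU_CTX_UNKNOWN_RESET  3u

struct dev_state {
    atomic_uint vram_lost_counter;
};

struct ctx_state {
    unsigned int vram_lost_counter;    /* snapshot at context creation */
    unsigned int context_reset_status; /* starts at NO_RESET */
};

/* A context explicitly marked by the TDR routine reports that status
 * (e.g. GUILTY); a context that was idle but lost its VRAM reports
 * INNOCENT; anything else reports NO_RESET. */
static unsigned int ctx_query_state(struct dev_state *dev,
                                    struct ctx_state *ctx)
{
    if (ctx->context_reset_status != AMDGPU_CTX_NO_RESET)
        return ctx->context_reset_status;
    if (ctx->vram_lost_counter != atomic_load(&dev->vram_lost_counter))
        return AMDGPU_CTX_INNOCENT_RESET;
    return AMDGPU_CTX_NO_RESET;
}
```

This matches the requirement stated above: after a VRAM loss, idle contexts come back as "innocent" rather than "guilty", without iterating over all contexts at reset time.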
Cheers,
Nicolai

> BR Monk
>
> -----Original Message-----
> From: Haehnle, Nicolai
> Sent: Wednesday, October 11, 2017 5:00 PM
> To: Liu, Monk <Monk.Liu at amd.com>; Koenig, Christian <Christian.Koenig at amd.com>; Olsak, Marek <Marek.Olsak at amd.com>; Deucher, Alexander <Alexander.Deucher at amd.com>
> Cc: amd-gfx at lists.freedesktop.org; Ding, Pixel <Pixel.Ding at amd.com>; Jiang, Jerry (SW) <Jerry.Jiang at amd.com>; Li, Bingley <Bingley.Li at amd.com>; Ramirez, Alejandro <Alejandro.Ramirez at amd.com>; Filipas, Mario <Mario.Filipas at amd.com>
> Subject: Re: TDR and VRAM lost handling in KMD:
>
> On 11.10.2017 10:48, Liu, Monk wrote:
>> On "guilty": "guilty" is a term that's used by APIs (e.g. OpenGL), so
>> it's reasonable to use it. However, it /does not/ make sense to mark
>> idle contexts as "guilty" just because VRAM is lost. VRAM lost is a
>> perfect example where the driver should report context lost to
>> applications with the "innocent" flag for contexts that were idle at
>> the time of reset. The only context(s) that should be reported as "guilty"
>> (or perhaps "unknown" in some cases) are the ones that were executing
>> at the time of reset.
>>
>> ML: KMD marks all contexts as guilty because that way we can unify
>> our IOCTL behavior: e.g. the IOCTL only blocks the "guilty" context and
>> no longer needs to worry about vram_lost_counter; that's an
>> implementation style. I don't think it is related to the UMD layer.
>>
>> For UMD the gl-context isn't known to KMD, so UMD can implement its
>> own "guilty" gl-context if it wants.
>
> Well, to some extent this is just semantics, but it helps to keep the terminology consistent.
>
> Most importantly, please keep the AMDGPU_CTX_OP_QUERY_STATE uapi in
> mind: this returns one of AMDGPU_CTX_{GUILTY,INNOCENT,UNKNOWN}_RESET,
> and it must return "innocent" for contexts that are only lost due to VRAM lost without being otherwise involved in the timeout that led to the reset.
> The point is that in the places where you used "guilty" it would be better to use "context lost", and then further differentiate between guilty/innocent context lost based on the details of what happened.
>
>> If KMD doesn't mark all ctx as guilty after VRAM lost, can you
>> illustrate what rule KMD should obey to check in a KMS IOCTL like
>> cs_submit? Let's see which way is better.
>
> if (ctx->vram_lost_counter != atomic_read(&adev->vram_lost_counter))
>         return -ECANCELED;
>
> Plus similar logic for AMDGPU_CTX_OP_QUERY_STATE.
>
> Yes, it's one additional check in cs_submit. If you're worried about that (and Christian's concerns about possible issues with walking over all contexts are addressed), I suppose you could just store a per-context
>
> unsigned context_reset_status;
>
> instead of a `bool guilty`. Its value would start out as 0 (AMDGPU_CTX_NO_RESET) and would be set to the correct value during reset.
>
> Cheers,
> Nicolai
>
>> *From:* Haehnle, Nicolai
>> *Sent:* Wednesday, October 11, 2017 4:41 PM
>> *To:* Liu, Monk <Monk.Liu at amd.com>; Koenig, Christian <Christian.Koenig at amd.com>; Olsak, Marek <Marek.Olsak at amd.com>; Deucher, Alexander <Alexander.Deucher at amd.com>
>> *Cc:* amd-gfx at lists.freedesktop.org; Ding, Pixel <Pixel.Ding at amd.com>; Jiang, Jerry (SW) <Jerry.Jiang at amd.com>; Li, Bingley <Bingley.Li at amd.com>; Ramirez, Alejandro <Alejandro.Ramirez at amd.com>; Filipas, Mario <Mario.Filipas at amd.com>
>> *Subject:* Re: TDR and VRAM lost handling in KMD:
>>
>> From a Mesa perspective, this almost all sounds reasonable to me.
>>
>> On "guilty": "guilty" is a term that's used by APIs (e.g. OpenGL), so
>> it's reasonable to use it. However, it /does not/ make sense to mark
>> idle contexts as "guilty" just because VRAM is lost. VRAM lost is a
>> perfect example where the driver should report context lost to
>> applications with the "innocent" flag for contexts that were idle at
>> the time of reset.
>> The only context(s) that should be reported as "guilty"
>> (or perhaps "unknown" in some cases) are the ones that were executing
>> at the time of reset.
>>
>> On whether the whole context is marked as guilty from a user space
>> perspective, it would simply be nice for user space to get consistent
>> answers. It would be a bit odd if we could e.g. succeed in submitting
>> an SDMA job after a GFX job was rejected. This would point in favor of
>> marking the entire context as guilty (although that could happen
>> lazily instead of at reset time). On the other hand, if that's too big
>> a burden for the kernel implementation I'm sure we can live without it.
>>
>> Cheers,
>>
>> Nicolai
>>
>> ------------------------------------------------------------------------
>>
>> *From:* Liu, Monk
>> *Sent:* Wednesday, October 11, 2017 10:15:40 AM
>> *To:* Koenig, Christian; Haehnle, Nicolai; Olsak, Marek; Deucher, Alexander
>> *Cc:* amd-gfx at lists.freedesktop.org; Ding, Pixel; Jiang, Jerry (SW); Li, Bingley; Ramirez, Alejandro; Filipas, Mario
>> *Subject:* RE: TDR and VRAM lost handling in KMD:
>>
>> 2. Set its fence error status to *ETIME*.
>>
>> No, as I already explained, ETIME is for synchronous operations.
>>
>> In other words, when we return ETIME from the wait IOCTL it would mean
>> that the waiting has somehow timed out, but not the job we waited for.
>>
>> Please use ECANCELED as well, or some other error code if we find
>> that we need to distinguish the timed-out job from the canceled ones
>> (probably a good idea, but I'm not sure).
>>
>> [ML] I'm okay if you insist not to use ETIME.
>>
>> 3. Find the entity/ctx behind this job, and set this ctx as *guilty*.
>>
>> Not sure. Do we want to set the whole context as guilty or just the entity?
>>
>> Setting the whole context as guilty sounds racy to me.
>>
>> BTW: We should use a different name than "guilty", maybe just "bool canceled;"?
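The ETIME vs. ECANCELED discussion above comes down to storing an error on the fence before force-signaling it, then letting waiters read it back. A toy model of that convention (fence_model and its helpers are illustrative names; the kernel equivalents are dma_fence_set_error() and the fence->error field, and a real wait would of course block instead of returning -EBUSY):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

struct fence_model {
    bool signaled;
    int error; /* 0 on success, negative errno on cancellation */
};

/* TDR path: the error must be stored before force-signaling, so a
 * waiter that wakes up always sees it. */
static void fence_cancel(struct fence_model *f)
{
    f->error = -ECANCELED; /* per the discussion: ECANCELED, not ETIME */
    f->signaled = true;    /* force-signal so UMD waiters don't block forever */
}

/* cs_wait()-style caller: once the fence has signaled, the stored
 * error is the job's final status. */
static int fence_wait_status(const struct fence_model *f)
{
    if (!f->signaled)
        return -EBUSY; /* placeholder: a real wait would block here */
    return f->error;
}
```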
>>
>> [ML] I think context is better than entity, because for example if you
>> only block entity_0 of a context and allow entity_N to run, that means
>> the dependencies between entities are broken (e.g. page table updates
>> in the SDMA entity pass but the gfx submit in the GFX entity is
>> blocked; that makes no sense to me).
>>
>> We'd better either block the whole context or not at all...
>>
>> 5. Kick out all jobs in this *guilty* ctx's KFIFO queue, and set all
>> their fence status to *ECANCELED*.
>>
>> Setting ECANCELED should be ok. But I think we should do this when we
>> try to run the jobs and not during GPU reset.
>>
>> [ML] Without deep thought and experiment, I'm not sure of the
>> difference between them, but kicking them out in the gpu_reset routine
>> is more efficient;
>>
>> otherwise you need to check the context/entity guilty flag in the
>> run_job routine... and you need to do it for every context/entity. I
>> don't see why
>>
>> we don't just kick all of them out in the gpu_reset stage...
>>
>> b) Iterate over all living ctx, and set all ctx as *guilty* since
>> VRAM lost actually ruins all VRAM contents.
>>
>> No, that shouldn't be done by comparing the counters. Iterating over
>> all contexts is way too much overhead.
>>
>> [ML] Because I want to keep the KMS IOCTL rules clean: they don't
>> need to differentiate VRAM lost or not, they are only interested in
>> whether the context is guilty or not, and block
>>
>> submits for the guilty ones.
>>
>> *Can you give more details of your idea? And better, the detailed
>> implementation in cs_submit; I want to see how you want to block
>> submits without checking the context guilty flag.*
>>
>> c) Kick out all jobs in all ctx's KFIFO queues, and set all their
>> fence status to *ECANCELED*.
>>
>> Yes and no, that should be done when we try to run the jobs and not
>> during GPU reset.
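Christian's alternative of handling cancellation when the jobs are run, rather than walking every queue during reset, might look like the following sketch. sched_entity/sched_job are simplified stand-ins for the real gpu_scheduler structures, not the actual code:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Simplified model of a scheduler entity and a job queued on it. */
struct sched_entity {
    bool guilty; /* set by the TDR routine for the hanging job's entity */
};

struct sched_job {
    struct sched_entity *entity;
    int fence_error; /* 0, or the error forced onto the job's fence */
};

/* run_job-time variant: each job is checked only when it would actually
 * be scheduled, and is canceled if its entity was marked guilty. The
 * reset path then only has to flag one entity, not drain every queue. */
static bool sched_run_job(struct sched_job *job)
{
    if (job->entity->guilty) {
        job->fence_error = -ECANCELED; /* fence is signaled with this error */
        return false;                  /* skip hardware submission */
    }
    return true; /* submit to the ring as usual */
}
```

The trade-off debated above is visible here: the check costs one branch per job at run time, in exchange for a cheaper and less racy reset path.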
>>
>> [ML] Again, kicking them out in the gpu reset routine is highly
>> efficient; otherwise you need to check every job in run_job().
>>
>> Besides, can you illustrate the detailed implementation?
>>
>> Yes and no, dma_fence_get_status() is some specific handling for
>> sync_file debugging (no idea why that made it into the common fence code).
>>
>> It was replaced by putting the error code directly into the fence, so
>> just reading that one after waiting should be ok.
>>
>> Maybe we should fix dma_fence_get_status() to do the right thing for this?
>>
>> [ML] Yeah, that's too confusing; the name really sounds like the one I
>> want to use. We should change it...
>>
>> *But looking into the implementation, I don't see why we cannot use
>> it. It also finally returns the fence->error.*
>>
>> *From:* Koenig, Christian
>> *Sent:* Wednesday, October 11, 2017 3:21 PM
>> *To:* Liu, Monk <Monk.Liu at amd.com>; Haehnle, Nicolai <Nicolai.Haehnle at amd.com>; Olsak, Marek <Marek.Olsak at amd.com>; Deucher, Alexander <Alexander.Deucher at amd.com>
>> *Cc:* amd-gfx at lists.freedesktop.org; Ding, Pixel <Pixel.Ding at amd.com>; Jiang, Jerry (SW) <Jerry.Jiang at amd.com>; Li, Bingley <Bingley.Li at amd.com>; Ramirez, Alejandro <Alejandro.Ramirez at amd.com>; Filipas, Mario <Mario.Filipas at amd.com>
>> *Subject:* Re: TDR and VRAM lost handling in KMD:
>>
>> See inline:
>>
>> Am 11.10.2017 um 07:33 schrieb Liu, Monk:
>>
>> Hi Christian & Nicolai,
>>
>> We need to achieve some agreement on what MESA/UMD should do and
>> what KMD should do; *please give your comments with "okay" or "No"
>> and your idea on the items below.*
>>
>> - When a job times out (set via the lockup_timeout kernel parameter),
>> what should KMD do in the TDR routine:
>>
>> 1. Update adev->*gpu_reset_counter* and stop the scheduler first.
>> (*gpu_reset_counter* is used to force a VM flush after GPU reset; out
>> of this thread's scope, so no more discussion on it.)
>>
>> Okay.
>>
>> 2. Set its fence error status to *ETIME*.
>>
>> No, as I already explained, ETIME is for synchronous operations.
>>
>> In other words, when we return ETIME from the wait IOCTL it would mean
>> that the waiting has somehow timed out, but not the job we waited for.
>>
>> Please use ECANCELED as well, or some other error code if we find
>> that we need to distinguish the timed-out job from the canceled ones
>> (probably a good idea, but I'm not sure).
>>
>> 3. Find the entity/ctx behind this job, and set this ctx as *guilty*.
>>
>> Not sure. Do we want to set the whole context as guilty or just the entity?
>>
>> Setting the whole context as guilty sounds racy to me.
>>
>> BTW: We should use a different name than "guilty", maybe just "bool canceled;"?
>>
>> 4. Kick this job out of the scheduler's mirror list, so it won't
>> get re-scheduled to the ring anymore.
>>
>> Okay.
>>
>> 5. Kick out all jobs in this *guilty* ctx's KFIFO queue, and set all
>> their fence status to *ECANCELED*.
>>
>> Setting ECANCELED should be ok. But I think we should do this when we
>> try to run the jobs and not during GPU reset.
>>
>> 6. Force-signal all fences that get kicked out by the above two
>> steps, *otherwise UMD will block forever if waiting on those fences*.
>>
>> Okay.
>>
>> 7. Do the GPU reset, which can be some callbacks so that bare-metal and
>> SR-IOV can implement it in their own style.
>>
>> Okay.
>>
>> 8. After reset, KMD needs to know whether VRAM lost happened or not.
>> Bare-metal can implement some function to judge, while for SR-IOV I
>> prefer to read it from the GIM side (for the initial version we
>> consider it always VRAM lost, until the GIM side change is aligned).
>>
>> Okay.
>>
>> 9. If VRAM lost is not hit, continue; otherwise:
>>
>> a) Update adev->*vram_lost_counter*.
>>
>> Okay.
>>
>> b) Iterate over all living ctx, and set all ctx as *guilty* since
>> VRAM lost actually ruins all VRAM contents.
>>
>> No, that shouldn't be done by comparing the counters. Iterating over
>> all contexts is way too much overhead.
>>
>> c) Kick out all jobs in all ctx's KFIFO queues, and set all their
>> fence status to *ECANCELED*.
>>
>> Yes and no, that should be done when we try to run the jobs and not
>> during GPU reset.
>>
>> 10. Do GTT recovery and VRAM page tables/entries recovery (optional,
>> do we need it ???).
>>
>> Yes, that is still needed. As Nicolai explained, we can't be sure that
>> VRAM is still 100% correct even when it isn't cleared.
>>
>> 11. Re-schedule all jobs remaining in the mirror list to the ring
>> again and restart the scheduler (for the VRAM lost case, no job will
>> be re-scheduled).
>>
>> Okay.
>>
>> - For the cs_wait() IOCTL:
>>
>> After it finds the fence signaled, it should check with
>> *dma_fence_get_status* to see if there is an error there,
>> and return the error status of the fence.
>>
>> Yes and no, dma_fence_get_status() is some specific handling for
>> sync_file debugging (no idea why that made it into the common fence code).
>>
>> It was replaced by putting the error code directly into the fence, so
>> just reading that one after waiting should be ok.
>>
>> Maybe we should fix dma_fence_get_status() to do the right thing for this?
>>
>> - For the cs_wait_fences() IOCTL:
>>
>> Similar to the above approach.
>>
>> - For the cs_submit() IOCTL:
>>
>> It needs to check whether the current ctx has been marked as *guilty*
>> and return *ECANCELED*
>> if so.
>>
>> - Introduce a new IOCTL to let UMD query *vram_lost_counter*:
>>
>> This way, UMD can also block the app from submitting. Like @Nicolai
>> mentioned, we can cache one copy of *vram_lost_counter* when
>> enumerating the physical device, and deny all
>> gl-contexts from submitting if the queried counter is bigger than the
>> one cached in the physical device. (Looks a little overkill to me, but
>> easy to implement.)
>>
>> UMD can also return an error to the app when creating a gl-context if
>> the currently queried *vram_lost_counter* is bigger than the one
>> cached in the physical device.
>>
>> Okay. Already have a patch for this, please review that one if you
>> haven't already done so.
>>
>> Regards,
>> Christian.
>>
>> BTW: I realized that a gl-context is a little different from the
>> kernel's context, because for the kernel a BO is not related to a
>> context but only to an FD, while in UMD a BO has a backing
>> gl-context. So blocking submission in the UMD layer is also needed,
>> although KMD will do its job as the bottom line.
>>
>> - Basically *vram_lost_counter* is exposed by the kernel to let UMD
>> take control of the robustness extension feature; it will be UMD's
>> call to move, and KMD only denies "guilty" contexts from submitting.
>>
>> Need your feedback, thx.
>>
>> We'd better make the TDR feature land ASAP.
>>
>> BR Monk
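The UMD-side scheme described above (cache vram_lost_counter at device enumeration, snapshot it per gl-context, and declare a gl-context lost once a later query returns a larger value) could be sketched as follows. All names here are hypothetical, since the actual query IOCTL is the patch under review:

```c
#include <assert.h>
#include <stdbool.h>

/* UMD-side state: the counter cached when the physical device was
 * enumerated, and the per-gl-context snapshot taken at creation. */
struct umd_device {
    unsigned int cached_vram_lost_counter;
};

struct umd_gl_context {
    unsigned int vram_lost_counter;
};

/* A gl-context inherits the counter value current at creation time. */
static struct umd_gl_context gl_context_create(const struct umd_device *dev)
{
    struct umd_gl_context ctx = { dev->cached_vram_lost_counter };
    return ctx;
}

/* On each submit (or robustness query), the UMD compares a freshly
 * queried counter against the snapshot; a larger value means VRAM was
 * lost since this gl-context was created, so its submissions are
 * blocked before they ever reach the KMD. */
static bool gl_context_lost(unsigned int queried_counter,
                            const struct umd_gl_context *ctx)
{
    return queried_counter > ctx->vram_lost_counter;
}
```

As noted in the thread, this UMD check is belt-and-braces: the KMD remains the bottom line and would reject the submit anyway.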