> Some jobs don't have a context (VM updates, clears, buffer moves). What? I remember even the VM update job is with a kernel entity, (no context is true), and if entity can keep a counter copy That can solve your concerns -----Original Message----- From: Koenig, Christian Sent: Wednesday, October 11, 2017 9:39 PM To: Liu, Monk <Monk.Liu at amd.com>; Zhou, David(ChunMing) <David1.Zhou at amd.com>; Haehnle, Nicolai <Nicolai.Haehnle at amd.com>; Olsak, Marek <Marek.Olsak at amd.com>; Deucher, Alexander <Alexander.Deucher at amd.com> Cc: Ramirez, Alejandro <Alejandro.Ramirez at amd.com>; amd-gfx at lists.freedesktop.org; Filipas, Mario <Mario.Filipas at amd.com>; Ding, Pixel <Pixel.Ding at amd.com>; Li, Bingley <Bingley.Li at amd.com>; Jiang, Jerry (SW) <Jerry.Jiang at amd.com> Subject: Re: TDR and VRAM lost handling in KMD: Some jobs don't have a context (VM updates, clears, buffer moves). I would still like to abort those when they where issued before a losing VRAM content, but keep the entity usable. So I think we should just keep a copy of the VRAM lost counter in the job. That also removes us from the burden of figuring out the context during job run. Regards, Christian. Am 11.10.2017 um 15:35 schrieb Liu, Monk: > I think just compare the copy from context/entity with current counter > is enough, don't see how it's better to keep another copy in JOB > > > -----Original Message----- > From: Koenig, Christian > Sent: Wednesday, October 11, 2017 6:40 PM > To: Zhou, David(ChunMing) <David1.Zhou at amd.com>; Liu, Monk > <Monk.Liu at amd.com>; Haehnle, Nicolai <Nicolai.Haehnle at amd.com>; Olsak, > Marek <Marek.Olsak at amd.com>; Deucher, Alexander > <Alexander.Deucher at amd.com> > Cc: Ramirez, Alejandro <Alejandro.Ramirez at amd.com>; > amd-gfx at lists.freedesktop.org; Filipas, Mario <Mario.Filipas at amd.com>; > Ding, Pixel <Pixel.Ding at amd.com>; Li, Bingley <Bingley.Li at amd.com>; > Jiang, Jerry (SW) <Jerry.Jiang at amd.com> > Subject: Re: TDR and VRAM lost handling in KMD: > > I've already posted a patch for this on the mailing list. > > Basically we just copy the vram lost counter into the job and when we try to run the job we can mark it as canceled. > > Regards, > Christian. > > Am 11.10.2017 um 12:14 schrieb Chunming Zhou: >> Your summary lacks the below issue: >> >> How about the job already pushed in scheduler queue when vram is lost? >> >> >> Regards, >> David Zhou >> On 2017å¹´10æ??11æ?¥ 17:41, Liu, Monk wrote: >>> Okay, let me summary our whole idea together and see if it works: >>> >>> 1, For cs_submit, always check vram-lost_counter first and reject >>> the submit (return -ECANCLED to UMD) if ctx->vram_lost_counter != >>> adev->vram_lost_counter. That way the vram lost issue can be handled >>> >>> 2, for cs_submit we still need to check if the incoming context is >>> "AMDGPU_CTX_GUILTY_RESET" or not even if we found >>> ctx->vram_lost_counter == adev->vram_lost_counter, and we can reject >>> the submit >>> If it is "AMDGPU_CTX_GUILTY_RESET", correct ? >>> >>> 3, in gpu_reset() routine, we only mark the hang job's entity as >>> guilty (so we need to add new member in entity structure), and not >>> kick it out in gpu_reset() stage, but we need to set the context >>> behind this entity as " AMDGPU_CTX_GUILTY_RESET" >>> And if reset introduces VRAM LOST, we just update >>> adev->vram_lost_counter, but *don't* change all entity to guilty, so >>> still only the hang job's entity is "guilty" >>> After some entity marked as "guilty", we find a way to set the >>> context behind it as AMDGPU_CTX_GUILTY_RESET, because this is U/K >>> interface, we need let UMD can know that this context is wrong. >>> >>> 4, in gpu scheduler's run_job() routine, since it only reads entity, >>> so we skip job scheduling once found the entity is "guilty" >>> >>> >>> Does above sounds good ? >>> >>> >>> >>> -----Original Message----- >>> From: Haehnle, Nicolai >>> Sent: Wednesday, October 11, 2017 5:26 PM >>> To: Liu, Monk <Monk.Liu at amd.com>; Koenig, Christian >>> <Christian.Koenig at amd.com>; Olsak, Marek <Marek.Olsak at amd.com>; >>> Deucher, Alexander <Alexander.Deucher at amd.com> >>> Cc: amd-gfx at lists.freedesktop.org; Ding, Pixel <Pixel.Ding at amd.com>; >>> Jiang, Jerry (SW) <Jerry.Jiang at amd.com>; Li, Bingley >>> <Bingley.Li at amd.com>; Ramirez, Alejandro >>> <Alejandro.Ramirez at amd.com>; Filipas, Mario <Mario.Filipas at amd.com> >>> Subject: Re: TDR and VRAM lost handling in KMD: >>> >>> On 11.10.2017 11:18, Liu, Monk wrote: >>>> Let's talk it simple, When vram lost hit, what's the action for >>>> amdgpu_ctx_query()/AMDGPU_CTX_OP_QUERY_STATE on other contexts (not >>>> the one trigger gpu hang) after vram lost ? do you mean we return >>>> -ENODEV to UMD ? >>> It should successfully return AMDGPU_CTX_INNOCENT_RESET. >>> >>> >>>> In cs_submit, with vram lost hit, if we don't mark all contexts as >>>> "guilty", how we block its from submitting ? can you show some >>>> implement way >>> if (ctx->vram_lost_counter != atomic_read(&adev->vram_lost_counter)) >>> return -ECANCELED; >>> >>> (where ctx->vram_lost_counter is initialized at context creation >>> time and never changed afterwards) >>> >>> >>>> BTW: the "guilty" here is a new member I want to add to context, it >>>> is not related with AMDGPU_CTX_OP_QUERY_STATE UK interface, Looks I >>>> need to unify them and only one place to mark guilty or not >>> Right, the AMDGPU_CTX_OP_QUERY_STATE handling needs to be made >>> consistent with the rest. >>> >>> Cheers, >>> Nicolai >>> >>> >>>> BR Monk >>>> >>>> -----Original Message----- >>>> From: Haehnle, Nicolai >>>> Sent: Wednesday, October 11, 2017 5:00 PM >>>> To: Liu, Monk <Monk.Liu at amd.com>; Koenig, Christian >>>> <Christian.Koenig at amd.com>; Olsak, Marek <Marek.Olsak at amd.com>; >>>> Deucher, Alexander <Alexander.Deucher at amd.com> >>>> Cc: amd-gfx at lists.freedesktop.org; Ding, Pixel >>>> <Pixel.Ding at amd.com>; Jiang, Jerry (SW) <Jerry.Jiang at amd.com>; Li, >>>> Bingley <Bingley.Li at amd.com>; Ramirez, Alejandro >>>> <Alejandro.Ramirez at amd.com>; Filipas, Mario <Mario.Filipas at amd.com> >>>> Subject: Re: TDR and VRAM lost handling in KMD: >>>> >>>> On 11.10.2017 10:48, Liu, Monk wrote: >>>>> On "guilty": "guilty" is a term that's used by APIs (e.g. OpenGL), >>>>> so it's reasonable to use it. However, it /does not/ make sense to >>>>> mark idle contexts as "guilty" just because VRAM is lost. VRAM >>>>> lost is a perfect example where the driver should report context >>>>> lost to applications with the "innocent" flag for contexts that >>>>> were idle at the time of reset. The only context(s) that should be >>>>> reported as "guilty" >>>>> (or perhaps "unknown" in some cases) are the ones that were >>>>> executing at the time of reset. >>>>> >>>>> ML: KMD mark all contexts as guilty is because that way we can >>>>> unify our IOCTL behavior: e.g. for IOCTL only block >>>>> â??guiltyâ??context , no need to worry about vram-lost-counter >>>>> anymore, thatâ??s a implementation style. I donâ??t think it is >>>>> related with UMD layer, >>>>> >>>>> For UMD the gl-context isnâ??t aware of by KMD, so UMD can implement >>>>> it own â??guiltyâ?? gl-context if you want. >>>> Well, to some extent this is just semantics, but it helps to keep >>>> the terminology consistent. >>>> >>>> Most importantly, please keep the AMDGPU_CTX_OP_QUERY_STATE uapi in >>>> mind: this returns one of >>>> AMDGPU_CTX_{GUILTY,INNOCENT,UNKNOWN}_RECENT, >>>> and it must return "innocent" for contexts that are only lost due >>>> to VRAM lost without being otherwise involved in the timeout that >>>> lead to the reset. >>>> >>>> The point is that in the places where you used "guilty" it would be >>>> better to use "context lost", and then further differentiate >>>> between guilty/innocent context lost based on the details of what happened. >>>> >>>> >>>>> If KMD doesnâ??t mark all ctx as guilty after VRAM lost, can you >>>>> illustrate what rule KMD should obey to check in KMS IOCTL like >>>>> cs_sumbit ?? letâ??s see which way better >>>> if (ctx->vram_lost_counter != atomic_read(&adev->vram_lost_counter)) >>>> return -ECANCELED; >>>> >>>> Plus similar logic for AMDGPU_CTX_OP_QUERY_STATE. >>>> >>>> Yes, it's one additional check in cs_submit. If you're worried >>>> about that (and Christian's concerns about possible issues with >>>> walking over all contexts are addressed), I suppose you could just >>>> store a per-context >>>> >>>> unsigned context_reset_status; >>>> >>>> instead of a `bool guilty`. Its value would start out as 0 >>>> (AMDGPU_CTX_NO_RESET) and would be set to the correct value during >>>> reset. >>>> >>>> Cheers, >>>> Nicolai >>>> >>>> >>>>> *From:*Haehnle, Nicolai >>>>> *Sent:* Wednesday, October 11, 2017 4:41 PM >>>>> *To:* Liu, Monk <Monk.Liu at amd.com>; Koenig, Christian >>>>> <Christian.Koenig at amd.com>; Olsak, Marek <Marek.Olsak at amd.com>; >>>>> Deucher, Alexander <Alexander.Deucher at amd.com> >>>>> *Cc:* amd-gfx at lists.freedesktop.org; Ding, Pixel >>>>> <Pixel.Ding at amd.com>; Jiang, Jerry (SW) <Jerry.Jiang at amd.com>; Li, >>>>> Bingley <Bingley.Li at amd.com>; Ramirez, Alejandro >>>>> <Alejandro.Ramirez at amd.com>; Filipas, Mario >>>>> <Mario.Filipas at amd.com> >>>>> *Subject:* Re: TDR and VRAM lost handling in KMD: >>>>> >>>>> From a Mesa perspective, this almost all sounds reasonable to me. >>>>> >>>>> On "guilty": "guilty" is a term that's used by APIs (e.g. OpenGL), >>>>> so it's reasonable to use it. However, it /does not/ make sense to >>>>> mark idle contexts as "guilty" just because VRAM is lost. VRAM >>>>> lost is a perfect example where the driver should report context >>>>> lost to applications with the "innocent" flag for contexts that >>>>> were idle at the time of reset. The only context(s) that should be >>>>> reported as "guilty" >>>>> (or perhaps "unknown" in some cases) are the ones that were >>>>> executing at the time of reset. >>>>> >>>>> On whether the whole context is marked as guilty from a user space >>>>> perspective, it would simply be nice for user space to get >>>>> consistent answers. It would be a bit odd if we could e.g. succeed >>>>> in submitting an SDMA job after a GFX job was rejected. This would >>>>> point in favor of marking the entire context as guilty (although >>>>> that could happen lazily instead of at reset time). On the other >>>>> hand, if that's too big a burden for the kernel implementation I'm >>>>> sure we can live without it. >>>>> >>>>> Cheers, >>>>> >>>>> Nicolai >>>>> >>>>> ------------------------------------------------------------------ >>>>> - >>>>> --- >>>>> -- >>>>> >>>>> *From:*Liu, Monk >>>>> *Sent:* Wednesday, October 11, 2017 10:15:40 AM >>>>> *To:* Koenig, Christian; Haehnle, Nicolai; Olsak, Marek; Deucher, >>>>> Alexander >>>>> *Cc:* amd-gfx at lists.freedesktop.org >>>>> <mailto:amd-gfx at lists.freedesktop.org>; Ding, Pixel; Jiang, Jerry >>>>> (SW); Li, Bingley; Ramirez, Alejandro; Filipas, Mario >>>>> *Subject:* RE: TDR and VRAM lost handling in KMD: >>>>> >>>>> 1.Set its fence error status to â??*ETIME*â??, >>>>> >>>>> No, as I already explained ETIME is for synchronous operation. >>>>> >>>>> In other words when we return ETIME from the wait IOCTL it would >>>>> mean that the waiting has somehow timed out, but not the job we waited for. >>>>> >>>>> Please use ECANCELED as well or some other error code when we find >>>>> that we need to distinct the timedout job from the canceled ones >>>>> (probably a good idea, but I'm not sure). >>>>> >>>>> [ML] Iâ??m okay if you insist not to use ETIME >>>>> >>>>> 1.Find the entity/ctx behind this job, and set this ctx as â??*guilty*â?? >>>>> >>>>> Not sure. Do we want to set the whole context as guilty or just >>>>> the entity? >>>>> >>>>> Setting the whole contexts as guilty sounds racy to me. >>>>> >>>>> BTW: We should use a different name than "guilty", maybe just >>>>> "bool canceled;" ? >>>>> >>>>> [ML] I think context is better than entity, because for example if >>>>> you only block entity_0 of context and allow entity_N run, that >>>>> means the dependency between entities are broken (e.g. page table >>>>> updates in >>>>> >>>>> Sdma entity pass but gfx submit in GFX entity blocked, not make >>>>> sense to me) >>>>> >>>>> Weâ??d better either block the whole context or let notâ?¦ >>>>> >>>>> 1.Kick out all jobs in this â??guiltyâ?? ctxâ??s KFIFO queue, and set >>>>> all their fence status to â??*ECANCELED*â?? >>>>> >>>>> Setting ECANCELED should be ok. But I think we should do this when >>>>> we try to run the jobs and not during GPU reset. >>>>> >>>>> [ML] without deep thought and expritment, Iâ??m not sure the >>>>> difference between them, but kick it out in gpu_reset routine is >>>>> more efficient, >>>>> >>>>> Otherwise you need to check context/entity guilty flag in run_job >>>>> routine â?¦and you need to it for every context/entity, I donâ??t see >>>>> why >>>>> >>>>> We donâ??t just kickout all of them in gpu_reset stage â?¦. >>>>> >>>>> a)Iterate over all living ctx, and set all ctx as â??*guilty*â?? since >>>>> VRAM lost actually ruins all VRAM contents >>>>> >>>>> No, that shouldn't be done by comparing the counters. Iterating >>>>> over all contexts is way to much overhead. >>>>> >>>>> [ML] because I want to make KMS IOCTL rules clean, like they donâ??t >>>>> need to differentiate VRAM lost or not, they only interested in if >>>>> the context is guilty or not, and block >>>>> >>>>> Submit for guilty ones. >>>>> >>>>> *Can you give more details of your idea? And better the detail >>>>> implement in cs_submit, I want to see how you want to block submit >>>>> without checking context guilty flag* >>>>> >>>>> a)Kick out all jobs in all ctxâ??s KFIFO queue, and set all their >>>>> fence status to â??*ECANCELDED*â?? >>>>> >>>>> Yes and no, that should be done when we try to run the jobs and >>>>> not during GPU reset. >>>>> >>>>> [ML] again, kicking out them in gpu reset routine is high >>>>> efficient, otherwise you need check on every job in run_job() >>>>> >>>>> Besides, can you illustrate the detail implementation ? >>>>> >>>>> Yes and no, dma_fence_get_status() is some specific handling for >>>>> sync_file debugging (no idea why that made it into the common >>>>> fence code). >>>>> >>>>> It was replaced by putting the error code directly into the fence, >>>>> so just reading that one after waiting should be ok. >>>>> >>>>> Maybe we should fix dma_fence_get_status() to do the right thing >>>>> for this? >>>>> >>>>> [ML] yeah, thatâ??s too confusing, the name sound really the one I >>>>> want to use, we should change itâ?¦ >>>>> >>>>> *But look into the implement, I don**â??t see why we cannot use it ? >>>>> it also finally return the fence->error * >>>>> >>>>> *From:*Koenig, Christian >>>>> *Sent:* Wednesday, October 11, 2017 3:21 PM >>>>> *To:* Liu, Monk <Monk.Liu at amd.com <mailto:Monk.Liu at amd.com>>; >>>>> Haehnle, Nicolai <Nicolai.Haehnle at amd.com >>>>> <mailto:Nicolai.Haehnle at amd.com>>; >>>>> Olsak, Marek <Marek.Olsak at amd.com <mailto:Marek.Olsak at amd.com>>; >>>>> Deucher, Alexander <Alexander.Deucher at amd.com >>>>> <mailto:Alexander.Deucher at amd.com>> >>>>> *Cc:* amd-gfx at lists.freedesktop.org >>>>> <mailto:amd-gfx at lists.freedesktop.org>; Ding, Pixel >>>>> <Pixel.Ding at amd.com <mailto:Pixel.Ding at amd.com>>; Jiang, Jerry >>>>> (SW) <Jerry.Jiang at amd.com <mailto:Jerry.Jiang at amd.com>>; Li, >>>>> Bingley <Bingley.Li at amd.com <mailto:Bingley.Li at amd.com>>; Ramirez, >>>>> Alejandro <Alejandro.Ramirez at amd.com >>>>> <mailto:Alejandro.Ramirez at amd.com>>; >>>>> Filipas, Mario <Mario.Filipas at amd.com >>>>> <mailto:Mario.Filipas at amd.com>> >>>>> *Subject:* Re: TDR and VRAM lost handling in KMD: >>>>> >>>>> See inline: >>>>> >>>>> Am 11.10.2017 um 07:33 schrieb Liu, Monk: >>>>> >>>>> Hi Christian & Nicolai, >>>>> >>>>> We need to achieve some agreements on what should MESA/UMD >>>>> do and >>>>> what should KMD do, *please give your comments with â??okayâ?? >>>>> or â??Noâ?? >>>>> and your idea on below items,* >>>>> >>>>> ?When a job timed out (set from lockup_timeout kernel >>>>> parameter), >>>>> What KMD should do in TDR routine : >>>>> >>>>> 1.Update adev->*gpu_reset_counter*, and stop scheduler first, >>>>> (*gpu_reset_counter* is used to force vm flush after GPU >>>>> reset, out >>>>> of this threadâ??s scope so no more discussion on it) >>>>> >>>>> Okay. >>>>> >>>>> 2.Set its fence error status to â??*ETIME*â??, >>>>> >>>>> No, as I already explained ETIME is for synchronous operation. >>>>> >>>>> In other words when we return ETIME from the wait IOCTL it would >>>>> mean that the waiting has somehow timed out, but not the job we waited for. >>>>> >>>>> Please use ECANCELED as well or some other error code when we find >>>>> that we need to distinct the timedout job from the canceled ones >>>>> (probably a good idea, but I'm not sure). >>>>> >>>>> 3.Find the entity/ctx behind this job, and set this ctx as >>>>> â??*guilty*â?? >>>>> >>>>> Not sure. Do we want to set the whole context as guilty or just >>>>> the entity? >>>>> >>>>> Setting the whole contexts as guilty sounds racy to me. >>>>> >>>>> BTW: We should use a different name than "guilty", maybe just >>>>> "bool canceled;" ? >>>>> >>>>> 4.Kick out this job from schedulerâ??s mirror list, so this >>>>> job wonâ??t >>>>> get re-scheduled to ring anymore. >>>>> >>>>> Okay. >>>>> >>>>> 5.Kick out all jobs in this â??guiltyâ?? ctxâ??s KFIFO queue, and >>>>> set all >>>>> their fence status to â??*ECANCELED*â?? >>>>> >>>>> Setting ECANCELED should be ok. But I think we should do this when >>>>> we try to run the jobs and not during GPU reset. >>>>> >>>>> 6.Force signal all fences that get kicked out by above two >>>>> steps,*otherwise UMD will block forever if waiting on those >>>>> fences* >>>>> >>>>> Okay. >>>>> >>>>> 7.Do gpu reset, which is can be some callbacks to let >>>>> bare-metal and >>>>> SR-IOV implement with their favor style >>>>> >>>>> Okay. >>>>> >>>>> 8.After reset, KMD need to aware if the VRAM lost happens >>>>> or not, >>>>> bare-metal can implement some function to judge, while for >>>>> SR-IOV I >>>>> prefer to read it from GIM side (for initial version we consider >>>>> itâ??s always VRAM lost, till GIM side change aligned) >>>>> >>>>> Okay. >>>>> >>>>> 9.If VRAM lost not hit, continue, otherwise: >>>>> >>>>> a)Update adev->*vram_lost_counter*, >>>>> >>>>> Okay. >>>>> >>>>> b)Iterate over all living ctx, and set all ctx as â??*guilty*â?? >>>>> since >>>>> VRAM lost actually ruins all VRAM contents >>>>> >>>>> No, that shouldn't be done by comparing the counters. Iterating >>>>> over all contexts is way to much overhead. >>>>> >>>>> c)Kick out all jobs in all ctxâ??s KFIFO queue, and set all their >>>>> fence status to â??*ECANCELDED*â?? >>>>> >>>>> Yes and no, that should be done when we try to run the jobs and >>>>> not during GPU reset. >>>>> >>>>> 10.Do GTT recovery and VRAM page tables/entries recovery >>>>> (optional, >>>>> do we need it ???) >>>>> >>>>> Yes, that is still needed. As Nicolai explained we can't be sure >>>>> that VRAM is still 100% correct even when it isn't cleared. >>>>> >>>>> 11.Re-schedule all JOBs remains in mirror list to ring again and >>>>> restart scheduler (for VRAM lost case, no JOB will >>>>> re-scheduled) >>>>> >>>>> Okay. >>>>> >>>>> ?For cs_wait() IOCTL: >>>>> >>>>> After it found fence signaled, it should check with >>>>> *â??dma_fence_get_statusâ?? *to see if there is error there, >>>>> >>>>> And return the error status of fence >>>>> >>>>> Yes and no, dma_fence_get_status() is some specific handling for >>>>> sync_file debugging (no idea why that made it into the common >>>>> fence code). >>>>> >>>>> It was replaced by putting the error code directly into the fence, >>>>> so just reading that one after waiting should be ok. >>>>> >>>>> Maybe we should fix dma_fence_get_status() to do the right thing >>>>> for this? >>>>> >>>>> ?For cs_wait_fences() IOCTL: >>>>> >>>>> Similar with above approach >>>>> >>>>> ?For cs_submit() IOCTL: >>>>> >>>>> It need to check if current ctx been marked as â??*guilty*â?? >>>>> and return >>>>> â??*ECANCELED*â?? if so >>>>> >>>>> ?Introduce a new IOCTL to let UMD query *vram_lost_counter*: >>>>> >>>>> This way, UMD can also block app from submitting, like @Nicolai >>>>> mentioned, we can cache one copy of *vram_lost_counter* when >>>>> enumerate physical device, and deny all >>>>> >>>>> gl-context from submitting if the counter queried bigger >>>>> than that >>>>> one cached in physical device. (looks a little overkill to >>>>> me, but >>>>> easy to implement ) >>>>> >>>>> UMD can also return error to APP when creating gl-context >>>>> if found >>>>> current queried*vram_lost_counter *bigger than that one >>>>> cached in >>>>> physical device. >>>>> >>>>> Okay. Already have a patch for this, please review that one if you >>>>> haven't already done so. >>>>> >>>>> Regards, >>>>> Christian. >>>>> >>>>> BTW: I realized that gl-context is a little different with >>>>> kernelâ??s >>>>> context. Because for kernel. BO is not related with context >>>>> but only >>>>> with FD, while in UMD, BO have a backend >>>>> >>>>> gl-context, so block submitting in UMD layer is also needed >>>>> although >>>>> KMD will do its job as bottom line >>>>> >>>>> ?Basically â??vram_lost_counterâ?? is exposure by kernel to let >>>>> UMD take >>>>> the control of robust extension feature, it will be UMDâ??s >>>>> call to >>>>> move, KMD only deny â??guiltyâ?? context from submitting >>>>> >>>>> Need your feedback, thx >>>>> >>>>> Weâ??d better make TDR feature landed ASAP >>>>> >>>>> BR Monk >>>>> >>> _______________________________________________ >>> amd-gfx mailing list >>> amd-gfx at lists.freedesktop.org >>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx