TDR and VRAM lost handling in KMD:

christian.koenig@xxxxxxx (Christian König) · Wed, 11 Oct 2017 11:02:47 +0200

> [ML] I think context is better than entity, because for example if you 
> only block entity_0 of context and allow entity_N run, that means the 
> dependency between entities are broken (e.g. page table updates in
>
> Sdma entity pass but gfx submit in GFX entity blocked, not make sense 
> to me)
>
> Weâ??d better either block the whole context or let notâ?¦
>
Page table updates are not part of any context.

So I think the only thing we can do is to mark the entity as not 
scheduled any more.

> 1.Kick out all jobs in this â??guiltyâ?? ctxâ??s KFIFO queue, and set all 
> their fence status to â??*ECANCELED*â??
>
> Setting ECANCELED should be ok. But I think we should do this when we 
> try to run the jobs and not during GPU reset.
>
> [ML] without deep thought and expritment, Iâ??m not sure the difference 
> between them, but kick it out in gpu_reset routine is more efficient,
>
I really don't think so. Kicking them out during gpu_reset sounds racy 
to me once more.

And marking them canceled when we try to run them has the clear 
advantage that all dependencies are meet first.

> ML: KMD mark all contexts as guilty is because that way we can unify 
> our IOCTL behavior: e.g. for IOCTL only block â??guiltyâ??context , no 
> need to worry about vram-lost-counter anymore, thatâ??s a implementation 
> style. I donâ??t think it is related with UMD layer,
>
I don't think that this is a good idea. Instead when you want to unify 
the behavior we should use the vram_lost_counter as marker for the 
guilty context.

Regards,
Christian.

Am 11.10.2017 um 10:48 schrieb Liu, Monk:
>
> On "guilty": "guilty" is a term that's used by APIs (e.g. OpenGL), so 
> it's reasonable to use it. However, it /does not/ make sense to mark 
> idle contexts as "guilty" just because VRAM is lost. VRAM lost is a 
> perfect example where the driver should report context lost to 
> applications with the "innocent" flag for contexts that were idle at 
> the time of reset. The only context(s) that should be reported as 
> "guilty" (or perhaps "unknown" in some cases) are the ones that were 
> executing at the time of reset.
>
> ML: KMD mark all contexts as guilty is because that way we can unify 
> our IOCTL behavior: e.g. for IOCTL only block â??guiltyâ??context , no 
> need to worry about vram-lost-counter anymore, thatâ??s a implementation 
> style. I donâ??t think it is related with UMD layer,
>
> For UMD the gl-context isnâ??t aware of by KMD, so UMD can implement it 
> own â??guiltyâ?? gl-context if you want.
>
> If KMD doesnâ??t mark all ctx as guilty after VRAM lost, can you 
> illustrate what rule KMD should obey to check in KMS IOCTL like 
> cs_sumbit ?? letâ??s see which way better
>
> *From:*Haehnle, Nicolai
> *Sent:* Wednesday, October 11, 2017 4:41 PM
> *To:* Liu, Monk <Monk.Liu at amd.com>; Koenig, Christian 
> <Christian.Koenig at amd.com>; Olsak, Marek <Marek.Olsak at amd.com>; 
> Deucher, Alexander <Alexander.Deucher at amd.com>
> *Cc:* amd-gfx at lists.freedesktop.org; Ding, Pixel <Pixel.Ding at amd.com>; 
> Jiang, Jerry (SW) <Jerry.Jiang at amd.com>; Li, Bingley 
> <Bingley.Li at amd.com>; Ramirez, Alejandro <Alejandro.Ramirez at amd.com>; 
> Filipas, Mario <Mario.Filipas at amd.com>
> *Subject:* Re: TDR and VRAM lost handling in KMD:
>
> From a Mesa perspective, this almost all sounds reasonable to me.
>
> On "guilty": "guilty" is a term that's used by APIs (e.g. OpenGL), so 
> it's reasonable to use it. However, it /does not/ make sense to mark 
> idle contexts as "guilty" just because VRAM is lost. VRAM lost is a 
> perfect example where the driver should report context lost to 
> applications with the "innocent" flag for contexts that were idle at 
> the time of reset. The only context(s) that should be reported as 
> "guilty" (or perhaps "unknown" in some cases) are the ones that were 
> executing at the time of reset.
>
> On whether the whole context is marked as guilty from a user space 
> perspective, it would simply be nice for user space to get consistent 
> answers. It would be a bit odd if we could e.g. succeed in submitting 
> an SDMA job after a GFX job was rejected. This would point in favor of 
> marking the entire context as guilty (although that could happen 
> lazily instead of at reset time). On the other hand, if that's too big 
> a burden for the kernel implementation I'm sure we can live without it.
>
> Cheers,
>
> Nicolai
>
> ------------------------------------------------------------------------
>
> *From:*Liu, Monk
> *Sent:* Wednesday, October 11, 2017 10:15:40 AM
> *To:* Koenig, Christian; Haehnle, Nicolai; Olsak, Marek; Deucher, 
> Alexander
> *Cc:* amd-gfx at lists.freedesktop.org 
> <mailto:amd-gfx at lists.freedesktop.org>; Ding, Pixel; Jiang, Jerry 
> (SW); Li, Bingley; Ramirez, Alejandro; Filipas, Mario
> *Subject:* RE: TDR and VRAM lost handling in KMD:
>
> 1.Set its fence error status to â??*ETIME*â??,
>
> No, as I already explained ETIME is for synchronous operation.
>
> In other words when we return ETIME from the wait IOCTL it would mean 
> that the waiting has somehow timed out, but not the job we waited for.
>
> Please use ECANCELED as well or some other error code when we find 
> that we need to distinct the timedout job from the canceled ones 
> (probably a good idea, but I'm not sure).
>
> [ML] Iâ??m okay if you insist not to use ETIME
>
> 1.Find the entity/ctx behind this job, and set this ctx as â??*guilty*â??
>
> Not sure. Do we want to set the whole context as guilty or just the 
> entity?
>
> Setting the whole contexts as guilty sounds racy to me.
>
> BTW: We should use a different name than "guilty", maybe just "bool 
> canceled;" ?
>
> [ML] I think context is better than entity, because for example if you 
> only block entity_0 of context and allow entity_N run, that means the 
> dependency between entities are broken (e.g. page table updates in
>
> Sdma entity pass but gfx submit in GFX entity blocked, not make sense 
> to me)
>
> Weâ??d better either block the whole context or let notâ?¦
>
> 1.Kick out all jobs in this â??guiltyâ?? ctxâ??s KFIFO queue, and set all 
> their fence status to â??*ECANCELED*â??
>
> Setting ECANCELED should be ok. But I think we should do this when we 
> try to run the jobs and not during GPU reset.
>
> [ML] without deep thought and expritment, Iâ??m not sure the difference 
> between them, but kick it out in gpu_reset routine is more efficient,
>
> Otherwise you need to check context/entity guilty flag in run_job 
> routine â?¦and you need to it for every context/entity, I donâ??t see why
>
> We donâ??t just kickout all of them in gpu_reset stage â?¦.
>
> a)Iterate over all living ctx, and set all ctx as â??*guilty*â?? since 
> VRAM lost actually ruins all VRAM contents
>
> No, that shouldn't be done by comparing the counters. Iterating over 
> all contexts is way to much overhead.
>
> [ML] because I want to make KMS IOCTL rules clean, like they donâ??t 
> need to differentiate VRAM lost or not, they only interested in if the 
> context is guilty or not, and block
>
> Submit for guilty ones.
>
> *Can you give more details of your idea? And better the detail 
> implement in cs_submit, I want to see how you want to block submit 
> without checking context guilty flag*
>
> a)Kick out all jobs in all ctxâ??s KFIFO queue, and set all their fence 
> status to â??*ECANCELDED*â??
>
> Yes and no, that should be done when we try to run the jobs and not 
> during GPU reset.
>
> [ML] again, kicking out them in gpu reset routine is high efficient, 
> otherwise you need check on every job in run_job()
>
> Besides, can you illustrate the detail implementation ?
>
> Yes and no, dma_fence_get_status() is some specific handling for 
> sync_file debugging (no idea why that made it into the common fence code).
>
> It was replaced by putting the error code directly into the fence, so 
> just reading that one after waiting should be ok.
>
> Maybe we should fix dma_fence_get_status() to do the right thing for this?
>
> [ML] yeah, thatâ??s too confusing, the name sound really the one I want 
> to use, we should change itâ?¦
>
> *But look into the implement, I don**â??t see why we cannot use it ? it 
> also finally return the fence->error *
>
> *From:*Koenig, Christian
> *Sent:* Wednesday, October 11, 2017 3:21 PM
> *To:* Liu, Monk <Monk.Liu at amd.com <mailto:Monk.Liu at amd.com>>; Haehnle, 
> Nicolai <Nicolai.Haehnle at amd.com <mailto:Nicolai.Haehnle at amd.com>>; 
> Olsak, Marek <Marek.Olsak at amd.com <mailto:Marek.Olsak at amd.com>>; 
> Deucher, Alexander <Alexander.Deucher at amd.com 
> <mailto:Alexander.Deucher at amd.com>>
> *Cc:* amd-gfx at lists.freedesktop.org 
> <mailto:amd-gfx at lists.freedesktop.org>; Ding, Pixel 
> <Pixel.Ding at amd.com <mailto:Pixel.Ding at amd.com>>; Jiang, Jerry (SW) 
> <Jerry.Jiang at amd.com <mailto:Jerry.Jiang at amd.com>>; Li, Bingley 
> <Bingley.Li at amd.com <mailto:Bingley.Li at amd.com>>; Ramirez, Alejandro 
> <Alejandro.Ramirez at amd.com <mailto:Alejandro.Ramirez at amd.com>>; 
> Filipas, Mario <Mario.Filipas at amd.com <mailto:Mario.Filipas at amd.com>>
> *Subject:* Re: TDR and VRAM lost handling in KMD:
>
> See inline:
>
> Am 11.10.2017 um 07:33 schrieb Liu, Monk:
>
>     Hi Christian & Nicolai,
>
>     We need to achieve some agreements on what should MESA/UMD do and
>     what should KMD do, *please give your comments with â??okayâ?? or â??Noâ??
>     and your idea on below items,*
>
>     ?When a job timed out (set from lockup_timeout kernel parameter),
>     What KMD should do in TDR routine :
>
>     1.Update adev->*gpu_reset_counter*, and stop scheduler first,
>     (*gpu_reset_counter* is used to force vm flush after GPU reset,
>     out of this threadâ??s scope so no more discussion on it)
>
> Okay.
>
>     2.Set its fence error status to â??*ETIME*â??,
>
> No, as I already explained ETIME is for synchronous operation.
>
> In other words when we return ETIME from the wait IOCTL it would mean 
> that the waiting has somehow timed out, but not the job we waited for.
>
> Please use ECANCELED as well or some other error code when we find 
> that we need to distinct the timedout job from the canceled ones 
> (probably a good idea, but I'm not sure).
>
>     3.Find the entity/ctx behind this job, and set this ctx as â??*guilty*â??
>
> Not sure. Do we want to set the whole context as guilty or just the 
> entity?
>
> Setting the whole contexts as guilty sounds racy to me.
>
> BTW: We should use a different name than "guilty", maybe just "bool 
> canceled;" ?
>
>     4.Kick out this job from schedulerâ??s mirror list, so this job
>     wonâ??t get re-scheduled to ring anymore.
>
> Okay.
>
>     5.Kick out all jobs in this â??guiltyâ?? ctxâ??s KFIFO queue, and set
>     all their fence status to â??*ECANCELED*â??
>
> Setting ECANCELED should be ok. But I think we should do this when we 
> try to run the jobs and not during GPU reset.
>
>     6.Force signal all fences that get kicked out by above two
>     steps,*otherwise UMD will block forever if waiting on those fences*
>
> Okay.
>
>     7.Do gpu reset, which is can be some callbacks to let bare-metal
>     and SR-IOV implement with their favor style
>
> Okay.
>
>     8.After reset, KMD need to aware if the VRAM lost happens or not,
>     bare-metal can implement some function to judge, while for SR-IOV
>     I prefer to read it from GIM side (for initial version we consider
>     itâ??s always VRAM lost, till GIM side change aligned)
>
> Okay.
>
>     9.If VRAM lost not hit, continue, otherwise:
>
>     a)Update adev->*vram_lost_counter*,
>
> Okay.
>
>     b)Iterate over all living ctx, and set all ctx as â??*guilty*â?? since
>     VRAM lost actually ruins all VRAM contents
>
> No, that shouldn't be done by comparing the counters. Iterating over 
> all contexts is way to much overhead.
>
>     c)Kick out all jobs in all ctxâ??s KFIFO queue, and set all their
>     fence status to â??*ECANCELDED*â??
>
> Yes and no, that should be done when we try to run the jobs and not 
> during GPU reset.
>
>     10.Do GTT recovery and VRAM page tables/entries recovery
>     (optional, do we need it ???)
>
> Yes, that is still needed. As Nicolai explained we can't be sure that 
> VRAM is still 100% correct even when it isn't cleared.
>
>     11.Re-schedule all JOBs remains in mirror list to ring again and
>     restart scheduler (for VRAM lost case, no JOB will re-scheduled)
>
> Okay.
>
>     ?For cs_wait() IOCTL:
>
>     After it found fence signaled, it should check with
>     *â??dma_fence_get_statusâ?? *to see if there is error there,
>
>     And return the error status of fence
>
> Yes and no, dma_fence_get_status() is some specific handling for 
> sync_file debugging (no idea why that made it into the common fence code).
>
> It was replaced by putting the error code directly into the fence, so 
> just reading that one after waiting should be ok.
>
> Maybe we should fix dma_fence_get_status() to do the right thing for this?
>
>     ?For cs_wait_fences() IOCTL:
>
>     Similar with above approach
>
>     ?For cs_submit() IOCTL:
>
>     It need to check if current ctx been marked as â??*guilty*â?? and
>     return â??*ECANCELED*â?? if so
>
>     ?Introduce a new IOCTL to let UMD query *vram_lost_counter*:
>
>     This way, UMD can also block app from submitting, like @Nicolai
>     mentioned, we can cache one copy of *vram_lost_counter* when
>     enumerate physical device, and deny all
>
>     gl-context from submitting if the counter queried bigger than that
>     one cached in physical device. (looks a little overkill to me, but
>     easy to implement )
>
>     UMD can also return error to APP when creating gl-context if found
>     current queried*vram_lost_counter *bigger than that one cached in
>     physical device.
>
> Okay. Already have a patch for this, please review that one if you 
> haven't already done so.
>
> Regards,
> Christian.
>
>     BTW: I realized that gl-context is a little different with
>     kernelâ??s context. Because for kernel. BO is not related with
>     context but only with FD, while in UMD, BO have a backend
>
>     gl-context, so block submitting in UMD layer is also needed
>     although KMD will do its job as bottom line
>
>     ?Basically â??vram_lost_counterâ?? is exposure by kernel to let UMD
>     take the control of robust extension feature, it will be UMDâ??s
>     call to move, KMD only deny â??guiltyâ?? context from submitting
>
>     Need your feedback, thx
>
>     Weâ??d better make TDR feature landed ASAP
>
>     BR Monk
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20171011/c674e13b/attachment-0001.html>